DZone Spotlight

Friday, July 3

Dead Letter Queue Patterns in Apache Flink: Handling Poison Messages Without Stopping Your Stream

By Rohit Muthyala

Streaming systems usually fail in one of two ways:

- Loudly, when infrastructure breaks
- Quietly, when one bad record keeps replaying until the pipeline is effectively dead

The second failure mode is more dangerous because it often starts with something small: malformed JSON, an unexpected schema change, a missing required field, or a downstream timeout that was never handled correctly.

In Apache Flink, one unhandled exception can trigger a restart. If the same poison message is still sitting in Kafka after recovery, the job reads it again, fails again, restarts again, and enters a loop. At that point, the pipeline is technically "recovering," but operationally it is down.

This is exactly why production Flink jobs need a Dead Letter Queue (DLQ) strategy from day one. A proper DLQ pattern does three things:

- Isolates bad records so they do not stop good ones
- Captures enough failure context to debug the issue later
- Preserves replayability so quarantined records can be reprocessed after the root cause is fixed

Anything less is not really a DLQ. It is either silent data loss or delayed outage.

In this article, I will walk through the most practical DLQ patterns for Apache Flink 1.18:

- Side outputs as the core DLQ primitive
- Retry with exponential backoff for transient failures
- Tiered DLQ routing by error class
- Kafka and S3 sink patterns
- Metrics and alerting
- Replay with a dedicated reprocessing job
- A PyFlink version of the side output pattern

The goal is simple: a bad message should never silently disappear, and it should never silently stop the stream.

Why Poison Messages Break Otherwise Healthy Pipelines

A poison message is any record that consistently fails processing.
Typical examples include:

- Malformed JSON
- Incompatible schema versions
- Missing required fields
- Invalid business values
- Records that trigger unexpected code paths
- Messages that repeatedly fail downstream enrichment calls

Without DLQ handling, the failure path usually looks like this:

1. The record enters the pipeline
2. Deserialization or validation throws an exception
3. The operator fails
4. Flink restarts from the last checkpoint
5. The same record is consumed again
6. The same exception happens again

That loop can continue indefinitely. The result is predictable:

- Throughput drops to zero
- Downstream consumers starve
- Checkpoint recovery does not help
- On-call engineers get paged for a problem caused by one record

This is why DLQ handling is not just an error-handling convenience. It is a core reliability pattern.

What a DLQ Should Look Like in Flink

In a streaming architecture, a DLQ is a durable destination for records that could not be processed successfully. For Flink, that means the DLQ record should usually include:

- Raw payload
- Error type
- Error message
- Stack trace or summarized failure context
- Failure timestamp
- Source metadata such as topic, partition, or offset when available

That information matters because a DLQ is only useful if someone can answer two questions later:

- Why did this record fail?
- How do I replay it safely once the issue is fixed?

If you only log the exception, you lose replayability. If you only store the payload, you lose debugging context. If you drop the record entirely, you lose both.

So the design target is not "catch exceptions." The design target is durable, observable, replayable failure handling.

Pattern 1: Use Side Outputs as the Core DLQ Primitive

The most natural DLQ mechanism in Flink is the side output. A side output allows one operator to emit records to multiple streams:

- The main stream for successful records
- One or more side streams for failures, late data, or quarantined records

That makes it the right primitive for DLQ routing.
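As a language-neutral illustration of the idea, independent of Flink's API, the sketch below splits a batch of raw records into a main list and a dead-letter list, wrapping each failure in an envelope that carries the fields listed above. The names `DeadLetter` and `route_records`, and the `entity_id` check, are assumptions made for this example only.

```python
import json
import time
import traceback
from dataclasses import dataclass


@dataclass(frozen=True)
class DeadLetter:
    """Hypothetical DLQ envelope: the payload plus failure context."""
    raw_payload: str
    error_type: str
    error_message: str
    stack_trace: str
    failed_at_epoch_ms: int


def route_records(raw_records):
    """Split records into (good, dead_letters) instead of raising on bad input."""
    good, dead_letters = [], []
    for raw in raw_records:
        try:
            event = json.loads(raw)
            if not isinstance(event, dict) or not event.get("entity_id"):
                raise ValueError("entity_id is required")
            good.append(event)
        except Exception as exc:
            # Quarantine the record with its context; keep processing the rest.
            dead_letters.append(DeadLetter(
                raw_payload=raw,
                error_type=type(exc).__name__,
                error_message=str(exc),
                stack_trace=traceback.format_exc(),
                failed_at_epoch_ms=int(time.time() * 1000),
            ))
    return good, dead_letters


good, dead = route_records(['{"entity_id": "a1"}', '{not json', '{"name": "x"}'])
print(len(good), [d.error_type for d in dead])
```

One valid record reaches the main output while the two bad ones are captured with their original payload, which is what keeps them replayable later.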
Define the DLQ Envelope and Output Tag

Java

    import com.fasterxml.jackson.core.JsonProcessingException;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.ProcessFunction;
    import org.apache.flink.util.Collector;
    import org.apache.flink.util.OutputTag;

    // The anonymous subclass ({}) lets Flink capture the generic type at runtime.
    public static final OutputTag<DeadLetterRecord> DLQ_TAG =
        new OutputTag<DeadLetterRecord>("dead-letter-queue") {};

    // Assumes your Flink version can serialize Java records; otherwise use an equivalent POJO.
    public record DeadLetterRecord(
        String rawPayload,
        String errorType,
        String errorMessage,
        String stackTrace,
        long failedAtEpochMs,
        String sourceTopicPartition,
        long sourceOffset
    ) {}

The important point here is that the DLQ record is not just the failed payload. It is an envelope that preserves enough context for triage and replay.

Route Failures Inside a ProcessFunction

Java

    public class EntityEventProcessor extends ProcessFunction<String, EntityEvent> {

        // Created once per task in open() rather than shipped with the function.
        private transient ObjectMapper objectMapper;

        @Override
        public void open(Configuration parameters) {
            objectMapper = new ObjectMapper();
        }

        @Override
        public void processElement(
                String rawMessage, Context ctx, Collector<EntityEvent> out) {
            try {
                EntityEvent event = parseAndValidate(rawMessage);
                out.collect(event);
            } catch (JsonProcessingException e) {
                ctx.output(DLQ_TAG, toDeadLetter(rawMessage, "JSON_PARSE_FAILURE", e));
            } catch (SchemaValidationException e) {
                ctx.output(DLQ_TAG, toDeadLetter(rawMessage, "SCHEMA_VALIDATION_FAILURE", e));
            } catch (Exception e) {
                ctx.output(DLQ_TAG, toDeadLetter(rawMessage, "UNKNOWN_FAILURE", e));
            }
        }

        private DeadLetterRecord toDeadLetter(String raw, String errorType, Exception e) {
            // Topic, partition, and offset are not available from a plain String stream.
            // Carry them through a custom KafkaRecordDeserializationSchema if you need them.
            return new DeadLetterRecord(
                raw,
                errorType,
                e.getMessage(),
                getStackTrace(e),   // helper that renders the stack trace as a String (not shown)
                System.currentTimeMillis(),
                "unknown",
                -1L);
        }

        // SchemaValidationException is an application-defined checked exception.
        private EntityEvent parseAndValidate(String raw)
                throws JsonProcessingException, SchemaValidationException {
            EntityEvent event = objectMapper.readValue(raw, EntityEvent.class);
            if (event.entityId() == null || event.entityId().isBlank()) {
                throw new SchemaValidationException("entityId is required");
            }
            if (event.timestamp() <= 0) {
                throw new SchemaValidationException("timestamp must be positive");
            }
            return event;
        }
    }

This is the minimum viable DLQ pattern, and it already solves the most important operational problem: bad records no longer stop good ones.

Wire the Main Stream and DLQ Stream

Java

    StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();

    DataStream<String> kafkaSource = env.fromSource(
        buildKafkaSource(),
        WatermarkStrategy.noWatermarks(),
        "entity-events-source");

    SingleOutputStreamOperator<EntityEvent> processed =
        kafkaSource.process(new EntityEventProcessor());

    DataStream<EntityEvent> goodEvents = processed;
    DataStream<DeadLetterRecord> deadLetters = processed.getSideOutput(DLQ_TAG);

    // The source and sink builders are helper methods; Pattern 4 shows a Kafka DLQ sink.
    goodEvents.sinkTo(buildDownstreamKafkaSink());
    deadLetters.sinkTo(buildDlqKafkaSink("dlq.entity-events")); // illustrative topic name

    env.execute("Entity Resolution Pipeline");

If you do nothing else, do this. Side outputs should be the default DLQ foundation in Flink.

Pattern 2: Retry Transient Failures Before Escalating to DLQ

Not every failure belongs in the DLQ immediately. Some failures are transient:

- A downstream service is temporarily unavailable
- A database call times out
- An external API is rate-limited
- A network dependency is briefly unstable

If you send all of those directly to the DLQ, you create noise and bury the truly bad records.
The better pattern is:

- Retry transient failures a limited number of times
- Use exponential backoff
- Escalate to DLQ only after retries are exhausted

Retry With KeyedProcessFunction and Timers

Java

    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
    import org.apache.flink.util.Collector;

    public class RetryingEnrichmentProcessor
            extends KeyedProcessFunction<String, EntityEvent, EnrichedEvent> {

        private static final int MAX_RETRIES = 3;
        private static final long BASE_BACKOFF_MS = 500L;

        private transient ValueState<Integer> retryCountState;
        private transient ValueState<EntityEvent> pendingEventState;

        @Override
        public void open(Configuration parameters) {
            retryCountState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("retry-count", Integer.class));
            pendingEventState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("pending-event", EntityEvent.class));
        }

        @Override
        public void processElement(
                EntityEvent event, Context ctx, Collector<EnrichedEvent> out) throws Exception {
            // State holds one pending event per key. If several events for the same key
            // can be in flight at once, keep them in ListState instead.
            attempt(event, ctx, out);
        }

        @Override
        public void onTimer(
                long timestamp, OnTimerContext ctx, Collector<EnrichedEvent> out) throws Exception {
            EntityEvent pending = pendingEventState.value();
            if (pending != null) {
                attempt(pending, ctx, out);
            }
        }

        // Shared by the first attempt and by timer-driven retries, so a poison failure
        // during a retry is routed to the DLQ instead of failing the job.
        private void attempt(
                EntityEvent event, Context ctx, Collector<EnrichedEvent> out) throws Exception {
            try {
                EnrichedEvent enriched = callEnrichmentService(event);
                clearRetryState();
                out.collect(enriched);
            } catch (TransientServiceException e) {
                Integer stored = retryCountState.value();
                int retries = stored == null ? 0 : stored;
                if (retries >= MAX_RETRIES) {
                    clearRetryState();
                    ctx.output(DLQ_TAG, toDeadLetter(event, ctx, "MAX_RETRIES_EXCEEDED",
                        "Failed after " + MAX_RETRIES + " retries: " + e.getMessage(), e));
                } else {
                    retryCountState.update(retries + 1);
                    pendingEventState.update(event);
                    long backoffMs = BASE_BACKOFF_MS * (1L << retries); // 500, 1000, 2000 ms
                    ctx.timerService().registerProcessingTimeTimer(
                        ctx.timerService().currentProcessingTime() + backoffMs);
                }
            } catch (PoisonMessageException e) {
                clearRetryState();
                ctx.output(DLQ_TAG,
                    toDeadLetter(event, ctx, "POISON_MESSAGE", e.getMessage(), e));
            }
        }

        private void clearRetryState() {
            retryCountState.clear();
            pendingEventState.clear();
        }

        private DeadLetterRecord toDeadLetter(
                EntityEvent event, Context ctx, String errorType, String message, Exception e) {
            // event.toString() is not the original payload; serialize the event to JSON
            // here if you need byte-for-byte replay. The key stands in for source metadata.
            return new DeadLetterRecord(
                event.toString(), errorType, message, getStackTrace(e),
                System.currentTimeMillis(), ctx.getCurrentKey(), -1L);
        }
    }

Why This Works Especially Well in Flink

This pattern is stronger in Flink than in many other stream processors because timers and state are checkpointed. That means:

- Retry counters survive restarts
- Pending events survive restarts
- Scheduled retries resume after recovery

In other words, the retry workflow itself is fault-tolerant.
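To see what the backoff arithmetic in the retry operator produces, the schedule can be worked out in a few lines. This is a sketch in plain Python that mirrors the `BASE_BACKOFF_MS * 2^retries` expression and `MAX_RETRIES = 3` from the Java example; the function name `backoff_schedule_ms` is made up for illustration.

```python
def backoff_schedule_ms(base_ms=500, max_retries=3):
    """Delay before each retry: base * 2^attempt, for attempt 0..max_retries-1."""
    return [base_ms * (2 ** attempt) for attempt in range(max_retries)]


schedule = backoff_schedule_ms()
print(schedule)       # [500, 1000, 2000]
print(sum(schedule))  # 3500
```

With these defaults, a record that keeps failing waits about 3.5 seconds in total, plus the duration of each call, before it is escalated to the DLQ.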
A checkpointed retry workflow is exactly what you want when handling transient failures in a long-running stream.

Pattern 3: Split the DLQ by Failure Type

Once a pipeline matures, a single DLQ topic usually becomes too coarse. Schema failures, business validation failures, exhausted retries, and unknown exceptions all end up mixed together. That makes triage slower and replay harder.

A better pattern is to classify failures and route them to separate DLQ streams.

Define Failure Tiers

Java

    public enum DlqTier {
        TRANSIENT_EXHAUSTED,
        SCHEMA_INVALID,
        BUSINESS_RULE,
        UNKNOWN
    }

Route by Exception Class

Java

    public class TieredDlqRouter extends ProcessFunction<String, EntityEvent> {

        @Override
        public void processElement(
                String raw, Context ctx, Collector<EntityEvent> out) {
            try {
                EntityEvent event = parse(raw);   // parse() and validate() are helpers (not shown)
                validate(event);
                out.collect(event);
            } catch (JsonParseException | JsonMappingException e) {
                route(ctx, raw, DlqTier.SCHEMA_INVALID, e);
            } catch (BusinessValidationException e) {
                route(ctx, raw, DlqTier.BUSINESS_RULE, e);
            } catch (Exception e) {
                route(ctx, raw, DlqTier.UNKNOWN, e);
            }
        }

        private void route(Context ctx, String raw, DlqTier tier, Exception e) {
            ctx.output(getTierTag(tier), new DeadLetterRecord(
                raw,
                tier.name(),
                e.getMessage(),
                getStackTrace(e),
                System.currentTimeMillis(),
                "",
                -1L));
        }

        private static OutputTag<DeadLetterRecord> getTierTag(DlqTier tier) {
            switch (tier) {
                case SCHEMA_INVALID: return DLQ_SCHEMA;
                case BUSINESS_RULE:  return DLQ_BUSINESS;
                // TRANSIENT_EXHAUSTED comes from the retry operator in Pattern 2,
                // not from this router, so it is not mapped separately here.
                default:             return DLQ_UNKNOWN;
            }
        }
    }

Define One Output Tag Per Tier

Java

    // Explicit type arguments keep the tag's type information unambiguous.
    public static final OutputTag<DeadLetterRecord> DLQ_SCHEMA =
        new OutputTag<DeadLetterRecord>("dlq-schema-invalid") {};

    public static final OutputTag<DeadLetterRecord> DLQ_BUSINESS =
        new OutputTag<DeadLetterRecord>("dlq-business-rule") {};

    public static final OutputTag<DeadLetterRecord> DLQ_UNKNOWN =
        new OutputTag<DeadLetterRecord>("dlq-unknown") {};

Sink Each Tier Independently

Java

    SingleOutputStreamOperator<EntityEvent> processed =
        kafkaSource.process(new TieredDlqRouter());

    processed.getSideOutput(DLQ_SCHEMA)
        .sinkTo(buildDlqKafkaSink("dlq.schema-invalid"));

    processed.getSideOutput(DLQ_BUSINESS)
        .sinkTo(buildDlqKafkaSink("dlq.business-rule"));

    processed.getSideOutput(DLQ_UNKNOWN)
        .sinkTo(buildDlqKafkaSink("dlq.unknown"));

This makes the DLQ operationally useful instead of just technically correct. For example:

- Schema failures can be routed to the producer team
- Business rule failures can feed data quality workflows
- Unknown failures can trigger higher-severity alerting

Pattern 4: Choose DLQ Sinks Based on How You Plan To Recover

Once records are routed to a DLQ stream, they need a durable destination. In practice, the two most common choices are Kafka and object storage.

Kafka DLQ Sink

Kafka is the right choice when you want:

- Near-real-time inspection
- Streaming replay
- Operational integration with existing consumers

Java

    private static KafkaSink<DeadLetterRecord> buildDlqKafkaSink(String topicName) {
        return KafkaSink.<DeadLetterRecord>builder()
            .setBootstrapServers("kafka-broker:9092")
            .setRecordSerializer(
                KafkaRecordSerializationSchema.<DeadLetterRecord>builder()
                    .setTopic(topicName)
                    // JsonSerializationSchema comes from flink-json.
                    .setValueSerializationSchema(new JsonSerializationSchema<DeadLetterRecord>())
                    // Keying by error type keeps records of one type in one partition.
                    .setKeySerializationSchema(
                        (DeadLetterRecord record) ->
                            record.errorType().getBytes(StandardCharsets.UTF_8))
                    .build())
            .setDeliveryGuarantee(DeliveryGuarantee.AT_LEAST_ONCE)
            .build();
    }

S3 DLQ Sink

Object storage is the better choice when you want:

- Long retention
- Low-cost quarantine
- Batch replay with Spark or Athena
- Partitioned storage by date or error type

Java

    private static FileSink<DeadLetterRecord> buildS3DlqSink() {
        return FileSink
            .forRowFormat(
                new Path("s3://your-bucket/dlq/entity-resolution/"),
                // JsonRowEncoder is an application-defined Encoder<DeadLetterRecord>
                // that writes one JSON document per line; it is not part of Flink.
                new JsonRowEncoder<>(DeadLetterRecord.class))
            .withRollingPolicy(
                DefaultRollingPolicy.builder()
                    .withRolloverInterval(Duration.ofMinutes(15))
                    .withInactivityInterval(Duration.ofMinutes(5))
                    .withMaxPartSize(MemorySize.ofMebiBytes(128))
                    .build())
            .withBucketAssigner(
                // Literal text in a date pattern must be quoted. The error type is fixed
                // here; partitioning by the record's own error type needs a custom BucketAssigner.
                new DateTimeBucketAssigner<>(
                    "'error-type=unknown/year='yyyy'/month='MM'/day='dd'/hour='HH"))
            .build();
    }

A practical production pattern is to use:

- Kafka for short-term operational handling
- S3 for long-term quarantine and replay

That gives you both
fast response and durable history.

Pattern 5: Monitor DLQ Rate, Not Just Job Uptime

A DLQ that nobody watches is just a backlog with better branding.

Job uptime alone is not enough. A Flink job can stay green while quietly routing 10% of traffic to the DLQ. That is still a production incident.

Add Metrics Inside the Operator

Java

    import org.apache.flink.metrics.Counter;
    import org.apache.flink.metrics.Histogram;
    import org.apache.flink.metrics.MetricGroup;
    import org.apache.flink.runtime.metrics.DescriptiveStatisticsHistogram;

    public class MonitoredEntityEventProcessor extends ProcessFunction<String, EntityEvent> {

        private transient Counter dlqCounter;
        private transient Counter successCounter;
        private transient Histogram processingLatency;

        @Override
        public void open(Configuration parameters) {
            MetricGroup metrics = getRuntimeContext()
                .getMetricGroup()
                .addGroup("entity_resolution");

            dlqCounter = metrics.counter("dlq_routed_total");
            successCounter = metrics.counter("processed_success_total");
            // Keeps a sliding window of the last 1000 samples.
            processingLatency = metrics.histogram(
                "processing_latency_ms",
                new DescriptiveStatisticsHistogram(1000));
        }

        @Override
        public void processElement(
                String raw, Context ctx, Collector<EntityEvent> out) {
            long start = System.currentTimeMillis();
            try {
                EntityEvent event = parseAndValidate(raw);
                successCounter.inc();
                out.collect(event);
            } catch (Exception e) {
                dlqCounter.inc();
                // buildDeadLetter() wraps the payload and exception in a DeadLetterRecord (not shown).
                ctx.output(DLQ_TAG, buildDeadLetter(raw, e));
            } finally {
                processingLatency.update(System.currentTimeMillis() - start);
            }
        }
    }

Alert on DLQ Rate

A useful alert is DLQ throughput relative to successful throughput:

YAML

    - alert: FlinkDlqRateHigh
      expr: |
        sum(rate(flink_entity_resolution_dlq_routed_total[5m]))
          /
        sum(rate(flink_entity_resolution_processed_success_total[5m]))
          > 0.01
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: "DLQ rate exceeds 1% of successful throughput"
        description: "Check dlq.unknown Kafka topic for upstream schema changes"

The metric names in the rule are shortened for readability. The names your metrics reporter actually exports depend on its scope configuration, so check them before copying the expression. Summing across subtasks gives a job-wide ratio rather than one ratio per parallel instance.

As a rule of thumb:

- above 1% often indicates schema drift or producer issues
- above 5% usually indicates a broader systemic problem

The exact thresholds depend on the pipeline, but the principle does not: monitor DLQ rate as a first-class health signal.
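The ratio behind the alert is simple arithmetic. As a rough sketch, assuming you already have DLQ and success counts for the same time window, the rule-of-thumb thresholds above could be applied like this. The function name and the returned labels are hypothetical.

```python
def classify_dlq_rate(dlq_count, success_count):
    """Return (ratio, label) for DLQ throughput relative to successful throughput."""
    if success_count == 0:
        # Nothing succeeded: any DLQ traffic at all is the worst case.
        ratio = float("inf") if dlq_count > 0 else 0.0
    else:
        ratio = dlq_count / success_count
    if ratio > 0.05:
        label = "systemic"      # above 5%: broader systemic problem
    elif ratio > 0.01:
        label = "investigate"   # above 1%: likely schema drift or producer issue
    else:
        label = "healthy"
    return ratio, label


print(classify_dlq_rate(3, 10_000))
print(classify_dlq_rate(250, 10_000))
print(classify_dlq_rate(900, 10_000))
```

The zero-success branch matters in practice: a job that routes everything to the DLQ has no successful throughput to divide by, and that case should alert rather than be skipped.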
Pattern 6: Replay With a Dedicated Reprocessing Job

A DLQ is only complete when replay is possible.

The cleanest design is a separate Flink job that reads from the DLQ topic and routes records back through the main processing logic.

Example Replay Job

Java

    public class DlqReprocessingJob {

        public static void main(String[] args) throws Exception {
            // Replay window in epoch milliseconds, passed as program arguments.
            final long startEpoch = Long.parseLong(args[0]);
            final long endEpoch = Long.parseLong(args[1]);

            StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

            // buildKafkaSource(topic) is a helper that deserializes DeadLetterRecord (not shown).
            DataStream<DeadLetterRecord> dlqStream = env.fromSource(
                buildKafkaSource("dlq.schema-invalid"),
                WatermarkStrategy.noWatermarks(),
                "dlq-source");

            // Replay only the records that failed inside the chosen window.
            DataStream<String> replayStream = dlqStream
                .filter(r -> r.failedAtEpochMs() >= startEpoch
                          && r.failedAtEpochMs() <= endEpoch)
                .map(DeadLetterRecord::rawPayload);

            SingleOutputStreamOperator<EntityEvent> reprocessed =
                replayStream.process(new EntityEventProcessor());

            reprocessed.sinkTo(buildDownstreamKafkaSink());

            // Records that fail again go to a separate topic so they are not replayed forever.
            reprocessed.getSideOutput(DLQ_TAG)
                .sinkTo(buildDlqKafkaSink("dlq.permanent-quarantine"));

            env.execute("DLQ Reprocessing Job");
        }
    }

Why Replay Should Be a Separate Job

Keeping replay separate from the main pipeline gives you:

- Independent scaling
- Independent scheduling
- Cleaner checkpoint behavior
- Safer operational control

It also lets you drain backlogs on your own terms:

- Off-peak hours
- Reduced parallelism
- Or maximum parallelism when you need to catch up quickly

That separation keeps the main pipeline stable while still making recovery practical.

PyFlink Version: Same Pattern, Same Principle

If your team uses PyFlink, the same side output pattern applies.
Python

    import time

    from pyflink.common import Row
    from pyflink.common.typeinfo import Types
    from pyflink.datastream import StreamExecutionEnvironment
    from pyflink.datastream.functions import ProcessFunction
    from pyflink.datastream.output_tag import OutputTag

    DLQ_TAG = OutputTag(
        "dead-letter-queue",
        Types.ROW_NAMED(
            ["raw_payload", "error_type", "error_message", "failed_at_ms"],
            [Types.STRING(), Types.STRING(), Types.STRING(), Types.LONG()]
        )
    )


    class EntityEventProcessor(ProcessFunction):

        def process_element(self, value, ctx):
            try:
                # parse_and_validate() is an application helper (not shown).
                event = parse_and_validate(value)
                yield event
            except Exception as e:
                # Yielding (tag, value) sends the record to the side output.
                yield DLQ_TAG, Row(
                    raw_payload=str(value),
                    error_type=type(e).__name__,
                    error_message=str(e),
                    failed_at_ms=int(time.time() * 1000)
                )


    env = StreamExecutionEnvironment.get_execution_environment()
    source_stream = env.from_source(...)

    processed = source_stream.process(
        EntityEventProcessor(),
        output_type=Types.STRING()
    )

    good_events = processed
    dead_letters = processed.get_side_output(DLQ_TAG)

    # The sink builders are application helpers (not shown).
    good_events.sink_to(build_downstream_sink())
    dead_letters.sink_to(build_dlq_sink())

    env.execute("Entity Resolution Pipeline")

The syntax changes, but the design principle stays the same: good records continue, bad records are isolated and persisted.

Production Checklist

Before shipping a Flink pipeline, verify the following:

Requirement | Why It Matters
Risky operators wrapped in try/catch | Prevents restart loops from unhandled exceptions
DLQ output tags use explicit typing | Avoids runtime serialization failures
DLQ sink is durable | Failed records must survive restarts
DLQ metrics are exported | Silent DLQ growth is otherwise invisible
Replay path exists and is tested | A DLQ without replay is just storage
DLQ retention is long enough | Teams need time to diagnose and replay
Permanent quarantine exists | Prevents infinite replay loops
Alerting is based on DLQ rate | Job health alone is not enough

This checklist is worth automating in code review or deployment readiness checks.
DLQ handling is too important to leave to convention.

Key Takeaways

If you are building Flink pipelines in production, the safest default is:

- Use side outputs for DLQ routing
- Retry transient failures before escalation
- Classify failures into separate DLQ streams
- Sink DLQ records durably
- Export DLQ metrics
- Replay through a dedicated job

The core rule is simple: A bad message should never silently disappear, and it should never silently stop the stream.

That is what turns DLQ handling from a defensive coding trick into a real reliability pattern.

Environment Notes

The examples in this article target:

- Apache Flink 1.18
- Java 17
- PyFlink 1.18

A few implementation notes:

- The retry timer pattern requires a keyed stream before KeyedProcessFunction
- RocksDB is usually the safer state backend for larger retry state
- HashMap state backend can work well for smaller, latency-sensitive workloads
- AT_LEAST_ONCE is usually sufficient for DLQ sinks

Final Thoughts

Poison messages are not rare in streaming systems. They are inevitable.

The real question is whether one bad record can take down an otherwise healthy pipeline. With the right DLQ design in Flink, the answer becomes no.

The stream keeps moving. Good records continue. Bad records are quarantined. Alerts fire. Replay remains possible. And the pipeline stays operational while the root cause is fixed.

That is the difference between a stream that works in staging and one that survives production.

Beyond Root Cause: Building Effective Blameless Postmortems for Cloud-Native Systems

By Akshay Pratinav

Production incidents are inevitable. No matter how much testing, automation, observability, or resilience engineering an organization invests in, complex distributed systems will eventually fail in unexpected ways. The real differentiator between high-performing engineering organizations and everyone else is not whether incidents occur; it is how effectively organizations learn from them.

Unfortunately, many root cause analysis (RCA) processes fail to achieve this objective. Instead of uncovering systemic weaknesses, they often focus on identifying a single mistake, a specific engineer, or a single technical failure. The resulting report may satisfy a compliance requirement, but it rarely produces meaningful improvements in reliability.

As cloud-native architectures become increasingly distributed and interconnected, organizations must evolve beyond traditional RCA practices and adopt blameless postmortems that focus on organizational learning and continuous improvement.

The Traditional RCA Trap

Most incident investigations begin with a simple question: "What caused the outage?"

At first glance, this seems reasonable. However, the question itself often leads teams toward finding a single root cause. Common conclusions include:

- An engineer deployed an incorrect configuration.
- A database migration introduced an error.
- An operator executed the wrong command.
- A monitoring alert was ignored.
- A service exceeded capacity limits.

While these statements may be factually correct, they often represent only the final event in a much larger chain of failures.

Consider a scenario where a configuration change causes a critical service outage. A traditional RCA might conclude:

The outage occurred because an engineer deployed an invalid configuration file.
While technically true, this explanation leaves many important questions unanswered:

- Why was the invalid configuration allowed into production?
- Why did automated validation fail to detect the issue?
- Why did monitoring not identify the problem immediately?
- Why was the blast radius so large?
- Why was rollback difficult?
- Why did recovery take longer than expected?

These questions often reveal the real opportunities for improvement.

Modern Incidents Rarely Have a Single Root Cause

One of the most important lessons from operating distributed systems is that incidents are almost never caused by a single failure. Modern cloud environments contain thousands of interacting components: microservices, APIs, databases, service meshes, Kubernetes clusters, CI/CD pipelines, infrastructure automation, and third-party dependencies.

A seemingly simple outage often emerges from a combination of factors. For example:

Contributing Factor | Impact
Incomplete testing | Allowed faulty configuration
Missing safeguards | Failed to block deployment
Weak observability | Delayed detection
Documentation gaps | Slowed troubleshooting
Complex architecture | Increased blast radius
Manual recovery process | Extended outage duration

No single factor caused the outage. Rather, the outage occurred because multiple layers of defense failed simultaneously. This is why mature organizations increasingly focus on contributing causes rather than searching for a single root cause.

What Does "Blameless" Actually Mean?

One of the most misunderstood concepts in incident management is the idea of a blameless postmortem. Some teams incorrectly assume that blameless means avoiding accountability. It does not.

Blameless means recognizing that engineers make decisions based on the information available to them at a given moment. During an active incident:

- Information is incomplete.
- Time pressure is high.
- Monitoring signals may be conflicting.
- Customer impact is increasing.
- Stress levels are elevated.
The objective of a postmortem is therefore not to judge whether an individual made a perfect decision. The objective is to understand:

- Why the decision seemed reasonable at the time.
- What information was available.
- What information was missing.
- What systemic conditions contributed to the outcome.

When teams focus on learning instead of blame, they become far more willing to share details openly and honestly.

Anatomy of an Effective Postmortem

High-quality postmortems typically follow a structured approach.

1. Incident Summary

Begin with a concise overview:

- What happened?
- When did it occur?
- How long did it last?
- Who was affected?
- What was the business impact?

Example: "On March 12, Service X experienced elevated latency following a configuration deployment. Approximately 15% of customer requests failed for 42 minutes before service was fully restored."

2. Timeline Reconstruction

The timeline is often the most valuable section of a postmortem. Document key events chronologically:

Time | Event
09:00 | Deployment initiated
09:05 | Error rate increased
09:08 | Customer complaints received
09:12 | Incident declared
09:18 | Rollback initiated
09:25 | Error rate returned to normal
09:42 | Incident resolved

A detailed timeline helps teams understand exactly how events unfolded.

3. Contributing Factors Analysis

Rather than searching for a single root cause, identify all meaningful contributors. Examples include:

Technical Contributors

- Configuration validation gaps
- Capacity limitations
- Monitoring deficiencies
- Dependency failures
- Architectural constraints

Process Contributors

- Incomplete deployment reviews
- Missing runbooks
- Escalation delays
- Lack of disaster recovery testing

Organizational Contributors

- Knowledge silos
- Staffing limitations
- Unclear ownership boundaries
- Training gaps

The goal is to build a complete picture of the incident.

4. Recovery Assessment

Analyze the effectiveness of the response.
Questions worth asking:

- Was detection timely?
- Were alerts actionable?
- Was ownership clear?
- Did responders have the necessary tools?
- Were runbooks useful?
- Could recovery have been automated?

Many organizations discover that recovery challenges contribute more customer impact than the original failure itself.

The Five Whys: Useful But Limited

Many organizations use the "Five Whys" technique. Example:

1. Why did the outage occur? Because a configuration was invalid.
2. Why was it invalid? Because validation checks were incomplete.
3. Why were validation checks incomplete? Because a new deployment framework was introduced.
4. Why was the framework deployed without complete validation? Because release deadlines prioritized delivery.
5. Why were deadlines prioritized? Because organizational risk was underestimated.

The Five Whys can uncover valuable insights. However, distributed systems are rarely linear. Multiple parallel factors often contribute simultaneously. Treat the Five Whys as one investigative tool, not the entire analysis framework.

Turning Findings Into Action

A postmortem without action items is merely documentation. Every significant finding should produce a measurable improvement initiative. Examples include:

Finding | Action
Configuration errors reach production | Add automated validation
Detection delayed by 10 minutes | Improve alert coverage
Rollback requires manual intervention | Implement automated rollback
Troubleshooting knowledge unavailable | Create operational runbooks
Recovery depends on experts | Expand team training

Action items should be specific, assigned, prioritized, and trackable. Without ownership, lessons learned quickly become lessons forgotten.

Measuring Postmortem Effectiveness

Many organizations measure success by counting completed postmortems. A more meaningful approach is measuring operational improvement.
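One way to make operational improvement measurable is to derive detection and recovery times directly from incident timelines. The sketch below assumes that time to detect runs from the start of impact to the incident being declared, and time to recover from the start of impact to resolution; definitions vary between organizations. The first incident reuses times from the example timeline earlier (09:05, 09:12, 09:42), and the second is invented for illustration.

```python
from datetime import datetime
from statistics import mean


def minutes_between(start, end):
    """Minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60


# Hypothetical incidents: (impact_started, incident_declared, resolved).
incidents = [
    ("09:05", "09:12", "09:42"),
    ("14:30", "14:33", "14:51"),
]

mean_time_to_detect = mean(
    minutes_between(started, declared) for started, declared, _ in incidents)
mean_time_to_recover = mean(
    minutes_between(started, resolved) for started, _, resolved in incidents)

print(mean_time_to_detect, mean_time_to_recover)
```

Tracked quarter over quarter, these two averages show whether postmortem action items are actually shortening incidents.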
Consider tracking:

- Mean time to detect (MTTD)
- Mean time to recover (MTTR)
- Repeat incident frequency
- Automated recovery rate
- Manual intervention reduction
- Customer impact reduction

The ultimate goal is not producing better reports. The goal is producing more resilient systems.

The Future: AI-Assisted Incident Learning

As incident management platforms evolve, AI is beginning to transform postmortem creation. Modern systems can automatically:

- Build incident timelines
- Correlate alerts
- Summarize communication channels
- Extract remediation actions
- Identify recurring failure patterns
- Generate draft postmortems

This allows responders to spend less time gathering information and more time analyzing systemic weaknesses.

However, AI should augment human investigation, not replace it. Understanding organizational context, operational tradeoffs, and architectural decisions still requires human expertise.

Final Thoughts

The most valuable outcome of an incident is not service restoration. It is learning.

Organizations that focus solely on identifying who made a mistake often repeat the same failures. Organizations that focus on understanding how their systems allowed failures to occur continuously improve their resilience.

Blameless postmortems shift the conversation from "Who caused this incident?" to "What can we learn from this incident, and how can we make the system stronger?"

That mindset is ultimately what transforms incident management from a reactive operational function into a strategic capability that improves reliability, resilience, and engineering excellence over time.

From Pilot to Production: The Six Agent Patterns That Determine Whether Your AI Program Scales or Stalls

By BALAJI BARMAVAT

Refcard #291

Code Review Core Practices

By Vidyasagar (Sarath Chandra) Machupalli FBCS


Refcard #403

Shipping Production-Grade AI Agents

By Vidyasagar (Sarath Chandra) Machupalli FBCS


Multi-Agent Software Engineering: One Coding Agent Isn't Enough

Coding agents are good now. They can write a function, fix a failing test, or walk you through a chunk of legacy code you'd rather not read. That part is settled. The harder question is what happens when you hand one a real piece of delivery work, something that has to change the database and the API and the UI and the tests all together, and keeps running long after you've stepped away from your desk.

That's usually where a single agent starts to struggle, and it isn't because the model isn't smart enough. The limit is human attention. A team might have fifty things sitting in its backlog that an agent could help with, but somebody still has to scope each one, keep an eye on it, review what comes back, and confirm it actually works. So you can generate code far faster than before and still ship at about the same pace. The slow part just moved.

Long delivery work is a different animal from a quick coding task. It needs someone to hold the scope steady, keep the architecture consistent from one file to the next, make sure the tests check what the feature is meant to do rather than what the code happens to do, review the result, and hand off cleanly to whatever comes next. Ask one agent to carry all of that in a single context window across a long run, and it tends to drift.

You've probably watched it happen: it loses the plot halfway through, writes tests that pass only because they were shaped around the code it just produced, uses one pattern here and a different one three files over, rebuilds something that already existed, and then can't quite tell you what it finished and what it didn't. So you read every diff yourself. The agent writes code, and you're still doing the planning, reviewing, QA, and firefighting. There's a limit to how far that stretches.

From One Agent to a Team

A more workable setup is to stop giving one agent the whole job and split it the way a functioning team already does. One agent plans the work, another builds it, another checks it.
Three roles get you most of the way.

| Role | Responsibility |
| --- | --- |
| Orchestrator | Understands the goal, asks the clarifying questions, writes the plan, sets milestones, and decides how the work is sequenced. |
| Worker | Implements one feature from clean context and commits it in a controlled way. |
| Validator | Checks the implementation independently, runs the checks, verifies behavior, and flags follow-up work. |

Keeping the building and the checking in different hands matters for the same reason people review each other's code. Whoever wrote it is invested in it working, and that bias is hard to spot from the inside. A fresh agent that had no part in those decisions tends to catch what the author missed.

How Agents Coordinate

Underneath the roles, the agents end up talking to each other in a few recurring ways, and it helps to have names for them.

Delegation is the obvious one, and usually the first that teams build. An agent hands a scoped task to another and waits for the result.

Creator-verifier is the one that matters most for software. One agent writes the code and a separate one, working from its own context, checks it. That separation is what stops an agent from grading its own homework.

Direct communication lets agents talk without a coordinator in the middle. It's tempting and it's fragile, since state scatters across separate conversations and sooner or later somebody acts on something out of date.

Negotiation is what happens when agents share a resource, which for us usually means the codebase. Two agents about to edit the same file have to work out who does what before they overwrite each other.

Broadcast is one agent telling the rest about something that changed, like a new constraint or a failure everyone needs to know about. It's the least exciting of the five, and the one that quietly keeps the long run from falling out of sync.
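As a rough illustration of delegation combined with creator-verifier, the sketch below wires plain Python callables into the worker and validator roles. Everything in it (`Task`, `Verdict`, `run_feature`, and the stub worker and validator) is hypothetical and stands in for real agent calls; the only point is that the validator judges the artifact from its own inputs instead of trusting the worker's claim.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    feature: str
    acceptance: list[str]  # checks agreed on during planning (hypothetical)


@dataclass
class Verdict:
    passed: bool
    failures: list[str]


def run_feature(
    task: Task,
    worker: Callable[[Task, list[str]], set[str]],
    validator: Callable[[Task, set[str]], Verdict],
    max_rounds: int = 3,
) -> tuple[int, set[str]]:
    """Delegate to a worker, then let an independent validator decide."""
    feedback: list[str] = []
    for round_no in range(1, max_rounds + 1):
        artifact = worker(task, feedback)      # creator
        verdict = validator(task, artifact)    # verifier, separate context
        if verdict.passed:
            return round_no, artifact
        feedback = verdict.failures            # scoped follow-up work
    raise RuntimeError(f"{task.feature}: not validated after {max_rounds} rounds")


def stub_worker(task: Task, feedback: list[str]) -> set[str]:
    # Stand-in for an agent: does the first check, plus whatever was flagged.
    return set(task.acceptance[:1] + feedback)


def stub_validator(task: Task, artifact: set[str]) -> Verdict:
    # Checks every acceptance item itself rather than asking the worker.
    failures = [item for item in task.acceptance if item not in artifact]
    return Verdict(passed=not failures, failures=failures)


task = Task("login", ["form renders", "submit works"])
rounds, artifact = run_feature(task, stub_worker, stub_validator)
print(rounds, sorted(artifact))
```

In this toy run the first attempt fails validation and the second passes, which mirrors the loop the roles are meant to create.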
Define "Done" Before Any Code Gets Written Settling what "correct" means before anyone writes code does more for reliability than any amount of prompt tuning. It heads off a specific and very common failure. An agent builds a feature, then writes tests that wrap neatly around the feature it just built. Everything passes, coverage looks healthy, and none of it tells you whether the feature does what was actually asked for. Tests written after the code mostly confirm whatever the code already does. They don't find the bugs. A validation contract flips that order. During planning, before there's any code, you write down what the feature has to do: the behavior that has to exist, the edge cases that matter, the flows that have to work, the regressions you can't allow. A small change might need a handful of those. A big feature can need hundreds, spread across the backend, the API, the front end, and the full end-to-end paths. Each one gets tied to a feature, and a feature isn't finished until it satisfies the ones assigned to it. The effect is that "done" gets defined separately from however the code happens to come out. Workers build against the contract, validators check against it, and you stop relying on whether the code looks right and start measuring whether it works. Passing Tests Aren't the Same as Working Software You still want lint, type checks, unit tests, and code review. The trouble is that once an agent is shipping whole features on its own, those checks stop being enough. Plenty of changes pass every unit test and are still broken where it counts. The form renders fine, but the submit button does nothing. The endpoint returns exactly the right shape, filled with stale data. A flow that worked in isolation falls apart once it sits behind a login. A migration runs clean on a laptop and chokes on production-scale data. So the better systems add a validator that works more like a QA engineer than a linter. 
It launches the app, clicks around, fills in forms, and confirms the whole path works end to end. That's slow, and on a long task it's where most of the wall-clock time goes: not generating tokens, but waiting on a live application to do something and watching what it does. The trade is worth it, since generating code quickly without really checking it only gets you to the wrong answer faster. In one production run an engineer at Factory described, building a clone of Slack, the project finished with about half its lines of code being tests, and roughly 90% coverage, and the validation step never passed on its first try. That last part is the whole reason the loop exists. Long Runs Can't Rely on Memory Run something for hours or days and context starts leaking between the agents. A bigger context window doesn't really fix it. What helps is not letting a worker close out a task by simply announcing it's done. Instead, each worker leaves a written handoff: what it built, which files it touched, which commands it ran and how they exited, what it assumed along the way, what it ran into, and what it left unfinished. That makes the run auditable. When validation fails, the orchestrator reads back through the handoffs, works out where things went sideways, scopes the fix, and pulls the run back on track at the next milestone instead of discovering the mess at the very end. The teams who make this work don't count on their agents remembering anything; they write enough down that the next agent can safely pick up where the last one stopped. Factory has reported runs lasting as long as sixteen days on this kind of setup. More Agents Isn't More Throughput The instinct is to run everything in parallel. Ten agents should mean ten times the work, right? For software, it usually doesn't play out that way. Agents running at the same time tend to edit the same files, redo work that's already done, and make architectural choices that don't line up with each other. 
The effort of untangling all that eats whatever speed you gained, and you pay for the conflict in tokens on top of it. What works better is to run the actual changes one at a time and save the parallelism for read-only work, like searching the codebase, reading docs, looking up an API, or reviewing code. On paper that's slower. Over a long task it comes out ahead, because you spend far less time cleaning up conflicts, the handoffs stay cleaner, and the whole thing behaves more predictably. Pile on more agents without coordinating them and you don't get speed so much as a codebase that disagrees with itself. The Right Model in Each Seat These systems also change how you pick models, because no single model is the right choice for every seat. Planning tends to go better with a model that reasons slowly and carefully. Writing code rewards speed and fluency instead. Checking the work rewards something closer to stubbornness: following the instructions exactly and giving nothing the benefit of the doubt. The model that writes the best code is often not the one you'd trust to grade it. There's even a case for running the validator on a different provider, so it doesn't carry the same blind spots as the model that wrote the code. That's the argument for staying model-agnostic. You want to put the right model in each role and swap it out as models get better at particular things, rather than getting stuck with one vendor's weakest area showing up everywhere. It works in the other direction too. A solid scaffolding of contracts, checkpoints, and independent validators can prop up a weaker or open-weight model and get more out of it than it would manage alone. Most of the orchestration in these systems lives in prompts and skills rather than hardcoded logic, which is the reason a new model release tends to make them better instead of obsolete. 
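One way to keep the per-seat model choice explicit is to treat it as data and check it before a run starts. The sketch below is only an assumption about how such a check could look: `Seat`, `check_seats`, and the provider and model labels are made-up names, not any vendor's API.

```python
from dataclasses import dataclass

REQUIRED_ROLES = {"orchestrator", "worker", "validator"}


@dataclass(frozen=True)
class Seat:
    role: str
    provider: str  # hypothetical provider label
    model: str     # hypothetical model id


def check_seats(seats: list[Seat]) -> list[str]:
    """Fail if a role has no model; warn if validator and worker share a provider."""
    by_role = {seat.role: seat for seat in seats}
    missing = REQUIRED_ROLES - by_role.keys()
    if missing:
        raise ValueError(f"no model assigned for: {sorted(missing)}")
    warnings = []
    if by_role["validator"].provider == by_role["worker"].provider:
        warnings.append("validator shares a provider with the worker; blind spots may overlap")
    return warnings


seats = [
    Seat("orchestrator", "provider-a", "careful-reasoner"),
    Seat("worker", "provider-a", "fast-coder"),
    Seat("validator", "provider-b", "strict-checker"),
]
print(check_seats(seats))
```

Because the assignment lives in a list rather than in code paths, swapping a model after a new release is a one-line change.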
The Case for Fewer Agents Everything up to here makes the case for splitting work across agents, so it's only fair to take the strongest counterargument seriously. In 2025, the team behind Devin put out a post titled "Don't Build Multi-Agents," and the heart of it is hard to dismiss. They argue that most multi-agent failures come down to context getting fragmented. When you fan work out to parallel subagents, each one quietly makes its own assumptions, and those assumptions don't reconcile when the pieces come back together. One subagent picks a naming convention, another picks a different one, and you're left with something that reads as coherent but doesn't actually fit. Their advice is to keep one agent on a single thread and compress the context as it grows instead of spreading it across a crowd of workers. Anthropic landed somewhere close, though more conditional, when it wrote up its own multi-agent research system around the same time. Splitting work across agents paid off for broad, parallel tasks like searching many sources at once, but it struggled on anything that needed one shared context and tight coordination, which is most of what software work is. Both write-ups end up pointing at the same shape described here. Don't run agents in parallel on tightly coupled work. Split the work by role, and let the coupled parts happen in order. What the Failure Data Shows This isn't only field intuition, either. In 2025, a group at Berkeley published a study called "Why Do Multi-Agent LLM Systems Fail?" that went through failure traces from several well-known frameworks and grouped what went wrong. What stood out was where the failures landed. They mostly weren't about the model being too weak. 
They were about design, with agents given vague roles or ignoring the roles they had; about coordination, with one agent sitting on information another needed or a conversation getting reset partway through; and about verification, with work marked finished that nobody really checked, or a run quitting too early. Those are the same three places this whole architecture tries to shore up, with clear roles, written handoffs, and validators that don't simply take an agent at its word. There's also hard evidence that giving each worker fresh context is more than tidiness. The "lost in the middle" research found that models pay the most attention to the start and end of their context and the least to whatever sits in the middle. Later work on "context rot" found accuracy slipping as the input gets longer, even on simple lookups. A worker drowning in a long accumulated history is a real, measured liability, not a theoretical one, and handing each worker a clean slate keeps the model working in the range where it's actually reliable. The Bill Comes Due It's easy to underestimate what these systems cost. More agents running for longer means a lot more tokens. Anthropic reported that a single agent already burns through several times the tokens of an ordinary chat, and a multi-agent system can use roughly an order of magnitude more on top of that. That only pencils out on work that's worth the spend. Running a multi-agent system to fix a typo is just an expensive way to fix a typo. A couple of things keep it in check. One is prompt caching. A long run reads the same stable context over and over, the system prompt, the codebase, the plan, and caching that material so it isn't reprocessed every time cuts the bill sharply, which is why anyone running these in production leans on it. The other is the serial discipline from earlier: every conflict you don't create is a repair cycle you don't pay for, and repair cycles are where a lot of tokens quietly disappear. 
How much these systems cost is mostly a design question, not a billing one. A Bigger Attack Surface Security rarely shows up on the architecture diagram, and every agent you add is another door. Even a single agent has a well-known soft spot in prompt injection, where instructions tucked into a web page or a file or a tool's output get read as commands rather than data. Add more agents and the problem grows. A poisoned document that one worker reads can smuggle instructions through a handoff into another worker with more access, or one that touches production directly. The shared state and the messages agents pass around become a channel an attacker can aim at on purpose. This is the kind of thing you build in from the start, because it's painful to bolt on later. The same controls that keep these systems correct also keep them safer. Validators that won't take an agent's own word for it, handoffs that record exactly which commands ran and what came back, limits on what any single worker is allowed to reach, all of that doubles as containment, so one compromised step can't quietly become a compromised system. The audit trail that helps a run recover from its own mistakes is the same one you'll be glad to have when something goes wrong on purpose. Where This Leaves the Engineer None of this puts engineers out of work. It moves the work up a level. Instead of hand-driving every step of an implementation, you spend your time deciding what should get built, what the real constraints are, what counts as correct, which parts of the architecture are worth protecting, and when a human has to sign off. It feels more like running a delivery operation than like chatting with a bot. And the biggest gain usually isn't speed. It's keeping several streams of work moving at once without quality slipping, and often ending up with a codebase in better shape than when you started, since the tests and checks and handoffs all become part of what ships. 
The real skill is knowing when to reach for any of this. For a small, contained change, one good agent on a single thread is simpler and cheaper and less likely to wander off. For serious delivery at scale, you need the planning and checking and recovery that a team provides, and the only way agents can do that work is inside the same kind of structure a team uses: real roles, a shared definition of done agreed before anyone starts, honest handoffs, shared state, and execution kept under control rather than just turned up to full speed.
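The validation contract and the written handoff described earlier can be sketched as two small data structures and one rule for "done". This is a minimal sketch under stated assumptions: the names `Assertion`, `Handoff`, and `feature_done`, and every field on them, are hypothetical and not taken from any particular tool.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Assertion:
    id: str
    feature: str
    description: str  # behavior agreed on before any code exists


@dataclass
class Handoff:
    feature: str
    files_touched: list[str]
    commands: dict[str, int]  # command -> exit code
    assumptions: list[str] = field(default_factory=list)
    unfinished: list[str] = field(default_factory=list)


def feature_done(feature: str, contract: list[Assertion],
                 verified_ids: set[str], handoff: Handoff) -> bool:
    """Done = every assigned assertion verified, clean exits, nothing left open."""
    required = {a.id for a in contract if a.feature == feature}
    commands_ok = all(code == 0 for code in handoff.commands.values())
    return (
        handoff.feature == feature
        and required <= verified_ids
        and commands_ok
        and not handoff.unfinished
    )


contract = [
    Assertion("A1", "signup", "form submits and creates a user"),
    Assertion("A2", "signup", "duplicate email is rejected"),
    Assertion("B1", "billing", "invoice total matches line items"),
]
handoff = Handoff("signup", ["api/users.py"], {"pytest tests/signup": 0})

print(feature_done("signup", contract, {"A1"}, handoff))
print(feature_done("signup", contract, {"A1", "A2"}, handoff))
```

The definition of done here depends only on the contract and the recorded evidence, not on how the code happened to come out.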

By Jithu Paulose

Apache Spark Query Optimization on Databricks: Catalyst, AQE, and Photon Engine

Why Query Optimization Matters

A Spark query written by a human and a Spark query executed by the engine are often very different things. The gap between them, the optimization, is what separates a job that runs in 3 minutes from one that runs in 3 hours on identical hardware. Databricks compounds Spark's native Catalyst optimizer with two additional layers:

- Adaptive Query Execution (AQE) – re-optimizes the query at runtime using actual statistics collected mid-job
- Photon – a C++ vectorized execution engine that replaces the JVM-based Spark executor for eligible operators

Understanding all three lets you write queries that cooperate with the engine rather than fight it.

The Catalyst Optimizer Pipeline

Catalyst is Spark's rule-based and cost-based query optimizer. Every query, whether written in SQL, DataFrame API, or Dataset API, passes through the same four-stage pipeline before a single byte of data is read.

Stage 1: Parsing - From SQL to Unresolved Logical Plan

Python
# ── Catalyst Stage 1: Parsing ──────────────────────────────────────────────
# Spark uses ANTLR4 to parse SQL into an Abstract Syntax Tree (AST).
# At this point column names are NOT validated - the plan is "unresolved".
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

# Both of these produce equivalent plans (only the output column name differs)
df_api = (
    spark.table("prod.silver.events_clean")
    .filter("event_type = 'purchase'")
    .groupBy("platform")
    .agg({"revenue": "sum"})
)

sql_api = spark.sql("""
    SELECT platform, SUM(revenue) AS total_revenue
    FROM prod.silver.events_clean
    WHERE event_type = 'purchase'
    GROUP BY platform
""")

# Inspect the unresolved logical plan (before analysis).
# mode="extended" prints the parsed, analyzed, optimized and physical plans;
# mode="formatted" prints only the physical plan.
df_api.explain(mode="extended")
# Output includes:
# == Parsed Logical Plan ==
# 'Aggregate ['platform], ['platform, unresolvedAlias('sum('revenue), None)]
# +- 'Filter ('event_type = 'purchase)
#    +- 'UnresolvedRelation [prod, silver, events_clean]

The key insight here: UnresolvedRelation and unresolvedAlias mean Spark hasn't touched the catalog yet. Column names could be typos at this point and Catalyst doesn't know.

Stage 2: Analysis - Binding to the Catalog

The Analyzer walks the unresolved AST and looks up every relation and attribute against the Catalog (in Databricks, this is Unity Catalog). It resolves column names, infers data types, validates references, and binds functions.

Python
# ── Catalyst Stage 2: Analysis ─────────────────────────────────────────────
# After analysis, every column is resolved to a specific attribute with a type.
# AnalysisException is thrown HERE if a column doesn't exist.
from pyspark.sql import functions as F
from pyspark.sql.utils import AnalysisException

# Example of what Analysis catches:
try:
    spark.table("prod.silver.events_clean") \
        .select("nonexistent_column") \
        .show()
except AnalysisException as e:
    print(f"Analysis failed: {e}")
    # → AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION]
    # A column or function parameter with name `nonexistent_column` cannot be resolved.
# After successful analysis, inspect the resolved plan
df = (
    spark.table("prod.silver.events_clean")
    .filter(F.col("event_type") == "purchase")
    .select("platform", "revenue", "user_id")
)

# The analyzed plan shows fully qualified attribute IDs like:
# == Analyzed Logical Plan ==
# platform: string, revenue: double, user_id: string
# Project [platform#42, revenue#67, user_id#31]
# +- Filter (event_type#39 = purchase)
#    +- Relation prod.silver.events_clean[...] parquet
print(df._jdf.queryExecution().analyzed())

Stage 3: Logical Optimization - Rule-Based Rewrites

This is where Catalyst applies its 100+ built-in rules to produce an equivalent but cheaper logical plan. Rules fire repeatedly in fixed-point iteration until the plan stabilises.

Python
# ── Catalyst Stage 3: Key Optimization Rules ───────────────────────────────

# RULE 1: Predicate Pushdown
# Catalyst moves filters as close to the data source as possible,
# so Spark reads fewer rows from Parquet.
df_before = (
    spark.table("prod.silver.events_clean")
    .join(
        spark.table("prod.silver.users_clean"),
        on="user_id"
    )
    .filter(F.col("event_type") == "purchase")  # ← filter AFTER join
)

# Catalyst rewrites this internally as if you wrote:
df_after_equivalent = (
    spark.table("prod.silver.events_clean")
    .filter(F.col("event_type") == "purchase")  # ← filter BEFORE join
    .join(
        spark.table("prod.silver.users_clean"),
        on="user_id"
    )
)
# Result: potentially millions fewer rows shuffled during the join

# RULE 2: Column Pruning
# Catalyst removes columns not needed by downstream operators.
# Even if you select(*), Spark will only read the columns it needs.
df_pruned = (
    spark.table("prod.silver.events_clean")
    .select("*")
    .filter(F.col("event_type") == "purchase")
    .groupBy("platform")
    .agg(F.sum("revenue").alias("total_revenue"))
)
# Internally, Catalyst prunes all columns except: event_type, platform, revenue

# RULE 3: Constant Folding
# Expressions with only literals are evaluated at plan time, not per-row.
df_constants = spark.range(1000).select(
    # Built from Column literals so that Catalyst (not Python) does the folding
    (F.lit(2) + F.lit(3) * F.lit(4)).alias("always_14"),  # folded to a single literal 14 at plan time
    (F.col("id") * F.lit(1)).alias("same_id"),  # parentheses make the alias cover the whole product
)

# RULE 4: Boolean Simplification
# AND/OR chains with tautologies or contradictions are collapsed
df_simplified = spark.range(100).filter(
    (F.col("id") > 10) & F.lit(True)  # simplified to just (col("id") > 10)
)

# See all optimizations applied:
print(df_pruned._jdf.queryExecution().optimizedPlan())

Stage 4: Physical Planning - Strategies and Cost Models

The physical planner maps each logical operator to one or more physical implementations and selects the best one using a cost model. The most impactful decision here is join strategy selection.

Python
# ── Catalyst Stage 4: Physical Planning & Join Strategies ──────────────────

# JOIN STRATEGY 1: Broadcast Hash Join (BHJ)
# Best when one side is small enough to fit in executor memory.
# No shuffle: the small table is broadcast to all workers.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10mb")  # default

large_df = spark.table("prod.silver.events_clean")   # 500GB
small_df = spark.table("prod.gold.product_catalog")  # 8MB ← will be broadcast

result_bhj = large_df.join(small_df, on="product_id")  # BHJ auto-selected

# Force BHJ with a broadcast hint (overrides threshold check):
from pyspark.sql.functions import broadcast
result_forced = large_df.join(broadcast(small_df), on="product_id")

# JOIN STRATEGY 2: Sort Merge Join (SMJ)
# Default for large-large joins. Both sides are sorted and merged.
# Requires a full shuffle: expensive but handles any size.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")  # disable BHJ

large_df2 = spark.table("prod.silver.transactions_clean")  # 200GB
result_smj = large_df.join(large_df2, on="user_id")  # SMJ selected

# JOIN STRATEGY 3: Shuffle Hash Join (SHJ)
# Hash-based, no sort. Chosen by AQE when one side is much smaller
# than the other but still above the broadcast threshold.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")

# WHOLE-STAGE CODEGEN: Spark fuses multiple operators into a single
# Java function to avoid virtual dispatch overhead and intermediate objects.
# Verify it's active in your plan:
spark.conf.set("spark.sql.codegen.wholeStage", "true")  # default

# Plans are built lazily with the config in effect at explain/execute time.
# The threshold was set to -1 above, so use the hinted join, which still plans as a BHJ.
result_forced.explain()
# Look for: *(1) BroadcastHashJoin, where the *(N) prefix = WholeStageCodegen stage N
# (with AQE on, the prefix may only appear in the final plan after the query has run)

Adaptive Query Execution (AQE)

AQE is the most impactful runtime optimization layer on Databricks; it is part of Apache Spark 3.0+ and enabled by default on recent runtimes. It collects statistics from materialized shuffle map output at shuffle boundaries and uses them to make three key decisions after data has been partially processed.

Python
# ── AQE Configuration ──────────────────────────────────────────────────────
# AQE is ON by default in Databricks Runtime 7.3+
spark.conf.set("spark.sql.adaptive.enabled", "true")

# 1. Dynamic Partition Coalescing
# Merges small post-shuffle partitions to avoid thousands of tiny tasks
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128mb")
spark.conf.set("spark.sql.adaptive.coalescePartitions.minPartitionNum", "1")

# 2. Dynamic Join Strategy Switching
# AQE can convert SMJ → BHJ at runtime if a side turns out small.
# The local shuffle reader lets the converted join read shuffle files locally.
spark.conf.set("spark.sql.adaptive.localShuffleReader.enabled", "true")
# AQE broadcast threshold (can be higher than static threshold since
# we now KNOW the actual size)
spark.conf.set("spark.sql.adaptive.autoBroadcastJoinThreshold", "30mb")

# 3. Skew Join Optimization
# Splits oversized partitions and replicates the matching partition of the other side
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")  # 5x median
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256mb")

# Verify AQE decisions in the query plan:
df = (
    spark.table("prod.silver.events_clean")
    .join(spark.table("prod.silver.users_clean"), on="user_id")
    .groupBy("platform")
    .agg(F.sum("revenue").alias("total"))
)
df.explain(mode="formatted")
# Look for: AdaptiveSparkPlan isFinalPlan=true
#           (it reads isFinalPlan=false until the query has actually executed)
# and: == Final Physical Plan == (shows post-AQE decisions)

The Photon Engine

Photon is Databricks' native vectorized query engine written in C++. It replaces the JVM-based Spark executor for eligible operations, processing data in column-oriented batches (vectors) rather than row-by-row.

Python
# ── Photon Configuration & Verification ────────────────────────────────────
# Photon is available on Databricks Runtime 9.1+ with Photon-enabled clusters.
# Enable it at the cluster level (UI: Cluster > Configuration > Enable Photon)
# or via config:
spark.conf.set("spark.databricks.photon.enabled", "true")

# Photon-accelerated operators (as of DBR 13.x):
# ✅ Scan (Parquet, Delta)     ✅ Filter / Project
# ✅ Hash Aggregate            ✅ Sort
# ✅ Broadcast Hash Join       ✅ Sort Merge Join
# ✅ Window functions          ✅ Union / Expand
# ✅ String functions          ✅ Math functions
# ❌ UDFs (Python/Scala)       ❌ Some complex types
# ❌ Streaming (partial)       ❌ RDD-based operations

# Verify Photon is executing your query:
df = spark.sql("""
    SELECT
        platform,
        DATE_TRUNC('month', event_ts) AS month,
        SUM(revenue) AS total_revenue,
        COUNT(DISTINCT user_id) AS unique_buyers,
        AVG(revenue) AS avg_order_value
    FROM prod.silver.events_clean
    WHERE event_type = 'purchase'
      AND event_ts >= '2024-01-01'
    GROUP BY platform, DATE_TRUNC('month', event_ts)
    ORDER BY month DESC, total_revenue DESC
""")
df.explain(mode="formatted")

# Look for operators prefixed with "Photon" in the physical plan:
# == Physical Plan ==
# PhotonResultStage
# +- PhotonSort [month DESC NULLS LAST, total_revenue DESC NULLS LAST]
#    +- PhotonShuffleExchangeSink hashpartitioning(platform, month)
#       +- PhotonGroupingAgg [platform, month], [sum(revenue), count(user_id), avg(revenue)]
#          +- PhotonFilter (event_type = purchase AND event_ts >= 2024-01-01)
#             +- PhotonScan parquet prod.silver.events_clean

# Photon performance metrics appear in Spark UI under "Photon Metrics":
# - Photon scan time
# - Photon total compute time
# - Rows processed by Photon vs fallback JVM

Reading Explain Plans

The explain(mode="formatted") output is your primary debugging tool.
Here's how to read it efficiently:

Python
# ── Explain Plan Modes ─────────────────────────────────────────────────────
df = (
    spark.table("prod.silver.events_clean")
    .filter(F.col("event_type") == "purchase")
    .join(broadcast(spark.table("prod.gold.product_catalog")), on="product_id")
    .groupBy("platform", "category")
    .agg(
        F.sum("revenue").alias("total_revenue"),
        F.count("*").alias("transaction_count")
    )
)

# Mode 1: simple (default) - compact tree
df.explain()

# Mode 2: extended - all 4 plan stages (parsed, analyzed, optimized, physical)
df.explain(mode="extended")

# Mode 3: formatted - human-readable with operator details (RECOMMENDED)
df.explain(mode="formatted")

# Mode 4: cost - includes estimated row counts and sizes (requires ANALYZE TABLE)
df.explain(mode="cost")

# Mode 5: codegen - shows generated Java code for WholeStageCodegen
df.explain(mode="codegen")

# ── Key Signals to Look For ────────────────────────────────────────────────
# ✅ GOOD signs:
#   *(N) prefix          → WholeStageCodegen active (operators fused)
#   BroadcastHashJoin    → small table correctly broadcast, no shuffle
#   PhotonXxx            → Photon accelerating this operator
#   AdaptiveSparkPlan    → AQE is engaged
#   PartitionFilters     → Delta/Parquet file skipping active
#   PushedFilters        → filters pushed to Parquet reader
# ❌ WARNING signs:
#   Exchange (shuffle)   → unexpected shuffle (missing broadcast hint?)
#   SortMergeJoin        → large-large join (may need Z-ORDER or AQE tuning)
#   HashAggregate x2     → partial + final agg = shuffle involved
#   CartesianProduct     → missing join condition! Will OOM on large tables
#   ObjectHashAggregate  → non-codegen path, JVM overhead
#   GenerateXxx          → explode() or similar, can't be fused

# ── ANALYZE TABLE: feed statistics to CBO ──────────────────────────────────
# Without stats, Catalyst falls back to coarse size-based estimates.
# Run ANALYZE to give the Cost-Based Optimizer real numbers.
spark.sql("ANALYZE TABLE prod.silver.events_clean COMPUTE STATISTICS")
spark.sql("""
    ANALYZE TABLE prod.silver.events_clean
    COMPUTE STATISTICS FOR COLUMNS user_id, event_type, platform, revenue
""")
# Now explain(mode="cost") shows real row counts and sizes

Tuning Reference Table

A quick-reference guide for the most impactful Spark/Databricks configs, what they control, and when to change them:

| Config Key | Default | What It Controls | When to Tune |
| --- | --- | --- | --- |
| spark.sql.adaptive.enabled | true | Master AQE switch | Keep on; only disable for debugging |
| spark.sql.adaptive.advisoryPartitionSizeInBytes | 64mb | Target post-coalesce partition size | Increase to 128mb–256mb for large shuffles |
| spark.sql.adaptive.skewJoin.enabled | true | AQE skew split | Keep on; tune skewedPartitionFactor if needed |
| spark.sql.autoBroadcastJoinThreshold | 10mb | Static BHJ threshold | Increase to 50mb–100mb if executor memory allows |
| spark.sql.adaptive.autoBroadcastJoinThreshold | 30mb | AQE runtime BHJ threshold | Increase if AQE isn't catching small tables |
| spark.sql.shuffle.partitions | 200 | Default shuffle partition count | Set to 8 × num_cores for your cluster |
| spark.sql.files.maxPartitionBytes | 128mb | Max bytes per Parquet read partition | Reduce for high-parallelism scans |
| spark.databricks.photon.enabled | true | Photon vectorized engine | Keep on; disable only for UDF-heavy jobs |
| spark.sql.codegen.wholeStage | true | Whole-Stage CodeGen fusion | Keep on; disable only for debugging |
| spark.sql.statistics.histogram.enabled | false | Column histograms for CBO | Enable before running ANALYZE TABLE ... FOR COLUMNS, since histograms are built while column statistics are computed |
| spark.sql.cbo.enabled | true | Cost-Based Optimizer | Keep on; requires ANALYZE TABLE to be useful |
| spark.databricks.delta.optimizeWrite.enabled | true | Auto bin-pack write files | Keep on for all Delta writes |

Key Takeaways

- Catalyst has four stages: Parse → Analyze → Optimize → Plan. Each stage has a distinct job, and understanding them tells you exactly where to look when a query misbehaves.
- Predicate pushdown and column pruning are the two most impactful automatic optimizations: they reduce the data volume Spark has to move before any aggregation or join.
- AQE is not a set-and-forget feature: tune advisoryPartitionSizeInBytes to your actual data sizes, and verify its decisions with explain(mode="formatted") by looking for AdaptiveSparkPlan isFinalPlan=true after the query has run.
- Photon drops in transparently for most SQL and DataFrame operations. The exceptions are Python UDFs, RDD operations, and some complex types; refactor these away from hot paths.
- Run ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS on your most-joined tables. The CBO's join ordering and strategy decisions improve dramatically with real statistics vs. default estimates.
- explain(mode="formatted") is your most important debugging tool. Learn to read it before reaching for cluster config changes.

References

- Apache Spark: Catalyst Optimizer (Deep Dive Paper, Armbrust et al., SIGMOD 2015)
- Databricks: Adaptive Query Execution
- Apache Spark Docs: Adaptive Query Execution
- Databricks: Photon Runtime
- Databricks Blog: Photon: A Fast Query Engine for Lakehouse Systems
- Databricks: Cost-Based Optimizer
- Apache Spark: Performance Tuning Guide
- Databricks: Broadcast Join Hints
- "Photon: A Fast Query Engine for Lakehouse Systems" (Behm et al., SIGMOD 2022)
- Spark by Examples: Explain Plan Modes
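To make two of the AQE settings above more concrete, here is a plain-Python sketch (no Spark required) of the decisions they drive. The skew test follows the documented rule: a partition counts as skewed when it is larger than skewedPartitionFactor times the median partition size and also larger than skewedPartitionThresholdInBytes. The coalescing function is a simplified greedy approximation, not Spark's actual implementation, and both function names are hypothetical.

```python
from statistics import median

MB = 1024 * 1024


def skewed_partitions(sizes, factor=5, threshold=256 * MB):
    """Indexes of partitions that exceed both factor * median and the byte threshold."""
    med = median(sizes)
    return [i for i, size in enumerate(sizes) if size > factor * med and size > threshold]


def coalesce(sizes, advisory=128 * MB):
    """Greedily merge adjacent partitions without exceeding the advisory size (simplified)."""
    groups, current, current_size = [], [], 0
    for i, size in enumerate(sizes):
        if current and current_size + size > advisory:
            groups.append(current)
            current, current_size = [], 0
        current.append(i)
        current_size += size
    if current:
        groups.append(current)
    return groups


shuffle_sizes = [10 * MB, 20 * MB, 30 * MB, 900 * MB, 15 * MB]
print(skewed_partitions(shuffle_sizes))  # only the 900 MB partition qualifies

small_sizes = [40 * MB, 50 * MB, 30 * MB, 100 * MB, 20 * MB]
print(coalesce(small_sizes))  # five partitions become two tasks
```

Note that a 200 MB partition in the first list would be more than five times the median but still under the 256 MB threshold, so it would not be split; both conditions must hold.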

By Jubin Abhishek Soni

CORE

The Inter-Agent Protocol Problem

Every major agent framework now has a story for multi-agent systems. Most of them are incompatible with each other. An agent built in AutoGen cannot natively receive a task from a deepagents orchestrator. An OpenAI Agents SDK agent cannot talk to a LangGraph subgraph. A CrewAI crew cannot delegate to a Pydantic AI team without custom glue code. This is the inter-agent protocol problem: we have multi-agent frameworks, but no agreed-upon protocol for agents to communicate across frameworks. This post breaks down the four main approaches, compares them, and examines the need for standardization.

Why Is Inter-Agent Communication Hard?

Single-agent systems have a clean interface: you send a prompt, you get a response. Multi-agent systems need to express:

- Task delegation – what work is being handed off and to whom
- Context transfer – what background the receiving agent needs
- State propagation – what the delegating agent needs back
- Error and cancellation – what happens when the receiving agent fails
- Streaming – how partial results flow back during long tasks

Most frameworks solve some of these, but with different schemas, different transport assumptions, and no interoperability.

The 4 Current Approaches

1. ACP (Agent Communication Protocol): deepagents

Installation:

Shell
pip install deepagents-acp
# Spec: https://github.com/langchain-ai/deepagents/tree/main/libs/acp
# deepagents-acp version: 0.0.6 (requires agent-client-protocol>=0.8.0)

ACP is an open protocol: a published JSON schema for agent-to-agent communication. LangChain's deepagents ships deepagents-acp, a Python client and server implementation, as a separate package, so any framework can implement it.
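To see why a shared schema matters, it can help to write down the five concerns listed earlier as a framework-neutral message envelope. The sketch below is purely illustrative: `DelegationRequest`, `DelegationEvent`, and their fields are hypothetical names invented for this example and are not the ACP schema or any framework's wire format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class DelegationRequest:
    task_id: str
    to_agent: str                                         # task delegation: to whom
    instructions: str                                     # task delegation: what work
    context: dict[str, Any] = field(default_factory=dict) # context transfer
    expects: list[str] = field(default_factory=list)      # state the delegator needs back


@dataclass
class DelegationEvent:
    task_id: str
    kind: str  # "partial" (streaming), "result", "error", or "cancelled"
    payload: dict[str, Any] = field(default_factory=dict)


def encode(message) -> str:
    """Serialize any envelope to JSON with an explicit type tag."""
    return json.dumps({"type": type(message).__name__, **asdict(message)}, sort_keys=True)


def missing_fields(request: DelegationRequest, result: DelegationEvent) -> list[str]:
    """State propagation check: which promised fields did the worker not return?"""
    return [key for key in request.expects if key not in result.payload]


request = DelegationRequest(
    task_id="t-1",
    to_agent="research",
    instructions="Summarize recent changes to the billing API",
    context={"repo": "example/billing"},
    expects=["summary", "sources"],
)
result = DelegationEvent(task_id="t-1", kind="result", payload={"summary": "..."})

print(encode(request))
print(missing_fields(request, result))
```

Two frameworks that agreed on even this much could exchange tasks without sharing code, which is the gap a published protocol is meant to fill.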
ACP Message Flow

A deepagents orchestrator routing a task to an ACP-compatible worker:

Python
from deepagents import create_deep_agent
from deepagents.middleware import AsyncSubAgentMiddleware, AsyncSubAgent

# Declare an ACP-compatible worker agent
research_worker = AsyncSubAgent(
    name="research",
    description="Deep research agent: use for tasks requiring web search and synthesis",
    url="http://research-agent:8080",  # ACP server endpoint
)

# The orchestrator routes tasks to workers via the ACP protocol automatically
orchestrator = create_deep_agent(
    model="anthropic:claude-sonnet-4-6",
    middleware=[
        AsyncSubAgentMiddleware(subagents=[research_worker]),
    ],
)

Serving a deepagents agent as an ACP endpoint:

Python
import asyncio

from acp import run_agent as run_acp_agent
from deepagents import create_deep_agent
from deepagents_acp.server import AgentServerACP

agent = create_deep_agent(model="anthropic:claude-haiku-4-5")

# Wrap the compiled graph and expose it over ACP
acp_agent = AgentServerACP(agent=agent)
asyncio.run(run_acp_agent(acp_agent))

What makes ACP different: It's an open schema, not a framework-internal call. Any framework can implement an ACP server or client, which means a CrewAI crew could delegate to a deepagents worker over ACP without any shared code.

Current limitation: Adoption is early. As of v0.9.0, the primary implementations are deepagents-native. Support for other frameworks requires each to implement the server interface independently.

2. Handoffs: OpenAI Agents SDK

Installation:

Shell
pip install openai-agents
# Docs: https://openai.github.io/openai-agents-python/handoffs/

OpenAI's SDK handles delegation through handoffs: an agent declares which other agents it can hand off to. At runtime, the orchestrator agent decides when to delegate, and execution transfers to the target agent.

Python
from agents import Agent, Runner, handoff

# web_search_tool and run_code_tool are assumed to be defined elsewhere
research_agent = Agent(
    name="ResearchAgent",
    instructions="You are an expert at web research. Answer research questions thoroughly.",
    tools=[web_search_tool],
)

code_agent = Agent(
    name="CodeAgent",
    instructions="You write and review Python code.",
    tools=[run_code_tool],
)

orchestrator = Agent(
    name="Orchestrator",
    instructions=(
        "Route tasks to the right specialist. "
        "Use ResearchAgent for questions requiring web search. "
        "Use CodeAgent for programming tasks."
    ),
    handoffs=[
        handoff(research_agent),
        handoff(code_agent, tool_name_override="delegate_to_coder"),
    ],
)

# Call from inside an async function (or an environment with top-level await)
result = await Runner.run(orchestrator, "Write a Python script to fetch weather data")

You can also add callbacks and filters to handoffs:

Python
from agents import handoff, RunContextWrapper

def on_handoff_to_research(ctx: RunContextWrapper[None]) -> None:
    # Without an input_type, the SDK calls on_handoff with the run context only
    print("Handing off to research agent")

research_handoff = handoff(
    research_agent,
    on_handoff=on_handoff_to_research,
    input_filter=lambda inp: inp,  # identity filter; replace it to transform the input passed along
)

Limitation: Handoffs are framework-internal. A handoff target must be an Agent instance from the same SDK. There's no published schema, no HTTP transport, no cross-framework interoperability. If your orchestrator is a LangGraph graph and your specialist is an OpenAI agent, handoffs can't bridge them.

3. Agent Delegation via Tools: Pydantic AI

Installation:

Shell
pip install pydantic-ai
# Docs: https://pydantic.dev/docs/ai/guides/multi-agent-applications/

Pydantic AI does not have a dedicated inter-agent messaging primitive. The idiomatic multi-agent pattern is agent delegation via tools: a specialist agent is called from within a tool of the orchestrator agent, and the result is returned to the orchestrator like any other tool return value.

Python
from pydantic_ai import Agent, RunContext

research_agent = Agent(
    "anthropic:claude-haiku-4-5",
    system_prompt="You are a research specialist. Answer questions with citations.",
)

code_agent = Agent(
    "anthropic:claude-sonnet-4-6",
    system_prompt="You are a Python expert. Write clean, tested code.",
)

orchestrator = Agent(
    "anthropic:claude-sonnet-4-6",
    system_prompt="Route tasks to the right specialist tool.",
)

@orchestrator.tool
async def research(ctx: RunContext[None], query: str) -> str:
    """Delegate a research question to the research specialist."""
    result = await research_agent.run(query, usage=ctx.usage)
    return result.output

@orchestrator.tool
async def write_code(ctx: RunContext[None], task: str) -> str:
    """Delegate a coding task to the code specialist."""
    result = await code_agent.run(task, usage=ctx.usage)
    return result.output

# Call from inside an async function (or an environment with top-level await)
result = await orchestrator.run(
    "Research quantum computing and write a Python simulation"
)

Passing usage=ctx.usage propagates token accounting from the delegate run back to the parent, so result.usage() on the orchestrator's run covers all sub-agent calls.

Limitation: Delegation is framework-internal: all agents must be Pydantic AI Agent instances. There is no pub-sub bus, no HTTP transport, and no cross-framework protocol. For true cross-framework delegation, a pattern like ACP is required.

4. AgentProtocol Backbone: Agno

Installation:

Shell
pip install agno
# Docs: https://docs.agno.com/introduction

Agno takes the most ambitious approach: an AgentProtocol backbone that acts as a multi-framework adapter.
The goal is to let agents from different frameworks (LangGraph, DSPy, Claude's SDK) plug into the same Agno team:

Python
from agno.agent import Agent
from agno.team import Team
from agno.models.anthropic import Claude

# Native Agno agent
research_agent = Agent(
    name="Researcher",
    model=Claude(id="claude-haiku-4-5"),
    tools=[web_search_tool],
    description="Specializes in web research and synthesis.",
)

# Native Agno agent
code_agent = Agent(
    name="Coder",
    model=Claude(id="claude-sonnet-4-6"),
    tools=[python_repl_tool],
    description="Specializes in Python development.",
)

team = Team(
    name="FullStackTeam",
    mode="coordinate",
    members=[research_agent, code_agent],
    model=Claude(id="claude-sonnet-4-6"),
    db=agno_db_storage,  # sessions + memory persistence
    enable_agentic_state=True,
)

team.print_response(
    "Research quantum computing trends and prototype a simulation in Python"
)

Limitation: The multi-framework adapter is still maturing. Most production Agno deployments use native Agno agents. The cross-framework vision is real, but the published adapter surface for LangGraph and DSPy is sparse at the time of writing.
Side-by-Side Comparison

Dimension | ACP (deepagents) | Handoffs (OpenAI) | Agent delegation (Pydantic AI) | AgentProtocol (Agno)
Published open schema | ✓ | ✗ | ✗ | Partial
HTTP transport | ✓ | ✗ | ✗ | ✓
Streaming support | ✓ (SSE) | ✓ (run-level) | ✗ | ✓
Cross-framework workers | ✓ (any ACP server) | ✗ | ✗ | Partial
Context/metadata passing | ✓ | ✓ (input_filter) | ✓ (usage propagation) | ✓
Error/cancellation schema | ✓ | Partial | ✗ | ✓
Built-in state persistence | ✗ (LangGraph handles it) | ✗ | ✗ | ✓ (Agno DB)
Production deployments | Early | Growing | Mature | Growing

The Fragmentation Cost in Practice

Consider this real scenario: you have a research pipeline where:
- The orchestrator is a deepagents agent (LangGraph-backed)
- The research worker is a CrewAI crew (good at parallel research tasks)
- The code worker is an OpenAI agent (good at code + sandboxed execution)

Today, wiring this up requires:

Python
import httpx

# `tool` is assumed to be the orchestrator framework's tool decorator and
# ResearchCrew your own CrewAI crew class; neither is defined in this snippet.

# Option 1: Wrap every non-deepagents agent as a plain function tool
# Loses: streaming, cancellation, structured error handling
@tool
def run_crewai_research(query: str) -> str:
    crew = ResearchCrew()
    result = crew.kickoff(inputs={"query": query})
    return str(result)  # no streaming, no structured output

# Option 2: Host each agent as an HTTP service and call it manually
# Loses: shared context, standard error handling, progress tracking
@tool
async def call_openai_agent(task: str) -> str:
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://openai-agent-service/run",
            json={"task": task},
        )
        return response.json()["result"]

With ACP, the same cross-framework delegation looks like this:

Python
from deepagents import create_deep_agent
from deepagents.middleware import AsyncSubAgentMiddleware, AsyncSubAgent

# Any ACP-compliant server, regardless of what framework runs inside
orchestrator = create_deep_agent(
    model="anthropic:claude-sonnet-4-6",
    middleware=[
        AsyncSubAgentMiddleware(subagents=[
            AsyncSubAgent(name="research", url="http://crewai-acp-server:8080"),
            AsyncSubAgent(name="coder", url="http://openai-acp-server:8081"),
        ]),
    ],
)

The wrapping frameworks are invisible.
The protocol standardizes streaming, cancellation, and structured error handling.

Where Is This Heading?

The inter-agent space is actively consolidating around a few patterns:
- ACP is the most explicit attempt at standardization. It bets that an open schema survives framework churn. You can swap the worker implementation without changing the orchestrator.
- Handoffs are winning on simplicity within the OpenAI ecosystem. For teams already on the OpenAI SDK, they're ergonomic and production-proven. The cross-framework limitation only matters if you leave the ecosystem.
- The agent-delegation model (Pydantic AI via agent-as-tool, AutoGen's event-driven redesign) is a better fit for peer networks where no single orchestrator coordinates everything.
- Framework-native protocols will likely remain dominant for the near term. Cross-framework standardization requires enough pain from fragmentation to motivate all players. We're getting there, but not quite yet.

If you're building a multi-agent system today and expect to stay within one framework, pick that framework's native protocol. If you're building infrastructure that needs to orchestrate agents across frameworks, or you expect your team to evaluate multiple frameworks, investing in ACP-compatible interfaces from the start gives you the most flexibility.
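To make the five concerns from the start of this post concrete (task delegation, context transfer, state propagation, error and cancellation, streaming), here is a minimal sketch of what a framework-neutral message envelope could look like. It is not the ACP schema or any framework's wire format: `DelegationMessage`, its field names, and the `kind` values are hypothetical and chosen only for illustration.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Any, Dict

# Hypothetical message kinds, roughly one per concern; not taken from any published spec.
KINDS = {"task", "partial", "result", "error", "cancel"}


@dataclass
class DelegationMessage:
    task_id: str      # correlates every message that belongs to one delegated task
    kind: str         # one of KINDS
    sender: str       # agent that emits the message
    recipient: str    # agent the message is addressed to
    payload: Dict[str, Any] = field(default_factory=dict)   # task input, partial chunk, or result
    context: Dict[str, Any] = field(default_factory=dict)   # background the receiver needs

    def __post_init__(self) -> None:
        if self.kind not in KINDS:
            raise ValueError(f"unknown message kind: {self.kind!r}")

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "DelegationMessage":
        return cls(**json.loads(raw))


def error_reply(request: DelegationMessage, code: str, detail: str) -> DelegationMessage:
    """Build a structured error that travels back to the delegating agent."""
    return DelegationMessage(
        task_id=request.task_id,
        kind="error",
        sender=request.recipient,
        recipient=request.sender,
        payload={"code": code, "detail": detail},
    )


task = DelegationMessage(
    task_id="t-1",
    kind="task",
    sender="orchestrator",
    recipient="research",
    payload={"query": "quantum computing trends"},
    context={"audience": "engineers"},
)
wire = task.to_json()                          # what would cross the HTTP boundary
restored = DelegationMessage.from_json(wire)   # what the worker would reconstruct
failure = error_reply(restored, code="timeout", detail="worker exceeded 30s")
```

The point of the sketch is that errors and partial results share the task's `task_id`, which is exactly what the plain-function and raw-HTTP wrappers shown earlier lose.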

By Ninaad Rao

Why Push-Based Systems Fail at Scale — and How Hybrid Fan-Out Fixes It

Real-time systems look simple on architecture diagrams. A user posts content, the backend publishes an event, and connected users instantly receive notifications through persistent WebSocket connections. At small scale, the model works beautifully. At large scale, it becomes one of the fastest ways to melt distributed infrastructure.

Most push-based architectures fail for one reason: they assume traffic is evenly distributed. Production traffic never is. One user may have 50 followers. Another may have 10 million. Designing both scenarios using the same fan-out strategy creates massive operational problems during peak traffic. That is why large-scale platforms evolved from naive push delivery into hybrid push/pull systems optimized around uneven load distribution.

The Naive Push Architecture

The first design most engineers create is straightforward:
1. A user publishes a post
2. The backend sends the event to a broker
3. WebSocket servers receive the event
4. Notifications are pushed to all connected followers

On paper, the architecture looks clean. The system appears scalable because:
- WebSockets provide real-time delivery
- Brokers decouple services
- Horizontal scaling seems possible

But hidden underneath the simplicity is a dangerous scaling assumption: every user generates similar traffic patterns. That assumption collapses the moment a celebrity account posts.

The Celebrity Fan-Out Problem

Imagine a user with 10 million followers posting a new update. The system now attempts to:
- Generate millions of delivery events
- Route them through brokers
- Maintain millions of active socket writes
- Deliver updates almost simultaneously

The bottleneck is no longer application logic. The bottleneck becomes:
- Broker throughput
- Connection management
- Queue depth
- Network bandwidth
- Retry amplification

This is where many real-time systems fail in production.
As delivery pressure increases:
- Queues begin backing up
- Consumers lag behind
- WebSocket nodes become saturated
- Latency grows from milliseconds into seconds or minutes

Then retries begin. Clients retry because acknowledgments are delayed. Servers retry because deliveries fail. Load balancers redistribute unstable traffic. The system begins amplifying the overload condition itself.

This behavior is common in distributed systems: reliability mechanisms designed to recover from failure end up accelerating collapse under overload.

The architecture appears stable during normal traffic. It fails at the exact moment traffic matters most.

Why Pure Push Architectures Break

The real issue is fan-out-on-write. Every post immediately creates work proportional to follower count. For small accounts, this is inexpensive. For celebrity-scale accounts, a single write operation generates massive downstream pressure:
- Enormous queue pressure
- High-volume socket delivery
- Enormous broker traffic

The system becomes optimized around worst-case fan-out instead of average workload. That is operationally expensive and difficult to stabilize. This is why most large-scale feed systems avoid pure push delivery for all users.

The Hybrid Push/Pull Model

Modern systems solve the problem differently. Instead of treating every account identically, they dynamically switch between:
- Push-on-write
- Pull-on-read

The decision is usually based on follower thresholds.

Push-on-Write for Small Accounts

For smaller accounts:
- Updates are immediately pushed
- Queue workers fan out notifications
- Followers receive low-latency real-time updates

This keeps the user experience fast while infrastructure costs remain manageable.

Pull-on-Read for Large Accounts

For celebrity-scale accounts:
- Posts are stored normally
- Fan-out is avoided
- Feeds are assembled when users open the app

Instead of generating millions of writes immediately, the workload shifts to read time.
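The threshold-based switch between push-on-write and pull-on-read can be sketched in a few lines. This is an in-memory illustration under stated assumptions, not a production design: `HybridFeed`, the `PUSH_FOLLOWER_THRESHOLD` value, and the data layout are all hypothetical, and a real system would back these structures with a broker, a timeline cache, and a post store.

```python
from collections import defaultdict

# Assumed cutoff; real systems tune this against broker and cache capacity.
PUSH_FOLLOWER_THRESHOLD = 10_000


class HybridFeed:
    def __init__(self, threshold: int = PUSH_FOLLOWER_THRESHOLD) -> None:
        self.threshold = threshold
        self.followers = defaultdict(set)          # author -> follower ids
        self.following = defaultdict(set)          # follower -> author ids
        self.posts_by_author = defaultdict(list)   # author -> [(seq, text)]
        self.pushed = defaultdict(list)            # follower -> [(seq, author, text)]
        self._seq = 0

    def follow(self, follower: str, author: str) -> None:
        self.followers[author].add(follower)
        self.following[follower].add(author)

    def is_large(self, author: str) -> bool:
        return len(self.followers[author]) >= self.threshold

    def publish(self, author: str, text: str) -> int:
        """Store the post; fan out on write only for small accounts.

        Returns the number of timeline writes performed at publish time.
        """
        self._seq += 1
        self.posts_by_author[author].append((self._seq, text))
        if self.is_large(author):
            return 0                                # pull-on-read: no fan-out
        for follower in self.followers[author]:
            self.pushed[follower].append((self._seq, author, text))
        return len(self.followers[author])

    def read_feed(self, user: str) -> list:
        """Merge pushed entries with posts pulled from large accounts, newest first."""
        merged = {entry[0]: entry for entry in self.pushed[user]}
        for author in self.following[user]:
            if self.is_large(author):
                for seq, text in self.posts_by_author[author]:
                    merged[seq] = (seq, author, text)   # keyed by seq, so no duplicates
        return [merged[seq] for seq in sorted(merged, reverse=True)]


feed = HybridFeed(threshold=3)
for fan in ("a", "b", "c"):
    feed.follow(fan, "celebrity")
feed.follow("a", "friend")

pushes_small = feed.publish("friend", "hello")          # small account: pushed to 1 follower
pushes_large = feed.publish("celebrity", "big news")    # at threshold: stored only
timeline = feed.read_feed("a")
```

Keying the merge by sequence number also covers the awkward case where an account crosses the threshold after some of its posts were already pushed.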
This dramatically reduces broker pressure and prevents large fan-out storms from destabilizing the platform. Twitter/X publicly discussed similar strategies years ago because global push fan-out becomes prohibitively expensive at scale.

The important engineering insight is: push and pull are not competing architectures. They are complementary scaling strategies selected dynamically based on traffic patterns.

Feed Assembly Introduces New Complexity

Once systems adopt pull-on-read, another problem appears: feed assembly. Now the platform must dynamically build personalized feeds using:
- Follower relationships
- Ranking algorithms
- Muted users
- Blocked accounts
- Recent activity
- Recommendation signals

This shifts complexity from writes to reads. To reduce repeated database work, systems commonly introduce:
- Redis timeline caches
- Materialized feed views
- Asynchronous feed builders
- Hot-feed caching layers

The challenge becomes balancing:
- Freshness
- Latency
- Consistency
- Infrastructure cost
- Cache invalidation

The architecture is no longer just "real-time delivery." It becomes distributed workload management.

WebSockets Make Infrastructure Stateful

Many system design discussions stop once WebSockets are introduced. Production systems become significantly harder after that point. WebSockets create stateful infrastructure. Now the platform must know:
- Which user is connected
- Which server owns the connection
- How to recover missed events after reconnects

This changes routing behavior completely. Requests can no longer be routed blindly across stateless servers. Most systems introduce:
- Sticky sessions
- Session affinity
- Distributed connection registries
- Redis pub/sub coordination

Then mobile networks create another challenge: temporary disconnects. A user loses connectivity for three seconds. What happened during that gap? Without replay recovery, notifications disappear permanently.
Replay Buffers and Recovery Logic

Reliable real-time systems usually implement:
- Sequence IDs
- Replay buffers
- Reconnect checkpoints
- Gap recovery logic

When the client reconnects:
1. It sends the last processed sequence ID
2. The server identifies missing events
3. Replay buffers resend missed messages
4. Live streaming resumes

This is where systems move beyond interview-level architecture. The challenge is no longer simply delivering events. The challenge is maintaining continuity during instability. Real-world distributed systems spend enormous engineering effort handling:
- Partial failures
- Reconnect storms
- Duplicate delivery
- Inconsistent network conditions

Operational Tradeoffs Teams Often Underestimate

One of the biggest mistakes in real-time architectures is optimizing only for delivery speed while ignoring operational cost. Push-heavy systems keep large numbers of persistent connections open simultaneously. At global scale, this introduces pressure across multiple infrastructure layers:
- Connection memory usage
- Broker throughput
- Network egress
- Heartbeat traffic
- Reconnect storms during outages

Even healthy systems can become unstable during regional network disruptions. For example, if thousands of mobile clients reconnect at the same time after a temporary outage, WebSocket gateways may suddenly experience authentication spikes, replay requests, and connection churn simultaneously. This often creates secondary overload events long after the original incident is resolved.

This is why mature systems introduce additional controls such as:
- Connection rate limiting
- Replay window expiration
- Backpressure handling
- Circuit breakers
- Adaptive retry strategies

Another overlooked problem is message ordering. In distributed fan-out systems, messages may arrive out of order because events are processed asynchronously across multiple workers or partitions. Without sequence tracking, users may briefly see inconsistent timelines or duplicate notifications.
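The reconnect flow and the duplicate-delivery concern described above can be sketched together. This is a minimal in-memory illustration, not a reference implementation: `ReplayBuffer`, `Client`, and the `max_events` window are hypothetical names, and the client assumes events arrive in order on a single connection (handling out-of-order arrival would need a reorder window on top).

```python
from collections import deque


class ReplayBuffer:
    """Bounded per-user buffer of recent events, addressed by sequence ID."""

    def __init__(self, max_events: int = 1000) -> None:
        self._events = deque(maxlen=max_events)   # oldest events fall off the left
        self._next_seq = 1

    def append(self, payload: str) -> int:
        seq = self._next_seq
        self._next_seq += 1
        self._events.append((seq, payload))
        return seq

    def replay_after(self, last_seen: int):
        """Return (events, gap); gap is True if some missed events already expired."""
        oldest = self._events[0][0] if self._events else self._next_seq
        gap = last_seen + 1 < oldest
        missed = [event for event in self._events if event[0] > last_seen]
        return missed, gap


class Client:
    """Applies events idempotently: duplicate or stale sequence IDs are ignored."""

    def __init__(self) -> None:
        self.last_seq = 0
        self.received = []

    def deliver(self, seq: int, payload: str) -> bool:
        if seq <= self.last_seq:
            return False            # already processed; safe to drop
        self.received.append(payload)
        self.last_seq = seq
        return True


buffer = ReplayBuffer(max_events=3)
client = Client()

for text in ("e1", "e2"):
    client.deliver(buffer.append(text), text)        # live delivery while connected

for text in ("e3", "e4"):
    buffer.append(text)                               # client offline; events only buffered

missed, gap = buffer.replay_after(client.last_seq)   # reconnect with last processed ID
for seq, text in missed:
    client.deliver(seq, text)

duplicate_applied = client.deliver(4, "e4")           # a retried delivery is ignored
_, stale_gap = buffer.replay_after(0)                 # a client too far behind the window
```

When `gap` is true the replay window has expired for that client, so the server would typically tell it to do a full feed refresh instead of an incremental replay.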
Production-grade systems therefore prioritize the following instead of assuming perfect real-time synchronization:
- Idempotent delivery
- Sequence-aware replay
- Eventual consistency handling

The engineering challenge is not simply pushing events quickly. The challenge is maintaining stability while millions of users interact with the platform under unpredictable traffic conditions.

Final Thoughts

Most distributed systems look elegant until traffic becomes uneven. That is the hidden reality behind large-scale architecture. The difficult part is not handling average load. The difficult part is surviving pathological load without collapsing the platform. Real systems evolve through operational pain:
- Broker saturation
- Retry storms
- Replay failures
- Queue buildup
- Cascading latency amplification

The best architectures are rarely the simplest ones. They are the ones that continue functioning when the system is under maximum stress. In distributed systems, every design is ultimately a negotiation between:
- Latency
- Throughput
- Durability
- Availability
- Cost

Those forces shape every scalable platform on the internet. The systems that survive at scale are not the ones with the cleanest diagrams. They are the ones designed to absorb failure without collapsing under pressure.

References
- Apache Kafka Documentation
- Redis Pub/Sub Documentation
- WebSocket Protocol RFC 6455
- Twitter Scalability Architecture Discussion
- Designing Data-Intensive Applications by Martin Kleppmann
- Google SRE Book: Handling Overload

By Jayapragash Dakshnamurthy

One Stolen Key, One Stolen Token: Why Machine Identity Is Cloud-Native's Quietest Crisis — and the Only Fix That Actually Holds

On December 2, 2024, a security vendor called BeyondTrust noticed something wrong inside its own AWS account. By the time the investigation closed, the story that emerged was almost absurdly simple for something with this much fallout: an attacker — later attributed to the Chinese state-sponsored group Silk Typhoon — had used a software flaw to reach into a BeyondTrust cloud account and pull out an API key. Not a password. Not a phishing victim's login. A string of characters that a piece of software used to talk to another piece of software. With that one key, the attacker walked straight into the U.S. Department of the Treasury, reset internal passwords, accessed workstations inside the Office of Foreign Assets Control, and read unclassified documents before anyone noticed. The Treasury disclosed it to Congress on December 30. The Department of Justice indicted the alleged operators in March 2025. If you've never worked in security, here's the plain-English version of what happened: somewhere inside the machinery that runs modern software, there's almost always a "key" — a credential one computer program shows another to prove it's allowed to be there. Humans log in with passwords and, increasingly, a second factor on their phone. Software mostly doesn't. It just holds a key, often for months or years at a time, and whoever holds that key gets treated as trustworthy, no questions asked. The Treasury breach happened because one of those keys ended up in the wrong hands and nothing else stood between that key and a federal agency's internal documents. Two months later, a different flavor of the same problem produced the largest theft of digital assets in history. $1.5 Billion, One Developer's Laptop In February 2025, the cryptocurrency exchange Bybit lost approximately $1.5 billion in Ethereum in a single operation. 
Palo Alto Networks' Unit 42 threat research team later tied the attack to Slow Pisces, a North Korean state-linked group also known as Lazarus or TraderTraitor, and traced the entry point back to a developer at a third-party vendor that managed Bybit's multi-signature wallet infrastructure. The attackers didn't break Ethereum's cryptography. They stole that developer's AWS session tokens — another form of machine credential — and used them to gain administrative access to cloud infrastructure that could authorize transactions, then quietly altered what a routine-looking transaction actually did before it executed. Unit 42 then found the same pattern at a second cryptocurrency exchange later in 2025, this time running through Kubernetes, the orchestration system that now runs much of the cloud-native world. The attackers phished a developer, used the access on the developer's machine to drop a malicious workload directly into the exchange's production Kubernetes cluster, and had that workload expose its own service account token — a credential Kubernetes automatically hands to every running pod so it can talk to the cluster's control plane. The stolen token happened to belong to a CI/CD management identity with sweeping permissions. From there, the intruders queried secrets across namespaces, planted a backdoor, and pivoted into the exchange's cloud-hosted backend, reaching the financial systems behind it. Unit 42's broader research found suspicious activity consistent with service-account-token theft in 22 percent of cloud environments analyzed in 2025, and recorded a 282 percent year-over-year jump in Kubernetes-directed attacks overall. Different industries, different attackers, same root cause: a non-human credential that was both long-lived and broader in scope than the task in front of it ever needed. Why This Keeps Happening Identity and access management, as a discipline, was built for people. 
People have managers, onboarding dates, performance reviews, and an HR system that flags them the day they leave. A workload has none of that. A microservice can spin up, do its job, and disappear thousands of times a day; a service account, by contrast, often gets created once and never revisited again. CyberArk's research has been blunt about the resulting imbalance: machine identities now outnumber human ones by more than 80 to 1 in the average enterprise, and the security architecture protecting most of them still assumes the old, human-shaped world — an org chart, not a fleet of ephemeral containers. That mismatch is exactly why static secrets sprawl the way they do. A developer hardcodes a key during a deadline crunch, intending to externalize it "later." A Terraform state file ends up holding plaintext cloud credentials because nobody flagged it in review. A default Kubernetes service account token, more permissive than anyone realized, gets mounted into a pod by default because turning that off requires deliberate configuration most teams never get around to. None of these are exotic mistakes. They're the ordinary residue of moving fast, and they accumulate the way unpaid debt does — quietly, until the day someone calls it in. The structural fix has a name by now, even if adoption is uneven: frameworks like SPIFFE and its production runtime SPIRE replace the static key with a short-lived, cryptographically attested identity — something closer to a backstage pass that's reissued before every single show rather than a master key cut once and handed out forever. A workload proves what it actually is — which Kubernetes service account launched it, which container image it's running — and receives an identity document valid for minutes, not months. Steal that, and an attacker is racing a clock that resets automatically rather than one that only resets when a human notices something is wrong. 
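The difference between a standing key and a short-lived identity can be shown with a small sketch. This is not how SPIFFE or SPIRE work internally (they issue X.509 or JWT identity documents after attesting the workload); it is a standard-library illustration of the one property discussed above, expiry measured in minutes. `issue_token`, `verify_token`, the `SIGNING_KEY`, and the workload name are all hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

SIGNING_KEY = b"demo-only-key"   # assumption: a real issuer would keep this in an HSM or KMS


def issue_token(workload: str, ttl_seconds: int, now: Optional[float] = None) -> str:
    """Issue a signed identity document that expires ttl_seconds after issuance."""
    issued_at = time.time() if now is None else now
    claims = {"sub": workload, "exp": issued_at + ttl_seconds}
    body = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    signature = hmac.new(SIGNING_KEY, body, hashlib.sha256).hexdigest()
    return f"{body.decode()}.{signature}"


def verify_token(token: str, now: Optional[float] = None) -> Optional[str]:
    """Return the workload name if the signature is valid and the token is unexpired."""
    try:
        body, signature = token.rsplit(".", 1)
    except ValueError:
        return None                                   # no signature part at all
    expected = hmac.new(SIGNING_KEY, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return None                                   # tampered or forged
    claims = json.loads(base64.urlsafe_b64decode(body))
    current = time.time() if now is None else now
    return claims["sub"] if current < claims["exp"] else None


# Timestamps are injected so the example is deterministic.
token = issue_token("ns/prod/sa/payments", ttl_seconds=300, now=1_000.0)
fresh = verify_token(token, now=1_100.0)          # used inside the 5-minute window
stolen_later = verify_token(token, now=1_400.0)   # replayed after expiry
tampered = verify_token("x" + token, now=1_100.0)
```

A stolen copy of this token is worthless five minutes after issuance, which is the whole argument against credentials that live for months.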
Cloud providers offer narrower versions of the same idea for their own platforms (AWS's IAM Roles for Service Accounts, Google's Workload Identity Federation), letting a workload trade a short-lived token for cloud access instead of carrying a standing key in the first place.

But identity alone doesn't close the loop, and this is the part most "zero trust" conversations skip past. None of it matters if nothing in your pipeline actually enforces it.

Security By Design Is a Promise. CI/CD Is Where You Find Out If It's Kept.

Plenty of organizations will tell you, with complete sincerity, that they practice "security by design." Most of them mean it stopped at an architecture review months before the first line of code shipped. That's not a fix, it's a memory of one. Code that deploys daily, sometimes hourly, doesn't wait for an annual audit to catch a misconfigured token or an over-privileged service account, and by the time a quarterly review would have caught the BeyondTrust-style key or the Bybit-style session token, the damage in both real cases was already done. The only version of "security by design" that survives contact with a real production pipeline is the one written as code and enforced automatically, at every stage, by something that can actually say no.

Picture the pipeline this way:

Plain Text
Developer commits code
        |
        v
CI build triggers
        |
        +--> SAST (code flaws) + SCA (dependency CVEs) + secrets scan
        |         |
        |       fail? -----> build blocked, developer notified
        |
        | pass
        v
Generate SBOM + sign artifact (Cosign) + build provenance (SLSA)
        |
        v
Policy-as-code gate (OPA / Kyverno)
        |
        +--> checks: image from approved registry? running as non-root?
        |            signature valid? provenance matches expected builder?
        |            service account scoped to least privilege?
        |         |
        |       fail? -----> deployment rejected, logged, alert raised
        |
        | pass
        v
Deploy to production
        |
        v
Runtime monitoring + short-lived workload identity (SPIFFE/SPIRE, IRSA)
        |
        v
Continuous re-verification: nothing trusted indefinitely

Every box in that chain is a place where the Treasury breach or the Bybit breach could have stopped instead of escalating. A policy-as-code rule using Open Policy Agent's Rego language, or Kyverno's Kubernetes-native YAML equivalent, can flatly refuse to schedule a pod requesting broader RBAC permissions than its declared task needs, which would have directly undercut the over-privileged CI/CD identity that the crypto-exchange attackers rode into the cluster. A signing and attestation step using Cosign, tied to SLSA provenance, means a deployed artifact has to prove which build system actually produced it before it runs at all, closing exactly the kind of trust gap that let a single compromised AWS asset cascade into a stolen infrastructure API key at BeyondTrust. None of this is theoretical tooling. Red Hat's own Enterprise Contract documentation describes signing as tying an image to a specific builder identity precisely so an attacker can't substitute a malicious binary without the signature itself breaking and announcing the tampering.

The Uncomfortable Bottom Line

I don't think either of this year's headline breaches happened because anyone involved was careless in some obvious, fireable way. They happened because the credential (not the firewall, not the encryption, not the cleverness of the malware) was the actual asset under attack the entire time, and almost nothing downstream of "the key worked" was built to ask a second question. Gartner named non-human identity management a top strategic security trend for exactly this reason in 2025, and OWASP followed with a dedicated Non-Human Identity Top 10 the same year, an overdue acknowledgment that the tooling built for human logins was never going to be enough.
My honest prediction, watching this pattern repeat across a federal agency and two of the largest crypto exchanges on earth within twelve months of each other: the organizations that treat policy-as-code enforcement and short-lived machine identity as default infrastructure — not optional hardening bolted on after an incident — are the ones that won't end up writing the next version of this story. Everyone else is currently running on borrowed time, secured by a key that, statistically, is already older than it should be.
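For readers who want to see what the policy-as-code gate from the pipeline above boils down to, here is a minimal sketch of the same checks in plain Python. Real gates are written in Rego for OPA or as Kyverno YAML policies and run inside the cluster's admission path; this is only an illustration of the decision logic. `DeploymentRequest`, its fields, the registry prefix, and the builder ID are hypothetical.

```python
from dataclasses import dataclass, field

APPROVED_REGISTRIES = ("registry.internal.example/",)   # hypothetical registry prefix


@dataclass
class DeploymentRequest:
    image: str
    run_as_non_root: bool
    signature_verified: bool     # assumed to be set by an earlier signature-check step
    builder_id: str              # builder identity taken from the provenance attestation
    requested_permissions: set = field(default_factory=set)
    declared_permissions: set = field(default_factory=set)


def evaluate(request: DeploymentRequest, expected_builder: str) -> list:
    """Return the list of policy violations; an empty list means the deploy may proceed."""
    violations = []
    if not request.image.startswith(APPROVED_REGISTRIES):
        violations.append("image is not from an approved registry")
    if not request.run_as_non_root:
        violations.append("container must run as non-root")
    if not request.signature_verified:
        violations.append("artifact signature is missing or invalid")
    if request.builder_id != expected_builder:
        violations.append("provenance does not match the expected builder")
    excess = request.requested_permissions - request.declared_permissions
    if excess:
        violations.append(
            "service account requests undeclared permissions: " + ", ".join(sorted(excess))
        )
    return violations


good = DeploymentRequest(
    image="registry.internal.example/payments:1.4",
    run_as_non_root=True,
    signature_verified=True,
    builder_id="ci-main",
    requested_permissions={"configmaps:get"},
    declared_permissions={"configmaps:get"},
)
risky = DeploymentRequest(
    image="docker.io/someone/payments:latest",
    run_as_non_root=True,
    signature_verified=True,
    builder_id="ci-main",
    requested_permissions={"configmaps:get", "secrets:list"},
    declared_permissions={"configmaps:get"},
)

good_result = evaluate(good, expected_builder="ci-main")
risky_result = evaluate(risky, expected_builder="ci-main")
```

The permissions check is the one that maps to the over-privileged CI/CD identity in the exchange incident: a workload asking for `secrets:list` when its declared task only needs `configmaps:get` is rejected before it is ever scheduled.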

By Igboanugo David Ugochukwu


Why AI-Generated Code Is Making Regression Testing More Important, Not Less

There is a widespread assumption circulating in engineering teams right now that goes something like this: if AI can write code faster, it probably makes testing less of a bottleneck too. The logic seems reasonable on the surface. Faster code, faster tests, faster everything. This assumption is wrong, and teams that act on it are going to find out the hard way. AI-generated code does not reduce the need for regression testing. It amplifies it. And the teams that understand this early will have a significant quality advantage over those that do not. The Fundamental Misunderstanding When developers use AI coding assistants to generate functions, services, or entire modules, they are not producing code that has been verified against the real behavior of their system. They are producing code that is syntactically correct and structurally plausible, written by a model that has no knowledge of how their specific application actually runs in production. This is a critically important distinction. A human developer who has worked on a codebase for months carries implicit knowledge about which edge cases matter, which downstream services are flaky, and which data patterns appear in production that were never anticipated in the original requirements. An AI model has none of this context. It produces code that looks right and often is right for the happy path, but it has no way of knowing what the code needs to handle in the real world. The result is a class of defects that regression testing is uniquely positioned to catch: behaviors that work in isolation but break in the context of the full system. The Velocity Trap Here is where teams get into trouble. AI coding tools are genuinely fast. Developers using them can produce working code at a rate that was not possible before, and the productivity gains are real. But velocity without verification is just a faster path to production failures. The pattern plays out predictably. 
A team adopts AI coding assistance, development speed increases, the engineering leadership is happy, and everyone agrees to keep moving fast. What nobody adjusts is the regression testing strategy. The test suite that was sized for the previous pace of development is now covering a larger surface area of code, generated at higher volume, by a process that has no awareness of production context. Coverage gaps compound quietly. Nobody sees them until something breaks in production in a way that takes two days to trace back to a function that an AI wrote last sprint and nobody fully read. What AI-Generated Code Actually Gets Wrong The failures that emerge from inadequate regression coverage of AI-generated code tend to cluster in specific areas. Integration points are the most common failure zone. AI generates code based on interfaces and contracts. It looks at API signatures, function definitions, and data schemas. What it cannot see is how those contracts actually behave when real traffic flows through them. Consider a realistic scenario: an AI-generated service calls a downstream payment processor using the documented API specification. The code is technically correct. But the payment processor returns a slightly different response shape when a transaction is declined due to insufficient funds versus when it is declined due to a card expiry. The specification documents neither distinction. The AI has no way to know they exist. A regression suite built from real production traffic would catch this within the first test run. A regression suite built from the same specification the AI used to write the code will not catch it until a customer sees a wrong error message in production. Mock drift compounds the problem. When tests for AI-generated code are written using mocked dependencies, those mocks represent what the developer or AI thought the dependency would do. Over time, the real dependency changes and the mocks do not. 
Tests keep passing, the real behavior keeps drifting, and the regression suite provides false confidence rather than real coverage. AI-generated code optimizes for the stated requirement. It handles the case described in the prompt competently. It does not handle the cases that were not in the prompt: the empty array that should return a specific error, the timestamp that crosses a timezone boundary, the concurrent request that triggers a race condition. These are edge cases that only emerge from real usage patterns, and they are precisely what a regression suite built from real traffic catches where tests written from requirements do not. The Regression Testing Response Understanding these failure modes points directly to what needs to change in regression testing strategy when AI-generated code becomes part of the development process. Test generation needs to be grounded in real behavior, not assumed behavior. The traditional model of writing tests based on requirements becomes increasingly insufficient when the code being tested was generated by a model that had access only to those same requirements. The regression suite ends up testing exactly what the AI thought the code should do. Tests need to be grounded in what the system actually does when real requests flow through it. Integration test coverage becomes more important than unit test coverage. AI-generated code can usually pass unit tests because it generates syntactically correct implementations of isolated functions. The failures emerge at integration points. Regression testing that focuses on the integration layer, verifying that services interact correctly under realistic conditions, catches the class of failures that AI-generated code is most likely to introduce. Regression coverage should update continuously rather than incrementally. The pace of development with AI assistance creates a situation where code is being added to the codebase faster than manual test authoring can keep up. 
If the regression suite is maintained manually, it will always be behind. Coverage needs to grow with the codebase automatically, derived from real usage rather than added by developers who are already stretched by higher output demands. Production behavior should feed back into test validation. Closing the loop between how the system behaves in production and what the regression suite is testing is one of the most important shifts a team can make. When tests are derived from actual production traffic rather than written specifications, the mock drift problem largely disappears because the tests reflect what services actually do, not what developers assumed they would do. The Counter-Intuitive Conclusion There is a temptation to see AI-generated code and automated testing as solving the same problem from different angles. If AI can generate both the code and the tests, the reasoning goes, maybe the coverage problem solves itself. It does not. An AI that generates code and then generates tests for that code is essentially testing its own assumptions about how the code should behave. It will consistently produce tests that pass against the code it wrote, and those tests will systematically miss the gap between what the AI thought the code should do and what the system actually needs to do under production conditions. The gap between AI intent and production reality is exactly where regression testing has always been most valuable. AI-generated code makes that gap wider, not narrower, because the code is being written by something with no production experience at all. The teams that treat AI coding assistance as a reason to invest less in regression testing will eventually face production incidents that trace directly to this decision. 
The teams that treat it as a reason to invest more, particularly in coverage grounded in real system behavior rather than written specifications, will find that AI assistance genuinely accelerates development without accumulating the hidden quality debt that comes with uncovered integration failures. The Bottom Line Regression testing was never just a safety net. It is the mechanism by which a team validates that their understanding of the system matches how the system actually behaves. When AI is generating the code, that validation matters more than ever, because the code is now written by something that has never seen your system run. Invest accordingly.

By Sancharini Panda

Can Rust Have Zero-Cost Dependency Injection?

Overview This article explores whether dependency injection (DI) can exist in Rust without sacrificing the language’s core philosophy of zero-cost abstractions. We will approach the question from three angles: Why dependency injection still matters in Rust, even for systems built with zero-sized types and compile-time guarantees.How DI evolved in other ecosystems, using Java as a reference point.A practical Rust-oriented approach to implementing DI with compile-time guarantees. We’ll also show how Rust traits enable DI patterns that scale across crates, preserving zero-cost guarantees. All Rust source code used in this article is available in this repository. Rust DI: The Problem Rust Hasn’t Solved Yet Rust has solved problems most languages haven’t even dared to touch: memory safety without a garbage collector, fearless concurrency, and powerful zero-cost abstractions. But there is a class of problems Rust hasn’t fully confronted yet. Not because Rust is incapable — but because these problems exist above the machine level. They are not about memory safety or performance. They are about composition, modularity, and architectural correctness in large systems. Managing dependencies between dozens or hundreds of components is fundamentally different from managing memory or threads. Rust gives us powerful primitives, but the question remains: How do we scale composition safely and maintainably? What “Enterprise” Really Means in Rust Terms When Rust developers hear enterprise, they often think slow, over-engineered, and bloated. But that perception is misleading. Enterprise systems are not bloated by accident. They are complex because composition eventually stops being trivial. The complexity comes from business requirements, not from the technology stack. 
Enterprise: The Burden We Can’t Avoid
When a company reaches a certain scale, several things inevitably happen:
- Products serve thousands or millions of users
- Systems integrate with vendors, partners, and third-party services
- Teams work independently on modules and features
- Software must evolve continuously without stopping the business
These realities create architectural pressure. From a technical perspective, systems must support:
- Scalability: At multiple levels, both in users and data (from hundreds to millions, or even billions, of concurrent users) and in the number of functional modules interacting across teams.
- Reliability: Systems run 24/7. Because they depend on vendors, partners, and third-party services, failures are inevitable, and the system must continue operating despite them.
- Modularity: Independent teams need to work on isolated components without breaking other parts of the system.
- Flexibility: Infrastructure choices may change. Databases, messaging systems, or integrations might need to be swapped without rewriting the entire application.
- Observability: Teams must be able to detect and respond quickly to performance bottlenecks, integration failures, and unexpected behavior.
- Extensibility: New products, markets, and regulations require systems to evolve incrementally rather than being rebuilt from scratch.
- Maintainability: Every business decision introduces new dependencies, and every dependency increases the complexity of the system’s composition. The system must not become so convoluted that small changes introduce cascading errors.
Even with Rust’s ownership model and strong type system, manually managing this dependency graph eventually becomes impractical. These pressures are not theoretical: they define the daily reality of enterprise software engineering. Every design decision must balance immediate business needs with long-term sustainability, especially under high concurrent load.
Where Dependency Injection Becomes Relevant This is exactly where dependency injection becomes useful. DI allows systems to manage complexity by separating what components need from how those dependencies are created and connected. In practice, this means: Components declare their dependencies without constructing them directlyDependencies are provided externally, keeping components isolatedSystems evolve gradually without breaking existing modulesOptional features and plugins can be integrated without tightly coupling the system DI is not just a convenience. It is a structured approach to handling inevitable architectural complexity. Enterprise Isn’t Just Complexity — It’s Heterogeneity Large systems are rarely uniform. They typically contain: Independent components with their own dependency treesStateful infrastructure such as databases, caches, and message brokersOptional features and plugin-style modulesMultiple implementations of the same interface This heterogeneity appears naturally over time. Systems accumulate tools built years apart, libraries maintained by different teams, and components that survive long after their original authors have moved on. Enterprise systems grow gradually, and they rarely get the chance to start over. Rust does not eliminate these pressures. Any real system eventually faces them. Java’s Historical Perspective: DI Was Inevitable Java did not adopt dependency injection because it was fashionable. It adopted DI because large systems were becoming impossible to manage without it. Without DI, developers quickly ran into familiar problems: Tight coupling between componentsFragile initialization orderHard-coded dependencies scattered across the codebaseChanges in one module unexpectedly breaking another Dependency injection emerged as a discipline for managing complexity. Components declare what they depend on, and the system provides those dependencies when constructing the application. 
This separation allows systems to evolve without collapsing under their own architecture. DI in a Nutshell You can think of dependency injection as a kind of runtime composition system. If your application contains many services, modules, plugins, or optional components, something must assemble them and ensure they are wired correctly, and that role belongs to the DI system. DI is conceptually similar to package managers such as Cargo or Maven, but it operates at a different level: Package managers resolve dependencies between libraries at build time.Dependency injection resolves dependencies between components at runtime. Loading executable code into memory is easy — the operating system handles that. What is harder is creating objects, initializing them correctly, and ensuring that all components interact with the right dependencies. This becomes increasingly difficult as systems grow. Dependency injection addresses this problem directly. How Dependency Injection Is Typically Solved in Java Java provides one of the most mature ecosystems for dependency injection. Frameworks such as Spring or Guice automate object creation and dependency wiring almost entirely. Let’s revisit the same example from the previous section: a simple User Management API. We have two controllers: ReadController — retrieves users from a databaseWriteController — creates users and publishes events to a message broker Both controllers depend on infrastructure services that must be created and wired correctly. Without Dependency Injection In a traditional manual setup, object creation and wiring might look like this: public class Application { public static void main(String[] args) { Database database = new PostgresDatabase(); MessageBroker broker = new KafkaBroker(); ReadController readController = new ReadController(database); WriteController writeController = new WriteController(database, broker); // start application } } At first glance, this appears manageable. 
But as the application grows, the initialization code expands rapidly: Multiple infrastructure servicesOptional modulesConfiguration logicConditional wiring depending on the environment The main method eventually becomes responsible for constructing the entire dependency graph of the application. This approach becomes difficult to maintain and extremely fragile as the system evolves. Dependency Injection With Spring Dependency injection frameworks solve this by moving the responsibility of object creation and wiring to a container. Components simply declare what they need. @Service public class Database { } @Service public class KafkaBroker implements MessageBroker { } @RestController public class ReadController { private final Database database; @Autowired public ReadController(Database database) { this.database = database; } } Dependencies are declared in constructors, and the DI container automatically provides the correct instances. The application no longer manually constructs the object graph. Instead, the framework scans components and resolves dependencies automatically. Polymorphism in Java DI Java DI frameworks also support multiple implementations of the same interface. For example, an application may support several message brokers simultaneously: @Service public class KafkaBroker implements MessageBroker { } @Service public class RabbitBroker implements MessageBroker { } A controller can receive all implementations at once: @RestController public class WriteController { private final List<MessageBroker> brokers; @Autowired public WriteController(List<MessageBroker> brokers) { this.brokers = brokers; } } The DI container automatically collects all implementations of MessageBroker and injects them into the controller. 
This makes the system highly extensible: New brokers can be addedExisting ones can be removedThe controller remains unchanged The Cost of Traditional DI Java DI frameworks provide powerful capabilities, but they come with trade-offs: Dependency resolution happens at runtimeReflection is heavily usedErrors may only appear during application startupDependency graphs are not always fully visible to the compiler This runtime flexibility works well for the Java ecosystem, but it introduces overhead and reduces compile-time guarantees. Rust, on the other hand, encourages a different philosophy: If something can be verified at compile time, it should be. This raises an interesting question: Can Rust achieve the same flexibility of dependency injection while preserving compile-time guarantees and zero runtime cost? Journey into Rust Coding Let’s try to build a dependency injection approach in Rust gradually. We will follow the same conceptual example used in the Java section: A ReadControllerA WriteControllerMultiple implementations of a MessageBrokerAn abstraction for database connectivity Rust Without Dependency Injection In the first example, we will implement a small Rust application without dependency injection. However, we will introduce use-traits, which will later allow us to transition naturally to a dependency injection model. 1. Defining Database Interfaces First, let’s define the interface used to access the database. 1.1 DatabaseConnection Trait This trait represents an abstraction for database connectivity that can support multiple implementations (Postgres, MySQL, etc.). trait DatabaseConnection { fn read_query(&self, query: &str); fn write_query(&self, query: &str); } 1.2 UseDatabaseConnection Trait Next, we define a trait that allows components to request a database connection from a context. 
trait UseDatabaseConnection { type T: DatabaseConnection; fn database_connection(&self) -> &Self::T; } This trait will later be used as the foundation of dependency resolution. Instead of components knowing the entire application context, they simply declare that they require a DatabaseConnection. This keeps components decoupled from the full application structure. 2. Database Implementation Now we provide a concrete implementation of DatabaseConnection. #[derive(Default)] struct PostgresDatabaseConnection {} impl DatabaseConnection for PostgresDatabaseConnection { fn read_query(&self, query: &str) { println!("Reading from Postgres DB: {}", query) } fn write_query(&self, query: &str) { println!("Writing into Postgres DB: {}", query) } } For simplicity, this example only prints messages instead of connecting to a real database. In a real system, this could be implemented using any production database library. 3. Controllers Now we define the controllers responsible for performing application logic. 3.1 Controller Structs #[derive(Default)] struct ReadController {} #[derive(Default)] struct WriteController {} Rust allows structs with no fields. These zero-sized types have no runtime cost, but they still represent concrete types at compile time and can participate in abstractions. 3.2 Controller Use Traits Next, we define traits that expose controllers to other components. trait UseReadController { fn read_controller(&self) -> &ReadController; } trait UseWriteController { fn write_controller(&self) -> &WriteController; } These traits allow components to access controllers without knowing anything about the application context. 3.3 Controller Context Now we combine the previously defined traits into a context trait. trait ControllerContext: UseDatabaseConnection + UseReadController + UseWriteController {} This context describes the minimal environment required for controllers to function. 
Controllers will depend only on this trait instead of the full application context. 3.4 Controller Implementation Now we implement the controller logic. impl ReadController { fn do_something<C: ControllerContext>(&self, ctx: &C, argument: &str) { ctx.database_connection() .read_query(format!("SELECT * FROM table WHERE id = '{}'", argument).as_str()); } } impl WriteController { fn do_something<C: ControllerContext>(&self, ctx: &C, argument: &str) { ctx.database_connection().write_query( format!("UPDATE table SET value = 'new' WHERE id = '{}'", argument).as_str(), ); } } Notice something important here: The controllers do not know about the full application context. They only know about the traits they depend on. This means the controller and database code could already be extracted into separate crates, reusable by any application implementing the required use-traits. 4. Wiring the Application Now we wire all components together. 4.1 Application Context We define a struct that holds all application components. #[derive(Default)] struct ApplicationContext { read_controller: ReadController, write_controller: WriteController, postgres_database_connection: PostgresDatabaseConnection, } This struct acts as the composition root of the application. 4.2 Implement Use Traits Next we implement the previously defined traits. impl UseReadController for ApplicationContext { fn read_controller(&self) -> &ReadController { &self.read_controller } } impl UseWriteController for ApplicationContext { fn write_controller(&self) -> &WriteController { &self.write_controller } } impl UseDatabaseConnection for ApplicationContext { type T = PostgresDatabaseConnection; fn database_connection(&self) -> &Self::T { &self.postgres_database_connection } } By implementing these traits, ApplicationContext becomes capable of providing dependencies to components. 
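One consequence of implementing use-traits on the context is that a second context can supply a different implementation without touching the components. The following is a minimal sketch under stated assumptions, not code from the article’s repository: RecordingConnection, TestContext, and load_user are hypothetical names introduced only for illustration, while the two traits repeat the definitions given earlier.

```rust
use std::cell::RefCell;

// The two traits as defined earlier in the article.
trait DatabaseConnection {
    fn read_query(&self, query: &str);
    fn write_query(&self, query: &str);
}

trait UseDatabaseConnection {
    type T: DatabaseConnection;
    fn database_connection(&self) -> &Self::T;
}

// Hypothetical test double: records queries instead of printing them.
#[derive(Default)]
struct RecordingConnection {
    log: RefCell<Vec<String>>,
}

impl DatabaseConnection for RecordingConnection {
    fn read_query(&self, query: &str) {
        self.log.borrow_mut().push(format!("read: {}", query));
    }
    fn write_query(&self, query: &str) {
        self.log.borrow_mut().push(format!("write: {}", query));
    }
}

// Hypothetical second context that wires in the recording connection.
#[derive(Default)]
struct TestContext {
    connection: RecordingConnection,
}

impl UseDatabaseConnection for TestContext {
    type T = RecordingConnection;
    fn database_connection(&self) -> &Self::T {
        &self.connection
    }
}

// Hypothetical component written only against the use-trait;
// it never names a concrete context or connection type.
fn load_user<C: UseDatabaseConnection>(ctx: &C, id: &str) {
    ctx.database_connection()
        .read_query(&format!("SELECT * FROM table WHERE id = '{}'", id));
}

fn main() {
    let ctx = TestContext::default();
    load_user(&ctx, "42");
    // Inspect what the component asked the database to do.
    println!("{:?}", ctx.connection.log.borrow());
}
```

The RefCell keeps the sketch short but makes TestContext usable from one thread only; a Mutex would be the likely substitute if the context had to be shared across threads.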
4.3 Controller Context Implementation impl ControllerContext for ApplicationContext {} Since ApplicationContext already implements the required traits, it automatically satisfies ControllerContext. 5. Running the Application Finally we run the application. pub fn run() { let ctx = ApplicationContext::default(); ctx.read_controller().do_something(&ctx, "argument"); ctx.write_controller().do_something(&ctx, "argument"); } Key characteristics of this approach: No dyn traitsNo Arc or RcNo runtime dependency container All wiring is resolved at compile time through generics and monomorphization. Multi-Threading An attentive reader may ask: Will this approach work in multi-threaded environments? In Rust, thread safety is typically ensured using the Send and Sync traits. These traits are automatically implemented by the compiler if all fields of a struct are also Send + Sync. We can verify thread safety with a compile-time assertion: const _: () = { const fn assert_send_sync<T: Send + Sync>() {} assert_send_sync::<ApplicationContext>(); }; If this compiles, the entire application context can safely be shared between threads. In real systems, some components (such as database connections) may not be inherently thread-safe. In such cases, a connection pool or synchronization mechanisms such as Mutex are required. This limitation is not related to the dependency injection approach itself, but rather to shared resource management in concurrent systems. What the Compiler Actually Generates If we inspect the compiled output with: cargo asm rust_di_example::main ... 
lea rbx, [rsp + 32]
mov rdx, rbx
call qword ptr [rip + _ZN5alloc3fmt6format12format_inner17he42ed4cf3cdc276bE@GOTPCREL]
movups xmm0, xmmword ptr [rsp]
mov rax, qword ptr [rsp + 16]
movups xmmword ptr [rsp + 48], xmm0
mov qword ptr [rsp + 64], rax
mov qword ptr [rsp + 32], r14
mov qword ptr [rsp + 40], 18
mov qword ptr [rsp], rbx
mov qword ptr [rsp + 8], r13
lea rdi, [rip + .Lanon.63c02f0152e6743e61fdeaf76f1d4051.26]
mov rsi, rsp
call qword ptr [rip + _ZN3std2io5stdio6_print17hba8f5eda1e4e495eE@GOTPCREL]
lea rax, [rip + .Lanon.63c02f0152e6743e61fdeaf76f1d4051.27]
mov qword ptr [rsp + 32], rax
mov qword ptr [rsp + 40], 19
mov qword ptr [rsp], rbx
mov qword ptr [rsp + 8], r13
lea r14, [rsp + 48]
mov qword ptr [rsp + 16], r14
lea r15, [rip + _ZN60_$LT$alloc..string..String$u20$as$u20$core..fmt..Display$GT$3fmt17h9d11f1d81b352ac8E]
mov qword ptr [rsp + 24], r15
lea rdi, [rip + .Lanon.63c02f0152e6743e61fdeaf76f1d4051.7]
mov rsi, rsp
call qword ptr [rip + _ZN3std2io5stdio6_print17hba8f5eda1e4e495eE@GOTPCREL]
lea rax, [rip + .Lanon.63c02f0152e6743e61fdeaf76f1d4051.28]
mov qword ptr [rsp + 32], rax
mov qword ptr [rsp + 40], 21
mov qword ptr [rsp], rbx
mov qword ptr [rsp + 8], r13
mov qword ptr [rsp + 16], r14
mov qword ptr [rsp + 24], r15
lea rdi, [rip + .Lanon.63c02f0152e6743e61fdeaf76f1d4051.8]
mov rsi, rsp
call qword ptr [rip + _ZN3std2io5stdio6_print17hba8f5eda1e4e495eE@GOTPCREL]
...
The assembly is extremely flat: a straight-line sequence of calls to _ZN3std2io5stdio6_print17hba8f5eda1e4e495eE@GOTPCREL, which is simply the print routine of the Rust standard library. There are no runtime dependency resolution mechanisms, no dynamic dispatch, and no container logic.
The generated code mostly contains calls to standard library functions such as printing. For this example, that shows the abstractions introduced here add no runtime overhead.
Why Use-Traits Matter
At first glance, a use-trait might look like unnecessary indirection. Why not simply pass ApplicationContext directly to every component? The reason is crate-level decoupling. Enterprise applications often grow into multiple crates. Controllers, database access layers, messaging integrations, and domain logic are very often implemented as reusable libraries. For example, a Spring Boot Actuator-style module may include every layer down to the database, provide REST API endpoints, and integrate with a monitoring aggregator service; it acts as a standalone sub-program. However, if a component depends directly on ApplicationContext, it becomes tied to the executable crate that defines it. That creates an architectural problem:
- Libraries would depend on the application crate
- The application crate would depend on the libraries
Cargo does not allow such a cycle between crates, so this circular dependency makes reuse impossible. Use-traits solve this by defining capability-based interfaces. Instead of depending on the application context, components depend only on the capabilities they require. Example:
trait UseDatabaseConnection {
    type T: DatabaseConnection;
    fn database_connection(&self) -> &Self::T;
}
A controller does not know anything about the application structure. It simply requires that the context provides access to a database connection.
impl ReadController {
    fn do_something<C: ControllerContext>(&self, ctx: &C, argument: &str) {
        ctx.database_connection()
            .read_query(format!("SELECT * FROM table WHERE id = '{}'", argument).as_str());
    }
}
Because of this design:
- ReadController can live in its own crate
- The crate only exports traits describing the capabilities it needs
- Any application can use the controller by implementing those traits
The application context becomes an adapter, wiring together independent components.
Application ├── implements UseDatabaseConnection ├── implements UseReadController └── implements UseWriteController This pattern enables a powerful architectural property: Components become fully reusable libraries, while the application remains responsible only for wiring them together. In other words, use-traits allow dependency injection to cross crate boundaries while preserving Rust’s compile-time guarantees. Without this indirection, the system collapses into a monolithic application context that cannot be decomposed into reusable modules. Limitations of This Approach Although this example demonstrates many useful properties, it is not yet a complete dependency injection system. The main limitation is that ApplicationContext still has too much knowledge about component internals. In real DI frameworks, modules often contain many components, initialization logic, and internal dependencies. For example, consider a Spring Boot module such as Spring Data. When you add the dependency to your project, it automatically provides: Database driver integrationConnection poolingRepository interfacesTransaction managementEntity scanningMetrics integrationHealth check integration All of this functionality is assembled automatically by the DI framework. From the application developer’s perspective, only minimal configuration is required. Real dependency injection modules therefore consist of entire subgraphs of components, not just individual services. In our example we intentionally introduced two controllers to demonstrate that even a simple module may contain multiple cooperating components. A complete dependency injection framework must also manage: Module compositionInitialization lifecycleDependency resolutionOptional componentsMultiple implementations This is where the real challenge begins. Rust With Dependency Injection To implement dependency injection in Rust, we will build iteratively. 
We start from the previous “no DI” approach and gradually close the gap toward a complete DI system. The good news is that we already have use-traits, and our components are decoupled. We can extract certain code into reusable modules. What’s missing for a true dependency injection system: ApplicationContext still has too much knowledge about the components it uses.Some wiring and initialization steps are still manual. Our goal is to move the wiring into DI modules, giving each component full control over how it is connected. Because we are still targeting compile-time injection, we cannot rely on runtime reflection (like Java DI frameworks do). Instead, we will push this logic into Rust macros, allowing compile-time wiring while preserving zero-cost abstractions. 1. Registering Components in ApplicationContext In traditional DI, the application knows which modules it depends on (like Spring Data). But modules themselves should control which components they export. In our previous example, ApplicationContext was a struct, and registering a component meant adding a field manually. This ties the application to module internals. We need a way to add fields to ApplicationContext automatically, without putting module-specific code into the executable. We can achieve this using the combine-structs crate, which provides macros to embed multiple structs into one. Each module defines an embeddable struct as a context extension. When imported, ApplicationContext automatically merges all fields from these extensions. 1.1 Context Extension for PostgreSQL #[allow(dead_code)] #[derive(Fields)] struct PostgresDatabaseContextExtension { postgres_database_connection: PostgresDatabaseConnection, } The Fields derive macro allows this struct to be merged into ApplicationContext. 
1.2 Context Extension for Controllers #[allow(dead_code)] #[derive(Fields)] struct ControllerContextExtension { read_controller: ReadController, write_controller: WriteController, } The controller module exports two controllers. More components can be added without touching the main executable. 1.3 Embedding Context Extensions #[combine_fields(PostgresDatabaseContextExtension, ControllerContextExtension)] #[derive(Default)] struct ApplicationContext {} The combine_fields macro merges all fields from the context extensions. ApplicationContext now has all components automatically wired. 2. Providing Use-Traits Previously, wiring was done via use-traits. Now that ApplicationContext doesn’t know which components exist, modules must export use-trait implementations via macros. 2.1 Macro for Database Connectivity macro_rules! inject_postgres_impl { () => { impl UseDatabaseConnection for ApplicationContext { type T = PostgresDatabaseConnection; fn database_connection(&self) -> &Self::T { &self.postgres_database_connection } } }; } 2.2 Macro for Controllers macro_rules! inject_controller_impl { () => { impl UseReadController for ApplicationContext { fn read_controller(&self) -> &ReadController { &self.read_controller } } impl UseWriteController for ApplicationContext { fn write_controller(&self) -> &WriteController { &self.write_controller } } impl ControllerContext for ApplicationContext {} }; } 2.3 Injecting Components #[combine_fields(PostgresDatabaseContextExtension, ControllerContextExtension)] #[derive(Default)] struct ApplicationContext {} inject_postgres_impl!(); inject_controller_impl!(); The executable only calls these macros. Components remain isolated from the main application, and the wiring happens automatically. 3. Intermediate Conclusion At this stage: No component code has been changed.Modules can add or remove components freely.Components are decoupled from each other and from the container.Wiring happens automatically through macros and use-traits. 
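The injection macros above hard-code ApplicationContext as the implementing type. As a variation (an assumption for illustration, not the article’s design), a module’s macro could take the context type as a parameter, so that one library macro can serve several executables. In this minimal sketch, WorkerContext and ping are hypothetical, DatabaseConnection is trimmed to a single method, and the context structs are written by hand rather than generated with combine-structs so that the code compiles without external crates.

```rust
trait DatabaseConnection {
    fn read_query(&self, query: &str);
}

trait UseDatabaseConnection {
    type T: DatabaseConnection;
    fn database_connection(&self) -> &Self::T;
}

#[derive(Default)]
struct PostgresDatabaseConnection {}

impl DatabaseConnection for PostgresDatabaseConnection {
    fn read_query(&self, query: &str) {
        println!("Reading from Postgres DB: {}", query)
    }
}

// Variation on the article's macro: the context type is a parameter.
// The context is still expected to have a field named
// `postgres_database_connection`.
macro_rules! inject_postgres_impl {
    ($ctx:ty) => {
        impl UseDatabaseConnection for $ctx {
            type T = PostgresDatabaseConnection;
            fn database_connection(&self) -> &Self::T {
                &self.postgres_database_connection
            }
        }
    };
}

#[derive(Default)]
struct ApplicationContext {
    postgres_database_connection: PostgresDatabaseConnection,
}

// Hypothetical second executable context reusing the same module macro.
#[derive(Default)]
struct WorkerContext {
    postgres_database_connection: PostgresDatabaseConnection,
}

inject_postgres_impl!(ApplicationContext);
inject_postgres_impl!(WorkerContext);

// Hypothetical component that works with any context providing a database.
fn ping<C: UseDatabaseConnection>(ctx: &C) {
    ctx.database_connection().read_query("SELECT 1");
}

fn main() {
    ping(&ApplicationContext::default());
    ping(&WorkerContext::default());
}
```

Both contexts contain only zero-sized fields, so they remain zero-sized themselves; the macro parameter changes which type receives the impl, not what code is generated for it.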
This gives us a bare-minimum dependency injection system: application components are decoupled, wiring is automatic, and no single component needs full knowledge of the application.
4. Limitations
Even though we now have a working DI mechanism, it isn’t fully production-ready:
- Initialization: Components may require setup before wiring.
- Lifecycle Management: Controlling initialization order, cleanup, or optional components can be challenging.
Next, we will explore a Rust DI framework capable of automating component initialization and lifecycle management, moving closer to a complete solution.
Dependency Injection and Initialization Cycle in Rust
So far, we have built a dependency injection (DI) container where all components are stored as fields in ApplicationContext. The next challenge is initializing these components. The goal is to:
- Enumerate the fields of ApplicationContext.
- Identify which fields require initialization.
- Call an initialization method for each such component.
Since we want the wiring to be resolved at compile time, we need a macro that generates a Rust method calling init() on every tagged component, without runtime loops or collections. I could not find an existing macro for this, so I implemented one myself. If you want the details, check the implementation here: di_macro/src/lib.rs. We will focus on how to use this macro, not how it works internally.
Macro Example: Enumerating Tagged Fields
Full example code: struct_enumerator.rs
1. Define a Struct with Tagged Fields
#[allow(dead_code)]
#[derive(Debug, FieldEnumerator, Default)]
pub struct MyStruct {
    #[tag(init_listener)]
    field_1: i32,
    #[tag(init_listener)]
    #[tag(start_listener)]
    field_1_2: i32,
    field_2: i32,
    #[tag(start_listener)]
    field_3: i32,
}
- FieldEnumerator is our custom derive macro.
- Fields can have one or more tags (init_listener, start_listener).
2. Define a Callback Macro
macro_rules!
my_callback { ($struct_name:ident, $field_name:ident, $listener_type:ident) => { println!( "struct = {}, field = {}, type = {}", stringify!($struct_name), stringify!($field_name), stringify!($listener_type), ) }; } For every tagged field, the callback macro is called at compile time.Arguments passed: struct_name, field_name, and listener_type. 3. Invoke the Field Enumerator pub fn run() { let my_struct = MyStruct::default(); println!("my_struct = {:?}", my_struct); enumerate_tags_MyStruct_init_listener!(my_callback); enumerate_tags_MyStruct_start_listener!(my_callback); } enumerate_tags_MyStruct_init_listener! and enumerate_tags_MyStruct_start_listener! are generated automatically by the FieldEnumerator macro.The macro expands into a flat sequence of println!() calls. Macro Example Output: // enumerate_tags_MyStruct_init_listener!(my_callback); // my_callback!(MyStruct, field_1, init_listener) println!("struct = {}, field = {}, type = {}", "MyStruct", "field_1", "init_listener") // my_callback!(MyStruct, field_1_2, init_listener) println!("struct = {}, field = {}, type = {}", "MyStruct", "field_1_2", "init_listener") //enumerate_tags_MyStruct_start_listener!(my_callback) // my_callback!(MyStruct, field_1_2, start_listener) println!("struct = {}, field = {}, type = {}", "MyStruct", "field_1_2", "start_listener") // my_callback!(MyStruct, field_3, start_listener) println!("struct = {}, field = {}, type = {}", "MyStruct", "field_3", "start_listener") Notice: No vectors, arrays, loops, or runtime collections — everything happens at compile time. Rust Dependency Injection with Initialization We can now use the same macro to enumerate all fields in ApplicationContext and initialize them. Code reference: di_init.rs We introduce a Configuration component to demonstrate how initialization can depend on runtime data. 1. 
Configuration Module #[derive(Default)] struct Configuration { run_arguments: &'static str, } #[allow(dead_code)] #[derive(Fields, Default)] struct ConfigurationContextExtension { configuration: Configuration, } trait UseConfiguration { fn configuration(&self) -> &Configuration; fn configuration_mut(&mut self) -> &mut Configuration; } macro_rules! inject_configuration_impl { () => { impl UseConfiguration for ApplicationContext { fn configuration(&self) -> &Configuration { &self.configuration } fn configuration_mut(&mut self) -> &mut Configuration { &mut self.configuration } } }; } Steps: Define the component struct (Configuration).Define a context extension for ApplicationContext.Define a use-trait (UseConfiguration) for wiring.Provide a macro to implement the trait on ApplicationContext. Note: Configuration is no longer zero-sized—it contains runtime data (run_arguments). 2. Database Connection Initialization 2.1 Update PostgresDatabaseConnection #[derive(Default)] struct PostgresDatabaseConnection { connection_string: String, } Now contains runtime data.Initialization depends on configuration. 2.2 Tag Component for Initialization #[allow(dead_code)] #[derive(Fields, ContextExtension)] struct PostgresDatabaseContextExtension { #[tag(init_listener)] postgres_database_connection: PostgresDatabaseConnection, } init_listener signals that the component requires initialization. 2.3 Define Initializable Trait trait Initializable<C> { fn init(ctx: &mut C); } Components implementing this trait can be initialized automatically. 2.4 Implement Initialization impl<C: UseConfiguration + UsePostgresDatabaseConnection> Initializable<C> for PostgresDatabaseConnection { fn init(ctx: &mut C) { println!("Init sequence = {}", ctx.configuration().run_arguments); ctx.postgres_database_connection_mut().connection_string = format!("Postgres DB on {}", ctx.configuration().run_arguments); } } Accesses ApplicationContext mutably for initialization of any of component. 
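To make the initialization step above concrete, here is a minimal, self-contained sketch. It assumes a hand-written AppContext in place of the macro-generated ApplicationContext, and a UsePostgresDatabaseConnection trait whose shape is guessed from the method names used in the article; init_demo is a hypothetical helper added only so the result can be inspected.

```rust
#[derive(Default)]
struct Configuration {
    run_arguments: &'static str,
}

#[derive(Default)]
struct PostgresDatabaseConnection {
    connection_string: String,
}

trait UseConfiguration {
    fn configuration(&self) -> &Configuration;
}

// Assumed shape: the article uses these accessors but does not show the trait.
trait UsePostgresDatabaseConnection {
    fn postgres_database_connection(&self) -> &PostgresDatabaseConnection;
    fn postgres_database_connection_mut(&mut self) -> &mut PostgresDatabaseConnection;
}

trait Initializable<C> {
    fn init(ctx: &mut C);
}

// The component is initialized from whatever context provides both use-traits.
impl<C: UseConfiguration + UsePostgresDatabaseConnection> Initializable<C>
    for PostgresDatabaseConnection
{
    fn init(ctx: &mut C) {
        let connection_string =
            format!("Postgres DB on {}", ctx.configuration().run_arguments);
        ctx.postgres_database_connection_mut().connection_string = connection_string;
    }
}

// Hand-written stand-in for the generated application context.
#[derive(Default)]
struct AppContext {
    configuration: Configuration,
    postgres_database_connection: PostgresDatabaseConnection,
}

impl UseConfiguration for AppContext {
    fn configuration(&self) -> &Configuration {
        &self.configuration
    }
}

impl UsePostgresDatabaseConnection for AppContext {
    fn postgres_database_connection(&self) -> &PostgresDatabaseConnection {
        &self.postgres_database_connection
    }
    fn postgres_database_connection_mut(&mut self) -> &mut PostgresDatabaseConnection {
        &mut self.postgres_database_connection
    }
}

// Hypothetical helper: set runtime data, run init, return the resulting state.
fn init_demo(run_arguments: &'static str) -> String {
    let mut ctx = AppContext::default();
    ctx.configuration.run_arguments = run_arguments;
    <PostgresDatabaseConnection as Initializable<AppContext>>::init(&mut ctx);
    ctx.postgres_database_connection().connection_string.clone()
}

fn main() {
    println!("{}", init_demo("DB_URL=127.0.0.1:5555"));
}
```

Here the init call is spelled out by hand; in the article the same call is generated for every field tagged with init_listener.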
2.5 Prepare ApplicationContext

#[combine_fields(
    ConfigurationContextExtension,
    PostgresDatabaseContextExtension,
    ControllerContextExtension
)]
#[derive(Default, FieldEnumerator)]
struct ApplicationContext {}

inject_postgres_impl!();
inject_controller_impl!();
inject_configuration_impl!();

Added FieldEnumerator for tag enumeration.
Configuration module bindings included.

2.6 Initialization Sequence

impl ApplicationContext {
    fn init(&mut self) {
        fn call_init<T: Initializable<ApplicationContext>, F: Fn(ApplicationContext) -> T>(
            ctx: &mut ApplicationContext,
            _closure: F,
        ) {
            T::init(ctx);
        }

        macro_rules! init_callback {
            ($struct_name:ident, $field_name:ident, $listener_type:ident) => {
                call_init(self, |x| x.$field_name);
            };
        }

        enumerate_tags_ApplicationContext_init_listener!(init_callback);
    }
}

How it works:

call_init function
This helper function takes a generic type T that implements Initializable<ApplicationContext>.
It also takes a closure _closure of type Fn(ApplicationContext) -> T.
The trick here: the closure is never called. The compiler infers T from the closure's return type, which is the type of the selected field, so T::init(ctx) resolves to the concrete implementation for that field.

init_callback! macro
The macro expands once for each field tagged with init_listener.
It calls call_init with a closure that selects the field, ensuring the proper Initializable implementation is invoked.

enumerate_tags_ApplicationContext_init_listener! macro
This macro enumerates all fields in ApplicationContext that are marked with #[tag(init_listener)].
For each field, it invokes init_callback!, which triggers Initializable::init for that specific component.

Key Rust trick: By using the Fn trait and generics in call_init, the compiler resolves the actual type of the field at compile time. This avoids any runtime type checks and ensures zero-cost initialization while keeping strong type safety.
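The inference trick described above can be tried in isolation. The following is a minimal sketch under stated assumptions: Ctx, Cache, and Pool are hypothetical names, the field list in init_all is written by hand where the article uses the generated enumeration macro, and a log vector is added only to make the call order observable.

```rust
trait Initializable<C> {
    fn init(ctx: &mut C);
}

#[derive(Default)]
struct Cache {
    warmed: bool,
}

#[derive(Default)]
struct Pool {
    size: usize,
}

// Hypothetical context; `log` exists only to record the order of init calls.
#[derive(Default)]
struct Ctx {
    cache: Cache,
    pool: Pool,
    log: Vec<&'static str>,
}

impl Initializable<Ctx> for Cache {
    fn init(ctx: &mut Ctx) {
        ctx.cache.warmed = true;
        ctx.log.push("cache");
    }
}

impl Initializable<Ctx> for Pool {
    fn init(ctx: &mut Ctx) {
        ctx.pool.size = 4;
        ctx.log.push("pool");
    }
}

// The closure is never called: its return type only tells the compiler
// which T (and therefore which Initializable impl) is meant.
fn call_init<T: Initializable<Ctx>, F: Fn(Ctx) -> T>(ctx: &mut Ctx, _select: F) {
    T::init(ctx);
}

// Hand-written equivalent of the generated flat sequence of calls.
fn init_all(ctx: &mut Ctx) {
    call_init(ctx, |c| c.cache); // T is inferred as Cache
    call_init(ctx, |c| c.pool); // T is inferred as Pool
}

fn main() {
    let mut ctx = Ctx::default();
    init_all(&mut ctx);
    println!("warmed = {}, size = {}, order = {:?}", ctx.cache.warmed, ctx.pool.size, ctx.log);
}
```

Since the closure captures nothing and is never invoked, it exists purely as a type-level selector; each call_init instantiation should compile down to a direct call to the matching init function.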
2.7 Running the Application

pub fn run() {
    let mut ctx = ApplicationContext::default();
    ctx.configuration_mut().run_arguments = "DB_URL=127.0.0.1:5555";
    ctx.init();
    ctx.read_controller().do_something(&ctx, "argument");
    ctx.write_controller().do_something(&ctx, "argument");
}

Sample Output:

Init sequence = DB_URL=127.0.0.1:5555
Reading from Postgres DB on DB_URL=127.0.0.1:5555: SELECT * FROM table WHERE id = 'argument'
Writing into Postgres DB on DB_URL=127.0.0.1:5555: UPDATE table SET value = 'new' WHERE id = 'argument'

run_arguments successfully propagated into runtime data.

Performance Considerations

In this demo, some structs now hold runtime data, but this is intentional. It’s added to demonstrate initialization, just like in real applications where components manage runtime state.

The wiring mechanism itself remains zero-cost: All bindings are resolved at compile time through monomorphization.

Even with the initialization sequence broadcasting multiple init calls, the compiler generates a flat sequence of calls: no loops, no runtime collections, no dynamic dispatch. Which init functions run, and in what order, is fixed at compile time.

Limitations

This approach is now mature and production-ready for wiring, decoupling, and initialization.
Next steps can explore advanced topics, such as polymorphism and more complex runtime behaviors.

Dependency Injection and Polymorphism

This is the final example of the article and introduces what I would consider an advanced topic for the core engine of any dependency injection framework: polymorphism.

Many DI frameworks handle basic dependency wiring well. For example, Java Spring Boot provides a very mature implementation. However, in many other DI implementations, one important capability is often missing: the ability to handle multiple implementations of the same abstraction in a flexible and compile-time-safe way.

Let’s extend our example with a new requirement.
New Requirement

Our application should support multiple message brokers, for example:
Kafka
RabbitMQ

After writing data to the database, the controller should publish a message to one or more brokers.

However:
The component does not know which brokers exist
The container may contain multiple brokers
The DI framework must maintain this one-to-many relationship

One component should be able to call many broker implementations without knowing which ones exist.

To make things even more interesting, we introduce the concept of profiles. Each profile represents a different configuration of the application context.

Example:

Profile1
PostgreSQL database
Kafka broker
RabbitMQ broker

Profile2
Oracle database
RabbitMQ broker only

See the complete example.

Injection Macros and Profiles

First, we slightly modify our injection macros so they accept the application context type as an argument.

macro_rules! inject_configuration_impl {
    ($ctx:ident) => {
        impl UseConfiguration for $ctx {
            fn configuration(&self) -> &Configuration {
                &self.configuration
            }
            fn configuration_mut(&mut self) -> &mut Configuration {
                &mut self.configuration
            }
        }
    };
}

This change is necessary because the DI module does not know which profile will be used. Each executable can choose a different application context profile, and the macros must work with whichever profile is selected.

Oracle Database Component

Now we introduce a new database implementation.

#[allow(dead_code)]
#[derive(Fields, ContextExtension)]
struct OracleDatabaseContextExtension {}

And the injection macro:

macro_rules! inject_oracle_impl {
    ($ctx: ident) => {
        impl DatabaseConnection for $ctx {
            fn read_query(&self, query: &str) {
                println!("Reading from Oracle DB: {}", query)
            }
            fn write_query(&self, query: &str) {
                println!("Writing into Oracle DB {}", query)
            }
        }
        impl UseDatabaseConnection for $ctx {
            type T = $ctx;
            fn database_connection(&self) -> &Self::T {
                self
            }
        }
    };
}

Here we apply a small trick.
Instead of defining a separate struct for the database connection, we implement the trait directly on the application context. This approach avoids additional boilerplate and works well when we know there will only be one database implementation per profile.

Defining Message Brokers

Now we define the abstraction for message brokers.

Broker Interface

trait BrokerSender {
    fn send_to_broker(&self, value: &str);
}

RabbitMQ Broker

#[allow(dead_code)]
#[derive(Default, Fields, ContextExtension)]
struct RabbitMqContextExtension {
    #[tag(broker)]
    rabbit_mq: RabbitMq,
}

#[derive(Default)]
struct RabbitMq;

impl BrokerSender for RabbitMq {
    fn send_to_broker(&self, value: &str) {
        println!("{} sent to RabbitMq", value);
    }
}

Notice the important detail:

#[tag(broker)]

This tag allows the DI framework to enumerate all brokers automatically using the same mechanism we previously used for initialization.

Kafka Broker

Kafka is implemented in exactly the same way.

#[allow(dead_code)]
#[derive(Default, Fields, ContextExtension)]
struct KafkaContextExtension {
    #[tag(broker)]
    kafka: Kafka,
}

#[derive(Default)]
struct Kafka;

impl BrokerSender for Kafka {
    fn send_to_broker(&self, value: &str) {
        println!("{} sent to Kafka", value);
    }
}

Publisher: Compile-Time Polymorphism

Now comes the most interesting part. We define a Publisher component that sends messages to all available brokers.

trait Publisher {
    fn publish(&self, value: &str);
}

Injection macro:

macro_rules! inject_publisher_impl {
    ($ctx:ident) => {
        impl Publisher for $ctx {
            fn publish(&self, value: &str) {
                macro_rules! broker_callback {
                    ($struct_name:ident, $field_name:ident, $listener_type:ident) => {
                        self.$field_name.send_to_broker(value);
                    };
                }
                enumerate_tags!($ctx, broker, broker_callback);
            }
        }
        impl UsePublisher for $ctx {
            type T = $ctx;
            fn publisher(&self) -> &Self::T {
                self
            }
        }
    };
}

The key idea: the publisher does not know which brokers exist.
Instead, the FieldEnumerator macro generates code that calls send_to_broker for each tagged broker.

This gives us:
One-to-many relationship
Compile-time wiring
No dynamic dispatch
No runtime overhead

Helper Macro for Tag Enumeration

macro_rules! enumerate_tags {
    ($ctx:ident, $tag:ident, $callback:ident) => {
        paste! {
            [<enumerate_tags_ $ctx _ $tag >]!($callback)
        }
    };
}

This macro simply builds the name of the enumeration macro that the FieldEnumerator derive generated earlier and dispatches to it. The paste! macro, from the paste crate, concatenates the identifier pieces.

Application Profiles

Now we define two different application contexts.

Profile 1

#[combine_fields(
    ConfigurationContextExtension,
    PostgresDatabaseContextExtension,
    ControllerContextExtension,
    PublisherExtension,
    RabbitMqContextExtension,
    KafkaContextExtension
)]
#[derive(Default, FieldEnumerator)]
struct ApplicationProfile1 {}

Profile1 includes:
PostgreSQL
RabbitMQ
Kafka

Profile 2

#[combine_fields(
    ConfigurationContextExtension,
    OracleDatabaseContextExtension,
    ControllerContextExtension,
    PublisherExtension,
    RabbitMqContextExtension
)]
#[derive(Default, FieldEnumerator)]
struct ApplicationProfile2 {}

Profile2 includes:
Oracle database
RabbitMQ broker
No Kafka

Initialization Macro for Context

We move the previously used initialization logic into a reusable macro:

macro_rules! application_context {
    ($ctx: ident) => {
        const _: () = {
            const fn assert_send_sync<T: Send + Sync>() {}
            assert_send_sync::<$ctx>();
        };

        impl Initializable<$ctx> for $ctx {
            fn init(ctx: &mut $ctx) {
                fn call_init<T: Initializable<$ctx>, F: Fn($ctx) -> T>(
                    ctx: &mut $ctx,
                    _closure: F,
                ) {
                    T::init(ctx);
                }

                macro_rules! init_callback {
                    ($struct_name:ident, $field_name:ident, $listener_type:ident) => {
                        call_init(ctx, |x| x.$field_name);
                    };
                }

                enumerate_tags!($ctx, init_listener, init_callback);
            }
        }
    };
}

Wiring Profiles

Profile1

application_context!(ApplicationProfile1);
inject_postgres_impl!(ApplicationProfile1);
inject_controller_impl!(ApplicationProfile1);
inject_configuration_impl!(ApplicationProfile1);
inject_publisher_impl!(ApplicationProfile1);
inject_rabbit_mq_impl!(ApplicationProfile1);
inject_kafka_impl!(ApplicationProfile1);

Profile2

application_context!(ApplicationProfile2);
inject_oracle_impl!(ApplicationProfile2);
inject_controller_impl!(ApplicationProfile2);
inject_configuration_impl!(ApplicationProfile2);
inject_publisher_impl!(ApplicationProfile2);
inject_rabbit_mq_impl!(ApplicationProfile2);

Running the Example

fn do_run<T: Initializable<T> + Default + UseConfiguration + ControllerContext>() {
    let mut ctx = T::default();
    ctx.configuration_mut().run_arguments = "DB_URL=127.0.0.1:5555";
    T::init(&mut ctx);
    ctx.read_controller().do_something(&ctx, "argument");
    ctx.write_controller().do_something(&ctx, "argument");
}

pub fn run() {
    println!("Running Profile1");
    do_run::<ApplicationProfile1>();
    println!();
    println!("Running Profile2");
    do_run::<ApplicationProfile2>();
}

Example Output

Running Profile1
Configuration = DB_URL=127.0.0.1:5555
PostgresDB connection init sequence = DB_URL=127.0.0.1:5555
Reading from Postgres DB...
Writing into Postgres DB...
WriteController 'argument' sent to RabbitMq
WriteController 'argument' sent to Kafka

Running Profile2
Configuration = DB_URL=127.0.0.1:5555
Reading from Oracle DB...
Writing into Oracle DB...
WriteController 'argument' sent to RabbitMq

Final Result

With this approach we achieved:
Compile-time polymorphism
One-to-many dependency injection
Profile-based application configuration
No dynamic dispatch
No runtime container
Fully monomorphized wiring

Everything is resolved at compile time while still supporting flexible application configurations.

Conclusion: Can Rust Have Zero-Cost Dependency Injection?

Throughout this article we explored whether Dependency Injection can exist in Rust without introducing runtime overhead.

Traditional DI frameworks in languages such as Java rely heavily on reflection, runtime containers, dynamic dispatch, and runtime graph construction. These features make frameworks like Spring Boot extremely flexible, but they also introduce runtime complexity and performance costs.

Rust approaches the problem differently. Instead of relying on runtime containers, the examples in this article demonstrate how compile-time composition can be used to build a dependency injection system.

Using traits, generics, procedural macros, and compile-time code generation, we can construct an application context where:
Component wiring happens at compile time
Dependencies are resolved through traits and generics
Initialization logic can be generated statically
Polymorphism can be implemented without dynamic dispatch

Because Rust performs monomorphization during compilation, every dependency binding is resolved into concrete function calls. This means the final binary contains no reflection, no dynamic lookup tables, and no runtime dependency container.

In other words, dependency injection becomes a compile-time architectural pattern rather than a runtime framework.
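The claim that one-to-many polymorphism needs no dynamic dispatch can be illustrated with a small declarative macro. This is a sketch under loud assumptions: impl_publisher! is a hypothetical macro in which the broker fields are listed by hand, standing in for the tag enumeration that FieldEnumerator performs in the article, and the broker methods return strings instead of printing so the result can be checked.

```rust
trait BrokerSender {
    fn send_to_broker(&self, value: &str) -> String;
}

#[derive(Default)]
struct RabbitMq;

#[derive(Default)]
struct Kafka;

impl BrokerSender for RabbitMq {
    fn send_to_broker(&self, value: &str) -> String {
        format!("{value} sent to RabbitMq")
    }
}

impl BrokerSender for Kafka {
    fn send_to_broker(&self, value: &str) -> String {
        format!("{value} sent to Kafka")
    }
}

trait Publisher {
    fn publish(&self, value: &str) -> Vec<String>;
}

// Hypothetical stand-in for tag enumeration: fields are listed explicitly.
macro_rules! impl_publisher {
    ($ctx:ident; $($field:ident),* $(,)?) => {
        impl Publisher for $ctx {
            fn publish(&self, value: &str) -> Vec<String> {
                // Expands to one statically dispatched call per listed field.
                vec![$(self.$field.send_to_broker(value)),*]
            }
        }
    };
}

#[derive(Default)]
struct Profile1 {
    rabbit_mq: RabbitMq,
    kafka: Kafka,
}

#[derive(Default)]
struct Profile2 {
    rabbit_mq: RabbitMq,
}

impl_publisher!(Profile1; rabbit_mq, kafka);
impl_publisher!(Profile2; rabbit_mq);

fn main() {
    for line in Profile1::default().publish("msg") {
        println!("{line}");
    }
    for line in Profile2::default().publish("msg") {
        println!("{line}");
    }
}
```

Each profile gets its own publish body with a fixed list of concrete calls, so no trait objects or broker collections are involved. The article's approach differs in that the field list is derived from #[tag(broker)] attributes rather than written out.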
We also demonstrated several important features typically expected from mature DI systems:
Modular component composition through context extensions
Controlled initialization sequences
One-to-many polymorphism for components such as brokers
Configurable application profiles

And all of this without introducing runtime cost or dynamic dispatch.

The result is a system where flexibility and performance are not in conflict. Rust’s type system and macro system allow us to design architectures that remain fully decoupled, while still producing simple, predictable, zero-cost binaries.

This raises an interesting conclusion. Rust may never have a DI framework that looks like Spring Boot, and it probably shouldn’t. But Rust does allow dependency injection to exist in a different form, one that embraces the language’s philosophy: compile-time guarantees, explicit composition, and zero-cost abstractions.

Future Directions

The examples in this article intentionally keep the framework small in order to focus on the core ideas. However, a production-ready system would likely evolve further.

For example, initialization often requires explicit ordering between components, where some services must be initialized before others. The current example also contains a fair amount of boilerplate, which could be significantly reduced with a more advanced procedural macro design. Heavier use of derive and attribute macros could also improve IDE code completion and developer ergonomics while keeping the system fully type-safe.

Beyond the core container mechanics, several practical features naturally follow from this model: improved testing support, built-in mechanisms for mocking and stubbing components, and the ability to override components in derived profiles, a common requirement when building test environments or specialized deployments.

Finally, dependency injection frameworks rarely exist in isolation.
Systems such as Spring Boot succeeded not only because of their DI container, but because they provided a standard foundation for an ecosystem of reusable modules. A similar approach in Rust could allow libraries to integrate around a shared compile-time DI model, enabling a broader ecosystem of interoperable components while preserving Rust’s philosophy of explicit composition and zero-cost abstractions.

By Dmytro Brazhnyk

AI-Augmented React Development: How I Rebuilt My Workflow Without Losing Control of the Code

Every React developer reaches a point where the sheer volume of boilerplate starts to slow them down. Prop drilling, repetitive hook patterns, component scaffolding, unit test setup — the cognitive overhead adds up fast, especially at enterprise scale. When GitHub Copilot entered my workflow, I expected a productivity boost. What I didn't expect was how much I'd have to think about using it correctly. After integrating AI-assisted development into a React 18 codebase — spanning custom hooks, context-based state management, and accessibility-driven UI — I came away with a clear picture of where AI genuinely accelerates the work, where it quietly introduces risk, and what guardrails every team needs before they ship AI-assisted code to production. This isn't a tutorial on setting up Copilot. It's an honest account of what changed in my day-to-day React workflow, and how I rebuilt my development process around the strengths of AI without surrendering architectural judgment. Where AI Actually Accelerates React Development 1. Component Scaffolding The most immediate win was generating boilerplate-heavy component shells. React functional components follow a predictable structure: imports, props interface, state declarations, effect hooks, render return. Copilot autocompletes this structure accurately and fast, especially when your file already has consistent patterns. For example, starting a new form component with a comment like: Plain Text // Controlled form component with validation and submit handler … triggers a usable scaffold within seconds. In a codebase with 50+ form components, this adds up to meaningful time savings. 2. TypeScript Prop Typing One of the most tedious parts of React 18 development is defining interface types for component props — especially for components consuming API response shapes. Copilot handles this well when the API shape is already defined elsewhere in the file or project. 
It infers prop types from usage context and generates clean interfaces without much guidance.

3. Unit Test Generation

Copilot shines at generating @testing-library/react test cases for presentational components. Given a component file, it can suggest:
Render tests
User interaction tests (click, input change)
Accessibility checks using getByRole

This reduced the time I spent on repetitive test scaffolding by roughly 40% for simple components.

4. Repetitive Hook Patterns

Standard hooks like useEffect with cleanup, useCallback with dependency arrays, and useMemo for expensive computations follow well-known patterns. Copilot autocompletes these reliably, and the suggestions are often correct on the first try when the surrounding context is clear.

Where AI Fails React Developers (and Why It Matters)

This is the part most AI-workflow articles skip. In my experience, Copilot introduced subtle issues in three specific areas:

1. State Management Architecture

Copilot is pattern-matching, not reasoning. When I was designing a context-based global state solution for a multi-step form flow, Copilot consistently suggested patterns that worked for isolated examples but didn't scale: it created redundant useContext calls across components that should have been wrapped in a provider, and it failed to account for re-render performance implications.

The lesson: Never accept AI suggestions for state architecture without reviewing the component tree. AI optimizes locally; architecture requires global thinking.

2. Custom Hook Dependency Arrays

Incorrect dependency arrays in useEffect and useCallback are a well-known React footgun. Copilot's suggestions here were hit-or-miss. It occasionally omitted dependencies that needed to be included, and at other times included values that did not belong, which triggered unnecessary effect runs and re-renders.

I started treating all AI-generated dependency arrays as drafts that required manual review against the ESLint react-hooks/exhaustive-deps rule. This step is non-negotiable.

3. Accessibility in JSX

This one is subtle. Copilot generates functional JSX, but accessible JSX requires deliberate attention to ARIA roles, focus management, and semantic HTML. AI-generated components often defaulted to div-heavy markup without the aria-* attributes or keyboard event handlers that production apps require.

For any component touching user interaction (modals, dropdowns, form controls), I reviewed AI-generated output against WCAG 2.1 AA standards before committing.

My Rebuilt Workflow: A Practical Stack

After months of iteration, here's the workflow that works:

Phase 1: Design First, Prompt Second

Before I open a new file, I sketch the component's responsibilities on paper or in a comment block:

JavaScript
/**
 * UserProfileCard
 * - Displays user avatar, name, role
 * - Supports edit mode toggle
 * - Emits onSave callback with updated values
 * - Must be keyboard accessible
 */

This comment becomes the Copilot context. The more specific the intent, the better the scaffold.

Phase 2: Accept Scaffolding, Write Logic

I accept Copilot suggestions for:
Component shell
Prop interface
State variable declarations
JSX structure for simple layouts

I write manually:
useEffect logic and cleanup
Event handler implementations
Context provider design
Error boundaries
Any business logic touching API data

Phase 3: Review AI-Generated Tests

Copilot generates test scaffolding well.
I review every generated test for:
Correct use of userEvent vs fireEvent
Accurate assertions (not just "it rendered")
Missing edge cases (empty state, error state, loading state)

Phase 4: Accessibility Audit Pass

Every component gets a final pass against:
Semantic HTML element usage
aria-label / aria-describedby for interactive elements
Keyboard navigation (tab order, focus trap for modals)
Color contrast (handled at design system level, not component level)

A Real Before-and-After Example

Before (pre-AI workflow): A controlled input component with validation took roughly 25–30 minutes to scaffold, type, test, and review.

After (AI-augmented workflow): The same component takes 10–12 minutes, with Copilot handling the initial scaffold and test shell, and me handling the validation logic, hook dependencies, and accessibility pass.

Here's a simplified example of the kind of component where AI delivers the most value:

TypeScript
import React from "react";

interface SearchInputProps {
  value: string;
  onChange: (value: string) => void;
  onSubmit: () => void;
  placeholder?: string;
  isLoading?: boolean;
}

const SearchInput: React.FC<SearchInputProps> = ({
  value,
  onChange,
  onSubmit,
  placeholder = "Search...",
  isLoading = false,
}) => {
  // Submit on Enter so the input is usable without the mouse.
  const handleKeyDown = (e: React.KeyboardEvent<HTMLInputElement>) => {
    if (e.key === "Enter") onSubmit();
  };

  return (
    <div role="search">
      <input
        type="search"
        value={value}
        onChange={(e) => onChange(e.target.value)}
        onKeyDown={handleKeyDown}
        placeholder={placeholder}
        aria-label="Search"
        disabled={isLoading}
      />
      <button onClick={onSubmit} disabled={isLoading} aria-label="Submit search">
        {isLoading ? "Searching..." : "Search"}
      </button>
    </div>
  );
};

The scaffold, prop interface, and JSX structure above were AI-generated in under 30 seconds. The aria-label attributes, role="search", and handleKeyDown implementation were my additions, things Copilot consistently missed in initial suggestions.
Where AI Hits a Wall: Large-Scale Enterprise React Projects Small, isolated components are where AI shines. But real enterprise codebases are rarely small or isolated. Once you're working inside a large monorepo with hundreds of components, shared design systems, domain-specific business logic, and cross-team API contracts, AI-assisted development runs into a fundamental limitation: it only sees what's in its context window. Here's where that breaks down in practice: 1. Cross-File Dependency Awareness In a large React application, a single component may depend on a shared context provider defined four directories away, a utility hook maintained by a different team, and a TypeScript type exported from a core domain package. Copilot's autocomplete works within the file you're editing — it doesn't have a deep understanding of the full dependency graph. The result: AI-generated code that compiles locally but breaks at integration because it assumes a prop shape, import path, or context value that doesn't match what actually exists in the broader system. I've seen this surface most often with shared form validation schemas and API response types that live outside the component's immediate file tree. 2. Institutional Knowledge and Business Logic Enterprise React codebases carry years of intentional decisions that aren't documented anywhere in the code — they live in the heads of the team. Why is this particular component wrapped in a custom error boundary? Why does this dropdown use a local state copy instead of reading directly from context? Why is this API called twice? Copilot has no way of knowing. When it generates code in these areas, it produces something that looks reasonable but violates the implicit contract the team has built over time. Catching these violations requires a senior developer who understands the why behind the existing patterns — AI cannot substitute for that. 3. 
Design System Consistency at Scale Large teams typically maintain a shared component library — think an internal fork of Material UI or a custom design system. AI tools don't know which internal components to reach for. Copilot frequently suggests raw HTML elements or third-party components when the project has established internal equivalents: <Button> from your design system instead of <button>, <TextInput> from your library instead of a raw <input>. At scale, this creates design debt fast. Every AI-generated component that uses a raw HTML element instead of the design system equivalent is a component that diverges from your visual and behavioral standards — and accumulates technical debt that's expensive to audit later. 4. Performance Optimization in Complex Component Trees React 18 introduced useDeferredValue, useTransition, and concurrent rendering features specifically to handle performance in large, deeply nested component trees. These are nuanced APIs — their correct usage depends on understanding the rendering priority of specific subtrees, which operations are expensive, and what the user experience should be during transitions. Copilot-generated code in this area is almost always naive. It doesn't know that a particular list component renders 500+ items and needs virtualization. It doesn't know that a specific state update should be wrapped in startTransition to keep the UI responsive. Optimizing a large React application for performance remains deeply human work. 5. Multi-Team Merge Conflicts and Shared State In enterprise projects with multiple teams contributing to the same React codebase, shared state management becomes politically and technically complex. Redux slices, Zustand stores, or React Query caches span team boundaries. AI tools can suggest changes to these shared structures without awareness of how other teams depend on them — leading to breakages that only surface in integration environments. 
The practical takeaway: the larger and more interconnected the codebase, the more you need to treat AI as a localized assistant, not a system-aware collaborator. Use it to accelerate work on leaf-node components and isolated utilities. Treat any AI suggestion that touches shared state, cross-team APIs, or core infrastructure with the same scrutiny you'd give an external contributor who just joined the project. If you're introducing AI-assisted development into a React team, here are the non-negotiables: 1. Never merge AI-generated code without lint and type checks passing. Run eslint, tsc --noEmit, and your test suite before treating any AI-generated file as complete. 2. Establish a "no AI for architecture" rule. Component tree design, context structure, routing decisions, and data fetching strategy should be human-driven. AI is a code accelerator, not an architect. 3. Code review AI-generated PRs with extra scrutiny. Reviewers should specifically look for: missing hook dependencies, over-broad useEffect triggers, missing accessibility attributes, and logic that "looks right" but doesn't account for edge cases. 4. Document what AI touched. Some teams are beginning to tag AI-assisted code in commit messages or comments. This creates accountability and helps reviewers calibrate their scrutiny. 5. Keep your feedback loop active. When Copilot generates something wrong, reject it explicitly rather than accepting and editing. This helps calibrate your own pattern recognition for what AI does and doesn't handle well. What's Coming Next: Agentic React Workflows The current state of AI in React development is assistive — it completes what you start. The next wave is agentic: AI agents that can take a design spec or Figma export, scaffold an entire component hierarchy, wire up state, and generate test coverage — with a human reviewing the output rather than writing it line by line. 
Early tools like Cursor's Composer mode and experimental GitHub Copilot Workspace are beginning to move in this direction. For React developers, the implication is a shift in the skill that matters most: from writing components quickly to reviewing and evaluating AI-generated component systems critically. The developers who will thrive in this environment are those who deeply understand React's rendering model, state management tradeoffs, and accessibility requirements — not because they're writing every line, but because they're the final judgment layer on what ships. Conclusion AI-augmented development isn't about replacing React expertise — it's about redirecting it. The hours saved on scaffolding and boilerplate are hours you can reinvest in architecture, performance, accessibility, and code quality. The key insight from rebuilding my workflow around GitHub Copilot is this: AI is a force multiplier for what you already know well. If you understand React deeply, it makes you faster. If you're still learning React's mental model, it can quietly introduce patterns that seem right but aren't. Used with clear guardrails and deliberate review habits, AI turns a good React developer into a significantly more productive one — without sacrificing the code quality that enterprise applications demand.

By Sathwik Nagulapati

If You Can Facilitate a Retrospective, You Can Audit Your AI

TL;DR: The AI Delegation Audit

Scrum teams inspect how the last Sprint went during the Retrospective. They are much less likely to inspect the work they have handed to AI, because no meeting on the calendar owns it. That gap is where a working AI automation quietly turns into risk: it keeps producing fluent, on-brand output long after the decision to trust it has expired. The AI Delegation Audit closes the gap by leveraging the facilitation skills teams already use in a Retrospective.

Thesis: The Delegation Audit is the missing inspection cadence for delegated AI work. It checks four things: whether the work still meets the standard, whether the model still fits the task, whether the team can still stop the automation, and whether reviewed assistance has quietly become unreviewed automation. You can try it on one workflow in fifteen minutes.

The Automation That Looked Healthy

A product team automates its Friday stakeholder update in March. The setup is careful: the model drafts from the Jira board, the workflow owner reviews the draft, and it ships. For three months it works. In June, the same automation tells an enterprise prospect that a security feature is in production. No application code changed, and nobody touched the prompt. But the system around the automation had shifted: a descoped feature, a stale ticket title that survived in the product backlog, and a change in model behavior combined into a false update. The dangerous part was not a visible failure: the automation kept producing fluent, plausible, on-brand updates, which is exactly what made the degradation hard to notice.

That points to the belief worth naming first: a workflow that still produces output is assumed to be still fully functioning. A working automation is not evidence that the delegation behind it is still valid, and validating it once, at setup, is not the same as keeping it valid.
What the Delegation Audit Is

The Delegation Audit of the A3 Framework borrows the facilitation pattern of a Retrospective, not the Scrum event itself. Instead of how the team worked, it examines how the team’s AI delegations are holding up: 45 to 60 minutes, monthly or every other Sprint, with a named owner and a slot on the calendar.

In the A3 Framework, this is what the Automate category has always required. The moment you trust work to run with little or no human review, you owe it explicit rules and a recurring audit. Most teams adopt the rules and skip the audit because no one owns it. The Delegation Audit is that meeting, and it is the Inspect step of the AI Delegation Lifecycle.

The name is deliberate: nobody in finance, security, or operations needs an agile glossary to understand what a delegation audit is or why a team runs one. The practice underneath is familiar: gather data, surface what changed, turn findings into decisions, and leave with owners.

The Four Checks

Each check inspects one way a delegation degrades after it goes live:

- Output and source drift: Does the work still meet its AI Definition of Done, and are the inputs still fit for use? Pull three recent outputs per workflow and trace each one back to its sources. Model updates change output quality in both directions without notice, and the inputs move along with them: stale records, changed permissions, and archived data that the model cannot tell from current facts. A polished summary built on stale data is still a failed delegation.
- Model fit: Is the assigned model still the right one? Look in both directions: a cheaper tier that no longer meets the standard, and a frontier model burning budget on work that a mid-tier now handles. The test is whether the model is sufficient for this task at this risk level, not whether it is the most capable one available.
If your team runs a routing policy, this check feeds into it, and the cost side has its own treatment in token economics.
- Reversibility: Could you stop each automation today? Test the stop rules from your handoff: who pulls the plug, how fast, and whether that person still works here. An automation without a reachable owner is not delegated; it is abandoned, and it is now a risk.
- Category creep: Which Assist work has become unreviewed Automate? Watch for the tell: review time per output trending toward zero. When a human approves a draft in 4 seconds, that is not review, and the work has changed its A3 category without anyone deciding. Name it, then choose: promote it to Automate properly, with rules and a stop rule, or restore genuine review.

Run It Like a Retrospective

The agenda fits 60 minutes and will feel familiar:

- Data walk (10 min): Put the delegation inventory on the wall: every automated and assisted workflow, its A3 category, its model tier, its last audit date. Add usage or spend data if you have it. Look first, discuss later.
- Run the four checks in pairs (20 min): Assign workflows to pairs. Each pair runs all four checks on its workflows and marks each finding pass, drift, or fail.
- Re-classify (15 min): Walk through the findings. Every drift or fail gets a decision: change the A3 category, change the tier, update the AI Definition of Done, fix the stop rule, or retire the delegation. Retiring an automation that no longer earns its audit cost is a successful outcome of the meeting.
- Decisions and owners (10 min): Each decision gets a name and a date. A finding without an owner is one you will rediscover next time; don’t create waste.
- Close the record (5 min): Update the log: what moved, why, and who decided.

Why Inspection Stopped Being Optional

Two forces make a standing audit necessary now: The first is the models: they update on the vendor’s schedule, not yours.
A change to how a model summarizes, refuses, or formats can move output quality with no signal on your side. An automation you validated once is running on assumptions that have quietly expired. The second is accountability: NIST organizes AI risk management around four functions: govern, map, measure, and manage. Inspection is the measure-and-manage half, and a team that only governs and maps has stopped before the work becomes operational. Set-and-forget is the default, and it compounds unseen until a drifted output becomes an incident in front of the wrong audience.

The Record You Get for Free

Each audit updates a dated log: workflow, owner, model tier, last checked output, drift finding, decision, and follow-up date. Stack those logs, and you have an inspection trail: evidence that your team’s AI adoption is controlled rather than assumed. When a stakeholder, for example, a prospect’s procurement team, asks how you govern your internal AI use, that trail is half the answer, and you wrote none of it as a separate report. It came out of one recurring meeting.

What to Do in Your Next Retrospective

Do not schedule a new event yet. Take one delegated workflow, the one that would embarrass you most if it drifted, and spend fifteen minutes of your next Retrospective running the four checks on it out loud: output and source, model fit, reversibility, category creep. You will probably find at least one answer that amounts to “nobody has looked since we set this up.” That single finding is enough to put the audit on the calendar.

Conclusion

A Retrospective keeps a team honest about how it works together. The Delegation Audit extends that same facilitation habit to the work the team handed to a model, where an automation can look healthy long after the decision to trust it has expired. When did your team last inspect an automation it trusts, and what would the four checks find if you ran them this week?

Key Questions This Article Answers

What Is a Delegation Audit?
A Delegation Audit is a recurring 45- to 60-minute inspection of a team’s delegated AI work, run monthly or every other Sprint. It checks whether automated and AI-assisted workflows still meet the team’s standard, using the facilitation skills of a Retrospective. It is the Inspect step of the AI Delegation Lifecycle.

What Does a Delegation Audit Check?

Four things:

- Output and source drift: Does the work still meet its AI Definition of Done, and are the inputs still trustworthy?
- Model fit: Is the assigned model still the right one for the task and its risk level?
- Reversibility: Can you stop the automation today?
- Category creep: Has Assist work become unreviewed Automate?

How Is a Delegation Audit Different From a Retrospective?

Same skill, different subject. A Retrospective inspects how the team worked together. A Delegation Audit inspects how the team’s AI delegations are holding up, then turns each drift finding into a decision with an owner and a date.
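The dated log and the category-creep check described in this article lend themselves to a small data structure. The sketch below is one possible shape, not part of the A3 Framework itself: the field names mirror the log columns listed in the article, while `AuditEntry`, `category_creep`, `needs_decision`, and the 30-second review floor are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date
from statistics import median


@dataclass
class AuditEntry:
    """One row of the audit log, using the columns named in the article."""
    workflow: str
    owner: str
    model_tier: str
    last_checked_output: str
    drift_finding: str  # "pass", "drift", or "fail"
    decision: str
    follow_up: date


def category_creep(review_seconds: list[float], min_seconds: float = 30.0) -> bool:
    """Flag Assist work whose typical review time fell below a floor.

    The 30-second default is an assumed threshold; pick one that fits the work.
    """
    return bool(review_seconds) and median(review_seconds) < min_seconds


def needs_decision(entries: list[AuditEntry]) -> list[str]:
    """Return workflows whose finding is drift or fail and so needs an owner."""
    return [e.workflow for e in entries if e.drift_finding in ("drift", "fail")]
```

Using the median rather than the mean keeps one unusually careful review from hiding a pattern of four-second approvals.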

By Stefan Wolpers


Loop Engineering: The Layer After Prompt, Context, and Harness Engineering

A few weeks ago, I read a line from Boris Cherny, the person behind Claude Code, that stuck with me. He said he does not prompt Claude anymore. He has loops running, and those loops are the ones prompting Claude and deciding what to do next.

I sat with that for a while. For two years, every guide on working with AI agents told us to get better at writing instructions. Then it told us to get better at feeding the model the right information. Then it told us to build proper scaffolding around the agent so that it behaves like trustworthy software. Now there is a fourth layer, and it is less about talking to the agent and more about building a small system that talks to the agent for you. People are calling it loop engineering.

This piece walks through how we got here, what loop engineering actually means in plain terms, and where it fits next to the three ideas that came before it: prompt engineering, context engineering, and harness engineering. I have written about the first two before, so I will link back to those pieces where it helps rather than repeat myself.

How we got here: Journey so far

Each layer did not replace the one before it. It sat on top of it. You still write prompts. You still manage context. You just stopped being the one doing it by hand for every single turn.
Here is the short version, side by side, before we go into each one in detail:

| Layer | What you are actually doing | Where the skill lives | Good fit for | Weak fit for | What breaks if you skip it |
| --- | --- | --- | --- | --- | --- |
| Prompt engineering | Writing clear instructions for one turn | The words in the message | One-off questions, demos, quick scripts, learning a new model | Repeated workflows, anything that touches production data | Inconsistent answers, works once and fails the next ten times |
| Context engineering | Choosing what the model gets to see before it answers | The data, documents, and tool outputs around the prompt | RAG systems, chatbots over live data, anything where the model needs facts it was not trained on | Tasks where the prompt alone is already enough | Confident, well-written, wrong answers |
| Harness engineering | Building the scaffolding that checks the agent's work | The system around one agent run: tools, evals, guardrails, logs | Coding agents, multi-step automations, anything running without a human reading every step | Simple single-call requests with no follow-up actions | Agents that quietly do the wrong thing and nobody notices until later |
| Loop engineering | Designing the system that decides what to run next, on its own | The control loop sitting above the agent and its harness | Long-running or unattended work, backlog grooming, overnight batches, anything with a clear goal and a way to prove it is done | Tasks with a vague goal or no real way to check success | Loops that run for hours, spend budget, and produce work nobody asked for |

This table is the map for the rest of the piece. Each section below goes one layer deeper.

Prompt Engineering: Where It All Started

Prompt engineering was the first skill anyone associated with getting good output from a language model. Phrase the question well, give an example or two, ask the model to think step by step, and the answer gets better. It worked, and it still works for one-off tasks.
The problem showed up once people tried to run the same model on a real workflow, again and again, across different inputs. A clever sentence that worked once does not hold up across a thousand runs with messy, real-world data. The short version is this: prompt engineering is great for experiments and demos, but production systems need something steadier than a well-worded sentence.

Context Engineering: The Model Needed More Than Nice Words

Context engineering showed up once teams realized that the actual bottleneck was rarely the phrasing. It was what the model could see. A model with a perfect prompt and no access to the right document, the right database row, or the right tool output will still guess wrong.

Tobi Lütke from Shopify put it simply: context engineering is the art of giving the model everything it needs so the task is actually solvable. Andrej Karpathy described it as the careful science of filling the context window with exactly the right information for the next step, not more, not less. By late 2025, prompt engineering had become something people did inside context engineering rather than a separate skill on its own.

Harness Engineering: Making One Agent Run Trustworthy

Once teams started letting agents take multiple steps on their own, a new problem appeared. The agent might write code, run a test, look at the failure, and try again. That loop within a single task needed rules. What tools can it call? What happens if it gets stuck? How do you stop it from quietly making the wrong change to a file it was never supposed to touch?

This is harness engineering, and I spent a full piece on it earlier this year in From Prompts to Harnesses: How AI Engineering Has Grown Up.
The short version: prompt engineering got the conversation started, context engineering made the answers consistent, and harness engineering is what actually makes an agent safe to run in production, because it stops depending on the model behaving well and starts depending on a system around the model that checks its work. Think of the harness as the seatbelt, the dashboard, and the guardrails for one agent doing one job. It covers things like giving the agent a clean view of the repository, exposing the right API contracts, watching logs and live CI status as part of its context, and building eval gates that catch bad output before it ships. I went deeper into one well-known pattern for this layer in The Twelve-Factor Agents: Building Production-Ready LLM Applications, borrowing from the original Twelve-Factor App rules and adapting them for agents: one clear purpose per agent, explicit dependencies, and a strong separation between business logic and execution state. I also looked at the architectural side of this in AI Agent Architectures: Patterns, Applications, and Implementation Guide, where orchestrator-worker setups and blackboard-style coordination turn out to be different ways of answering the same question a harness has to answer: who is allowed to do what, and who checks the result. 
Picking an architecture for the agents inside your harness is its own decision, and it is worth slowing down on, because the wrong pattern for the job tends to show up as flaky behavior that looks like a model problem but is actually a structure problem:

| Architecture | How it works | Best for | Watch out for |
| --- | --- | --- | --- |
| Single agent | One agent, one prompt loop, one set of tools | Simple, well-bounded tasks like answering support tickets or summarizing a document | Falls apart fast once the task needs more than a handful of steps |
| Orchestrator-worker | A central agent breaks the job into pieces and hands each piece to a specialist agent | Tasks that can be cleanly split, like "research this, then write this, then format this" | The orchestrator becomes a single point of failure if it makes a bad plan early on |
| Blackboard | Agents post partial answers to a shared space and pick up work opportunistically, with no central controller | Open-ended problems where the right order of steps is not known in advance, like diagnosis or research | Harder to debug, since there is no one place that decided what happens next |
| Event-driven | Agents react to events as they happen rather than being called in a fixed order | Systems that need to respond to changes in real time, like monitoring or alerting | Needs solid event delivery guarantees, or agents miss things silently |
| Graph or loop-based | A loop or graph decides which agent or sub-agent runs next, based on state and results so far | Long-running, multi-stage work where the next step depends on what the last step found | The whole thing is only as reliable as the state tracking and the checks between steps |

None of this works without the layer underneath it, either. The compute, storage, and serving choices that decide whether a harness can actually run reliably at scale are covered in AI Infrastructure for Agents and LLMs: Options, Tools, and Optimization and its follow-up, AI Infrastructure Guide: Tools, Frameworks, and Architecture Flows.
And because a harness is still software that needs to be deployed and rolled back like anything else, Infrastructure as Code: How Automation Evolved to Power AI Workloads walks through why pinning model versions in config, the same way you pin a database connection string, has become a basic rule rather than a nice-to-have. I rounded up some of the tools doing this well in Developer Tools That Actually Matter in 2026.

There is also a second, smaller decision inside the harness itself: what kind of check are you actually running at each step? Some checks are deterministic and cheap; others are judgment calls made by another model. Both have a place, and most solid harnesses use a mix:

| Check type | What it means | Example | Speed and cost | Reliability |
| --- | --- | --- | --- | --- |
| Computational | A fixed rule, run by regular code | Unit tests, linters, type checkers, schema validation | Fast and cheap | Very reliable, but only catches what you thought to check for |
| Inferential | Another model judges the output | LLM-as-judge review, an AI code reviewer, a second model checking tone or factual accuracy | Slower and more expensive | Catches fuzzier problems, but can itself be wrong or inconsistent |

For the fuller operational picture, the practices around monitoring, cost control, and incident response for agents running in production are laid out in the Shipping Production-Grade AI Agents refcard, including a simple three-question framework for any agent incident: what was the agent trying to do, what did it actually do, and what state did it change in the outside world.

Harness engineering solved a real problem: it made a single agent run safe enough to trust. But it still assumed a person was sitting there, kicking off the run, reading the result, and deciding what happens next.

Loop Engineering: Stop Being the One Who Presses the Button

This is the part that changed in 2026.
Geoffrey Huntley, earlier in the year, described running a coding agent inside a plain loop in his terminal: give it the same prompt against a written spec, let it pick one task, implement it, then start a fresh copy of the agent and feed it the same prompt again. People started calling this the Ralph technique, after the simple while loop running underneath it. It looked almost too basic to matter, but it worked, and it pointed at something bigger.

Loop engineering takes that idea and makes it a discipline. Instead of typing a prompt, reading the answer, and typing the next prompt yourself, you build a small system that does that cycle for you. It checks what work is pending, decides what an agent should try next, hands the task off, checks whether the result actually meets the goal, saves what it learned, and either stops or starts the cycle again.

Addy Osmani, who wrote one of the essays that got this term moving, framed it well: loop engineering means replacing yourself as the one who prompts the agent. You design the system that prompts it instead.

[Figure: Loop engineering]

A few things matter a lot once you build one of these:

The goal has to be provable, not just stated. "Make the checkout flow better" gives a loop nothing to check itself against, so it will stop whenever it feels like it has done enough, which is rarely what you wanted. People running these loops for real work have settled on writing out the end state they expect, the proof needed to show it was reached, the rules that cannot be broken along the way, and a hard limit on how long or how much the loop is allowed to run.

The verifier is the actual bottleneck, not the model. A loop is only as good as its ability to tell good work from bad work. If the check at the end of each cycle is weak, the loop will happily mark broken work as done and move on. Most of the engineering effort in a good loop goes into that verification step, not into the prompt that kicks the agent off.
Prompts and context did not disappear; they moved inside the loop. The loop is still writing prompts and assembling context on every cycle. It is just doing it itself, using the same context engineering ideas from earlier, instead of a person typing it fresh each time. This is why I think of loop engineering as sitting on top of the other three rather than replacing any of them.

Not every loop looks the same, either. Once you decide a loop is the right tool, the next choice is how tightly you want to hold its leash:

| Loop style | How it runs | Good for | Risk |
| --- | --- | --- | --- |
| Closed loop, human approves each step | Loop proposes the next action, a person clicks approve | High-stakes changes, early days of trusting a new loop | Slow, defeats some of the point if you approve everything anyway |
| Open loop, runs to a budget or time limit | Loop runs unattended until it hits a turn limit, a cost cap, or finishes the goal | Overnight batches, backlog grooming, well-scoped refactors | Can burn budget on the wrong thing if the goal or verifier is weak |
| Single agent in a plain loop (the Ralph style) | One agent, one spec, fresh instance each cycle, no memory carried forward except what is written to disk | Small, well-defined coding tasks where starting fresh each time avoids the agent confusing itself | Repeats work it has no memory of, and needs a very clear spec to avoid drifting |
| Orchestrated loop with sub-agents | A controlling loop spawns specialized sub-agents for different parts of the task and merges results | Larger goals that naturally split into independent pieces | Coordination overhead, and a weak verifier at the merge step undoes everything underneath it |

Where This Connects to the Bigger Picture

None of these lives in isolation. A loop that spawns multiple agents to work on different pieces of a task at the same time needs the kind of coordination covered by multi-agent orchestration work, including how agents talk to tools through the Model Context Protocol and to each other through agent-to-agent protocols.
And every loop running unattended for hours needs the same monitoring, cost tracking, and incident response discipline that harness engineering already worked out, just applied continuously instead of once per run.

There is also a real risk worth naming honestly: people are already calling the overuse of unattended loops "loopmaxxing," where a loop runs for hours, burns budget, and produces a pile of code nobody asked for because the goal was vague and the verifier was weak. A loop is not magic. It is a control system, and like any control system, it is only as good as what you tell it to check for.

Conclusion

I have watched this field rename itself every year or so since I started writing about agents, and each rename has felt a little like marketing at first glance. Loop engineering is different in one respect: it describes a real change in where the human sits in the workflow. We went from writing every prompt by hand, to curating what the model sees, to building safety nets around a single run, and now to designing a small system that runs the whole cycle on our behalf while we go do something else. The job did not get smaller. It got one level removed from the keyboard.

If you found context engineering useful, the next worthwhile habit to build is writing goals that can actually be proven true or false, because that is the one piece that a loop cannot do for itself.

If any of the earlier pieces in this chain are new to you, start with the prompt-to-context shift, then read how that grew into harness engineering in From Prompts to Harnesses: How AI Engineering Has Grown Up, and the production patterns for agents in the Twelve-Factor Agents piece. You can find the rest of what I have written on agents, infrastructure, and developer tools on my DZone author page.
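The cycle described in this piece (pick pending work, hand it to an agent, verify the result, record the outcome, then stop or go again under a hard budget) fits in a few lines of control code. This is a minimal sketch under stated assumptions rather than any tool's actual implementation: `agent` and `verifier` are hypothetical callables you would supply, and `run_loop`, `LoopResult`, `max_cycles`, and `max_attempts` are names invented for the example.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LoopResult:
    done: list[str] = field(default_factory=list)
    rejected: list[str] = field(default_factory=list)
    cycles: int = 0
    stopped_by: str = "queue_empty"


def run_loop(
    pending: list[str],
    agent: Callable[[str], str],           # hypothetical: runs one task, returns output
    verifier: Callable[[str, str], bool],  # hypothetical: proves the output meets the goal
    max_cycles: int = 10,                  # hard budget on agent runs
    max_attempts: int = 2,                 # retries allowed per task
) -> LoopResult:
    """Run tasks through agent and verifier until the queue empties or the budget runs out."""
    result = LoopResult()
    queue = [(task, 0) for task in pending]
    while queue:
        if result.cycles >= max_cycles:
            result.stopped_by = "budget"
            break
        task, attempts = queue.pop(0)
        result.cycles += 1
        output = agent(task)  # a fresh call each cycle; no state is carried over
        if verifier(task, output):
            result.done.append(task)
        elif attempts + 1 < max_attempts:
            queue.append((task, attempts + 1))  # retry later, at the back of the queue
        else:
            result.rejected.append(task)  # surface the failure instead of marking it done
    return result
```

Note where the weight sits: the agent call is one line, while everything else is budget, retry, and verification bookkeeping. A weak `verifier` here would let broken work land in `done`, which is the failure mode the article warns about.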

By Vidyasagar (Sarath Chandra) Machupalli FBCS
