DZone Spotlight

Friday, July 3 View All Articles »

The Software Deployment Failures That Pass Every Pre-Deployment Check

By Sancharini Panda

A deployment can pass every gate in a pipeline and still be wrong. This sounds like a contradiction until you look closely at what pre-deployment checks actually verify. Unit tests confirm that individual functions behave as the developer who wrote them intended. Integration tests confirm that components interact the way they were specified to interact. Smoke tests confirm that the application starts and responds. Every one of these checks can pass cleanly while the deployment still introduces a failure that none of them were ever positioned to catch. The failures that slip through this way share a specific characteristic worth naming directly: they are not failures of the code that was just changed. They are failures in how that code now interacts with something else in the system that was not part of the deployment at all. Why Passing Checks Are Not the Same as Correct Behavior Pre-deployment checks are, almost by design, retrospective and localized. They validate against a specification someone wrote at some point in the past, scoped to the component being deployed. This is a reasonable and necessary thing to do. It is also fundamentally insufficient for catching an entire category of deployment risk that exists specifically because modern systems are not static. Consider what happens in a system composed of a dozen or more independently deployable services. Service A integrates with Service B by calling its API and expecting a particular response shape. The test suite for Service A includes a mock that represents Service B's behavior, written when the integration was first built. That mock was accurate at the time. It is now a frozen snapshot of a moving target. Service B continues to evolve. It deploys updates on its own schedule, for its own reasons, entirely disconnected from Service A's release cycle. Each of those updates might be entirely correct from Service B's own perspective, validated by Service B's own test suite, reviewed and approved by Service B's own team. None of that matters to Service A, which is still running its tests against a mock that no longer reflects what Service B actually does. When Service A deploys next, its pipeline runs cleanly. Every check passes, because every check is validating against an internally consistent but externally outdated picture of the world. The deployment that breaks production is, from the perspective of the pipeline that approved it, a complete success. The Specific Shape of This Failure This category of failure has a recognizable signature once you know to look for it, and it differs in important ways from a typical bug. It does not appear in the code that was just changed. The deployed service often behaves exactly as intended. The failure surfaces at the boundary, in how that service's output is interpreted by something downstream, or in how an upstream dependency's actual current behavior diverges from what the deployed service assumed it would be. It does not correlate cleanly with software deployment frequency in the way most teams expect. A team might deploy daily with low change failure rates for months, building justified confidence in their pipeline, and then be blindsided by an incident that traces back to a dependency that changed six weeks earlier and was never re-validated against. The failure was latent the entire time, waiting for the right combination of conditions to surface it. It is also, critically, invisible to code review. A reviewer looking at the diff for Service A's deployment has no way to know that Service B's actual behavior has drifted from what Service A's tests assume. The information needed to catch this gap does not live in the code being reviewed. It lives in the current, real behavior of a system that the reviewer is not looking at. Why More Tests Do Not Solve This The instinctive response to this problem is to write more tests, and it is worth being explicit about why that instinct, while understandable, does not actually address the root cause. Adding more test cases against a static specification increases confidence in that specification. It does nothing to address the fact that the specification itself can become inaccurate the moment a dependency changes. A team can have excellent code coverage, a comprehensive integration test suite, and rigorous review standards, and still be exposed to this exact failure mode, because the problem is not insufficient testing. It is testing against an assumption that silently stopped being true. This is also why manual processes aimed at keeping integration assumptions current tend to break down at scale. The discipline required to track every downstream dependency, monitor every change, and update every corresponding mock or stub is real work that competes for the same engineering time as everything else on a team's plate. It works reasonably well with three services and a small team that has informal awareness of what changed recently. It does not scale to fifteen services with independent deployment schedules and rotating ownership, where no single person has visibility into every dependency's current state. What Actually Closes the Gap The structural fix for this category of failure requires a different source of truth than a specification written in the past. It requires validating deployments against what dependencies are actually doing right now, not what they were documented or assumed to do when an integration was first built. In practice, this means deriving test coverage and integration assumptions from observed, current system behavior rather than from manually maintained documentation that ages the moment it is written. When a service's actual current responses become the basis for validating what depends on it, the gap between specification and reality closes by construction, because there is no longer a static specification to drift away from in the first place. The validation is only ever as old as the most recent observation of real behavior, not as old as the last time someone remembered to update a mock file. This shift changes what passing a pre-deployment check actually means. A check that validates against current, observed behavior is verifying something meaningfully different from a check that validates against a frozen assumption. The former tells you the deployment is compatible with the system as it exists today. The latter only tells you the deployment is compatible with the system as someone believed it to exist at some point in the past. What This Means for How Teams Think About Deployment Risk The deeper implication here is about where deployment risk in distributed systems actually concentrates. It is tempting to think of risk as proportional to the size or complexity of the change being deployed. In practice, a significant share of the riskiest deployments are small, low-risk-looking changes to services that have quietly drifted out of sync with their dependencies over time, with nobody noticing because nothing forced the drift to surface. Treating software deployment safety as primarily a function of how thoroughly the changed code itself is tested misses where the actual exposure lives. The exposure lives at the seams between services, in assumptions that were correct once and were never revisited. Closing that gap requires validation infrastructure built around the same principle that makes any monitoring system trustworthy: it has to reflect what is actually happening now, not what was true when it was last updated. Teams that internalize this distinction tend to ask a different question before deploying. Not only "does this change pass its tests," but "are the assumptions this change depends on still accurate?" The first question is necessary. The second is the one that catches the failures the first one was never designed to see. More

WebSockets, gRPC, and GraphQL in the Core

By Shai Almog

CORE

Three connectivity features landed together this week, and they belong in one place because they build on each other. WebSockets moved into the core; the GraphQL client uses that same WebSocket support for subscriptions; and gRPC reuses the exact code-generation pattern GraphQL and OpenAPI already follow. This post is a tutorial for all three. By the end, you will have a live chat, a typed GraphQL client, and a typed gRPC client, and you will see how little code each one takes. These features come from PR #5133 (WebSockets) and PR #5141 plus PR #5099 (the typed clients). Part 1: WebSockets, No cn1lib Required WebSockets used to require the cn1-websockets cn1lib. They are now part of the framework as com.codename1.io.WebSocket, implemented natively on every port (a hand-rolled RFC 6455 handshake on JavaSE and Android, NSURLSessionWebSocketTask on iOS, the browser WebSocket on JavaScript), with no third-party dependencies pulled into your build. If you're using cn1-websockets you can keep using it. There's no change required from you. We moved the package up one level, so there's no conflict. Step 1: Open a Connection The new API is a final, fluent class with lambda handlers. You build it, attach handlers, and connect: Java // Good practice although in reality all current Codename One Platforms support WebSockets if (!WebSocket.isSupported()) { return; } WebSocket ws = WebSocket.build("wss://echo.example.com/socket") .onConnect(() -> Log.p("connected")) .onTextMessage(text -> addIncoming(text)) .onClose((code, reason) -> Log.p("closed " + code + " " + reason)) .onError(ex -> Log.e(ex)) .connect(); There is no URL-in-constructor subclassing trap from the old API; the connection is an object you hold. send(...) has a String and a byte[] overload, getReadyState() returns a WebSocketState, and close() does a clean close handshake. Step 2: Build the Chat Screen Here is a compact chat form. Outgoing messages are added immediately; incoming ones arrive on the onTextMessage handler, and because the handler can touch the UI we wrap that in callSerially: Java private WebSocket ws; private Container conversation; private void showChat(Form parent) { Form chat = new Form("Live Chat", BoxLayout.y()); conversation = chat.getContentPane(); TextField input = new TextField("", "Message", 20, TextField.ANY); Button send = new Button("Send"); send.addActionListener(e -> { String text = input.getText(); if (text.length() > 0 && ws != null) { ws.send(text); addBubble(text, true); input.clear(); } }); Container bar = BorderLayout.centerEastWest(input, send, null); chat.add(BorderLayout.SOUTH, bar); ws = WebSocket.build("wss://chat.example.com/room/general") .onTextMessage(text -> Display.getInstance() .callSerially(() -> addBubble(text, false))) .connect(); chat.show(); } private void addBubble(String text, boolean mine) { Label bubble = new Label(text); bubble.setUIID(mine ? "ChatBubbleMe" : "ChatBubbleThem"); Container line = FlowLayout.encloseIn(bubble); line.getStyle().setAlignment(mine ? Component.RIGHT : Component.LEFT); conversation.add(line); conversation.animateLayout(150); } That is a working real-time chat. The screen it produces, rendered in the simulator: Step 3: Negotiate a Subprotocol When You Need One If your server speaks a named subprotocol, set it during the handshake and read back what the server chose: Java WebSocket ws = WebSocket.build(url) .subprotocols("graphql-transport-ws") .onConnect(() -> Log.p("using " + ws.getSelectedSubprotocol())) .connect(); That graphql-transport-ws value is not an accident; it is exactly what the GraphQL subscriptions in the next part use. One reason to trust this implementation: our own screenshot CI now runs on it. The pipeline that ships rendered PNGs from each device back to the host machine uses a WebSocket as its transport, so the same code your app calls is carrying the binary payloads that validate the framework on every commit. Part 2: A Typed GraphQL Client cn1:generate-graphql turns a GraphQL schema into a typed client, and @GraphQLClient is the interface you write against. The runtime lives in com.codename1.io.graphql, and a GraphQLResponse<T> carries data and errors together so partial results survive. Step 1: Declare the Client Java @GraphQLClient("https://swapi.example.com/graphql") public interface StarWarsApi { @Query("query HeroName($episode: Episode) { hero(episode: $episode) { name homeworld { name } species { name } filmConnection { totalCount } } }") void hero(@Var("episode") Episode episode, OnComplete<GraphQLResponse<HeroData>> callback); @Subscription("subscription OnReview($ep: Episode!) { reviewAdded(episode: $ep) { stars } }") GraphQLSubscription onReview(@Var("ep") Episode ep, GraphQLSubscription.Handler<ReviewData> handler); static StarWarsApi of(String endpoint) { return GraphQLClients.create(StarWarsApi.class, endpoint); } } The build-time processor emits the implementation and a bootstrap that registers it; you never write the HTTP plumbing. The generator has two modes. The precise operations mode emits per-selection types from your operation documents; the schema-only quick-start mode auto-selects fields to a bounded depth (cn1.graphql.maxDepth). Step 2: Call It and Render the Result Java StarWarsApi api = StarWarsApi.of("https://swapi.example.com/graphql"); api.hero(Episode.EMPIRE, response -> { if (!response.isOk()) { return; } Container list = heroForm.getContentPane(); for (Hero h : response.getResponseData().heroes) { MultiButton row = new MultiButton(h.name); row.setTextLine2(h.homeworld + " . " + h.species); row.setUIID("HeroRow"); list.add(row); } heroForm.revalidate(); }); The list this populates, rendered in the simulator: Step 3: Subscriptions Ride the Core WebSocket A @Subscription returns a GraphQLSubscription backed by the core WebSocket using the graphql-transport-ws protocol from Part 1. New events arrive on the handler: Java GraphQLSubscription sub = api.onReview(Episode.JEDI, review -> Display.getInstance().callSerially(() -> showStars(review.stars))); // later sub.close(); This is the payoff of putting WebSockets in the core: the GraphQL layer did not need its own socket implementation; it just used the frameworks. Part 3: A Typed gRPC Client cn1:generate-grpc does the same trick for proto3. Point it at your .proto files and it emits hand-editable @ProtoMessage, @ProtoEnum, and @GrpcClient sources; the annotation processor generates the binary protobuf codecs and call sites into target/generated-sources so your source tree stays clean. There is no protoc dependency. Step 1: The Proto Java syntax = "proto3"; service Greeter { rpc SayHello (HelloRequest) returns (HelloReply); } message HelloRequest { string name = 1; } message HelloReply { string message = 1; } Step 2: Call the Generated Client Java GreeterGrpc g = GreeterGrpc.of("https://api.example.com"); HelloRequest req = new HelloRequest(); req.name = "world"; g.sayHello(req, "Bearer " + token, response -> { if (response.isOk()) { renderGreeting(response.getResponseData().message); } }); The wire protocol is gRPC-Web binary (application/grpc-web+proto), the standard variant for mobile and browser clients, which works with Envoy, the official grpcweb Go proxy, and the gRPC-Web filter in modern gRPC servers. Version one covers unary RPCs, all scalar types, nested messages, enums, and repeated fields; streaming, map<K,V>, well-known types, and import are out for now, and the parser errors cleanly when it meets one. Enums Bind Across All of It All three connectors share the build-time JSON and XML mapper, and that mapper now binds enums. Previously an enum field was treated as a nested reference, found no mapper, and silently did not serialize. It now writes with name() and reads with valueOf (unknown values decode to null), and it handles List<Enum>, across both JSON and XML. That is why the GraphQL Episode above is a real enum rather than a String, and it is a welcome fix for anyone using @Mapped directly. Keep Your Tokens Out of the Binary The gRPC and GraphQL samples pass a bearer token, so the rule bears repeating: never hard-code a token, and never check it into source or embed it in the app. Fetch it from your backend at runtime and store it with SecureStorage. A shipped binary can be unpacked, so anything baked into it is effectively public. These connectors learn from real specs. If a schema or a proto file does not generate the client you expected, please file an issue at github.com/codenameone/CodenameOne/issues with the source attached. The previous deep dive covered native Mac builds and desktop integration, and the release post has the full index. Tomorrow's post is the new advertising API. More

Building an AI Agent That Responds to Real-Time Events With AWS Bedrock, Kinesis, DynamoDB, and S3

By Jubin Abhishek Soni

CORE

Refcard #291

Code Review Core Practices

By Vidyasagar (Sarath Chandra) Machupalli FBCS

CORE

Refcard #403

Shipping Production-Grade AI Agents

By Vidyasagar (Sarath Chandra) Machupalli FBCS

CORE

Real-Time AI Feature Engineering With Spark Structured Streaming and Databricks Feature Store

The Feature Engineering Problem Feature engineering is where most ML projects silently fail in production. Not because the model is wrong — but because the features the model sees at training time are different from the features it sees at inference time. This is called training-serving skew, and it's the #1 silent killer of ML systems. Three specific failure modes cause it: Online/offline inconsistency – the batch pipeline that computes training features uses different logic than the real-time service that computes inference featuresData leakage – training features accidentally include information from the future (e.g., joining on a label that was created after the event)Feature staleness – a model trained on 30-day rolling averages is served features that are 6 hours stale because the pipeline backfills are slow The Databricks Feature Store — now part of Unity Catalog as Feature Engineering in Unity Catalog — solves all three by: Storing feature computation logic alongside the data (no drift between training and serving)Enforcing point-in-time lookups during training dataset creationProviding a unified API for both batch offline reads and low-latency online reads Architecture Overview Feature Store Concepts: ERD Understanding the data model behind the Feature Store is essential for designing correct pipelines. Here's how the entities relate: The critical relationship: a Model Version is bound to a Training Set, which records exactly which feature tables and which point-in-time lookups were used. This is how Databricks guarantees reproducibility — you can always re-create the exact training data that produced any model version. Environment Setup Python # Databricks Runtime ML 13.x+ recommended # Feature Engineering in Unity Catalog (formerly Feature Store) %pip install databricks-feature-engineering==0.6.0 --quiet dbutils.library.restartPython() from databricks.feature_engineering import FeatureEngineeringClient, FeatureLookup from databricks.feature_engineering.entities.feature_serving_endpoint import ( ServedEntity, EndpointCoreConfig ) from pyspark.sql import functions as F, SparkSession from pyspark.sql.types import ( StructType, StructField, StringType, LongType, DoubleType, TimestampType, ArrayType ) import mlflow spark = SparkSession.builder.getOrCreate() fe = FeatureEngineeringClient() # Unity Catalog paths CATALOG = "prod" FEATURE_DB = f"{CATALOG}.feature_store" EVENTS_TABLE = f"{CATALOG}.silver.events_clean" KAFKA_BROKER = "kafka-broker.internal:9092" KAFKA_TOPIC = "user-events" # Checkpoint locations (ADLS / S3 / GCS) CHECKPOINT_BASE = "abfss://[email protected]/features" Streaming Feature Pipeline The streaming pipeline reads from Kafka, computes windowed aggregations using Spark's stateful streaming engine, and writes features to the Feature Store via foreachBatch. This keeps the feature table continuously fresh. Python # ── Streaming Feature Pipeline ──────────────────────────────────────────────── # Step 1: Define the raw event schema from Kafka event_schema = StructType([ StructField("user_id", StringType(), False), StructField("event_type", StringType(), True), StructField("product_id", StringType(), True), StructField("revenue", DoubleType(), True), StructField("session_id", StringType(), True), StructField("platform", StringType(), True), StructField("event_ts", TimestampType(), False), ]) # Step 2: Read from Kafka raw_stream = ( spark.readStream .format("kafka") .option("kafka.bootstrap.servers", KAFKA_BROKER) .option("subscribe", KAFKA_TOPIC) .option("startingOffsets", "latest") .option("failOnDataLoss", "false") .load() .select( F.from_json(F.col("value").cast("string"), event_schema).alias("data"), F.col("timestamp").alias("kafka_ts") ) .select("data.*", "kafka_ts") ) # Step 3: Apply watermark and compute windowed features # Watermark: tolerate up to 10 minutes of late data windowed_features = ( raw_stream .withWatermark("event_ts", "10 minutes") .groupBy( F.col("user_id"), F.window(F.col("event_ts"), "1 hour", "15 minutes").alias("window") ) .agg( F.count("*").alias("event_count_1h"), F.sum(F.when(F.col("event_type") == "purchase", F.col("revenue")) .otherwise(0)).alias("revenue_1h"), F.countDistinct("session_id").alias("session_count_1h"), F.countDistinct("product_id").alias("unique_products_1h"), F.sum(F.when(F.col("event_type") == "purchase", 1) .otherwise(0)).alias("purchase_count_1h"), F.first("platform").alias("last_platform"), ) # Flatten window struct to scalar columns .withColumn("window_start", F.col("window.start")) .withColumn("window_end", F.col("window.end")) .withColumn("feature_ts", F.col("window.end")) # timestamp key for PIT lookup .drop("window") # Derived features .withColumn("conversion_rate_1h", F.when(F.col("event_count_1h") > 0, F.col("purchase_count_1h") / F.col("event_count_1h")) .otherwise(0.0)) .withColumn("avg_revenue_per_purchase_1h", F.when(F.col("purchase_count_1h") > 0, F.col("revenue_1h") / F.col("purchase_count_1h")) .otherwise(0.0)) ) # Step 4: Write to Feature Store via foreachBatch # foreachBatch gives us transactional writes per micro-batch def write_to_feature_store(batch_df, batch_id): """ Called on each micro-batch. Merges feature data into the Feature Store table using merge_on keys (user_id + feature_ts). """ if batch_df.isEmpty(): return fe.write_table( name=f"{FEATURE_DB}.user_activity_features", df=batch_df, mode="merge", # upsert: update existing, insert new ) print(f"Batch {batch_id}: wrote {batch_df.count()} feature rows") # Step 5: Create the feature table (idempotent — safe to re-run) try: fe.create_table( name=f"{FEATURE_DB}.user_activity_features", primary_keys=["user_id"], timestamp_keys=["feature_ts"], schema=windowed_features.schema, description=( "Real-time user activity features computed from event stream. " "1-hour sliding window, refreshed every 15 minutes. " "Primary key: user_id. Timestamp key: feature_ts (window end)." ), ) print("Feature table created.") except Exception: print("Feature table already exists — continuing.") # Step 6: Launch the streaming query streaming_query = ( windowed_features.writeStream .outputMode("update") # update mode for stateful aggregations .option("checkpointLocation", f"{CHECKPOINT_BASE}/user_activity") .trigger(processingTime="5 minutes") # micro-batch every 5 min .foreachBatch(write_to_feature_store) .start() ) print(f"Streaming query '{streaming_query.name}' running...") print(f"Status: {streaming_query.status}") Point-in-Time Correct Training Dataset Generation This is the most critical part of the Feature Store. When creating training data, we must join labels to features at the timestamp of the label event — not the current time. This prevents data leakage. Python # ── Point-in-Time Correct Training Dataset ──────────────────────────────────── # Step 1: Load the label dataset # Each row = one prediction target event, with the exact timestamp # at which a model would have needed to make a prediction. labels_df = ( spark.table(f"{CATALOG}.gold.churn_labels") .select( "user_id", "churn_label", # 0 = retained, 1 = churned F.col("observation_ts").alias("event_timestamp"), # point-in-time anchor "experiment_split" # train/val/test ) .filter(F.col("observation_ts") >= "2024-01-01") ) print(f"Label rows: {labels_df.count():,}") labels_df.show(5) # +----------+-----------+---------------------+-----------------+ # | user_id |churn_label| event_timestamp | experiment_split| # +----------+-----------+---------------------+-----------------+ # | u_123456 | 0 | 2024-03-15 14:22:00 | train | # | u_789012 | 1 | 2024-03-15 18:45:00 | train | # Step 2: Define feature lookups # as_of_timestamp=None → use the label's event_timestamp (point-in-time) # Databricks will join each label row to the feature values # that were valid at event_timestamp — not the latest values. feature_lookups = [ # User activity features — 1h window features from the streaming pipeline FeatureLookup( table_name=f"{FEATURE_DB}.user_activity_features", feature_names=[ "event_count_1h", "revenue_1h", "session_count_1h", "unique_products_1h", "purchase_count_1h", "conversion_rate_1h", "avg_revenue_per_purchase_1h", "last_platform", ], lookup_key="user_id", timestamp_lookup_key="event_timestamp", # ← PIT anchor ), # User profile features — slower-changing, from batch pipeline FeatureLookup( table_name=f"{FEATURE_DB}.user_profile_features", feature_names=[ "account_age_days", "lifetime_revenue", "preferred_category", "subscription_tier", ], lookup_key="user_id", timestamp_lookup_key="event_timestamp", # ← PIT anchor ), # Transaction aggregates — 30d and 90d rolling windows FeatureLookup( table_name=f"{FEATURE_DB}.transaction_features", feature_names=[ "purchase_count_30d", "purchase_count_90d", "avg_order_value_30d", "days_since_last_purchase", "category_diversity_score", ], lookup_key="user_id", timestamp_lookup_key="event_timestamp", ), ] # Step 3: Create training dataset (Feature Store handles the PIT join) training_set = fe.create_training_set( df=labels_df, feature_lookups=feature_lookups, label="churn_label", exclude_columns=["observation_ts", "experiment_split"], ) # The returned DataFrame has features + labels, PIT-correct training_df = training_set.load_df() print(f"Training rows: {training_df.count():,}") print(f"Training cols: {len(training_df.columns)}") training_df.show(3) # Step 4: Train model and log via Feature Store (preserves lineage!) from sklearn.ensemble import GradientBoostingClassifier import pandas as pd train_pdf = ( training_df .filter(F.col("experiment_split") == "train") .drop("experiment_split", "user_id") .fillna(0) .toPandas() ) X_train = train_pdf.drop(columns=["churn_label"]) y_train = train_pdf["churn_label"] model = GradientBoostingClassifier( n_estimators=300, learning_rate=0.05, max_depth=5, subsample=0.8, random_state=42, ) with mlflow.start_run(run_name="churn-gbm-v1") as run: model.fit(X_train, y_train) # Log model via Feature Store — this records the feature lineage fe.log_model( model=model, artifact_path="churn_model", flavor=mlflow.sklearn, training_set=training_set, # ← binds model to its feature lookups registered_model_name=f"{CATALOG}.ml.user_churn_model", ) print(f"Logged model with feature lineage. Run: {run.info.run_id}") Writing Features to the Online Store For real-time inference, the model needs features in milliseconds — not the seconds it takes to query Delta Lake. Databricks Feature Store can publish features to an online store (DynamoDB, Cosmos DB, MySQL, etc.) for low-latency reads. Python # ── Publish Features to Online Store ───────────────────────────────────────── # Online stores are configured per feature table. # Here we publish user_activity_features to DynamoDB for <5ms lookups. from databricks.feature_engineering.entities.feature_store_online_table import ( OnlineTable, OnlineTableSpec, TriggeredSchedulingPolicy ) # Create an online table spec (backed by a serverless real-time compute layer) online_table_spec = OnlineTableSpec( primary_key_columns=["user_id"], source_table_full_name=f"{FEATURE_DB}.user_activity_features", run_triggered=OnlineTableSpec.TriggeredSchedulingPolicy(), # sync on-demand # OR for continuous sync: # run_continuous=OnlineTableSpec.ContinuousSchedulingPolicy() ) # Create the online table (idempotent) online_table = fe.create_online_table(spec=online_table_spec) print(f"Online table: {online_table.name}") print(f"Status: {online_table.status.detailed_state}") # Trigger an initial sync from the offline Delta table to the online store fe.refresh_online_table(name=f"{FEATURE_DB}.user_activity_features") Serving Features at Inference Time At inference time, the Feature Store SDK performs automatic feature lookups, joining the incoming request data with features from the online store before passing them to the model. Python # ── Real-Time Feature Serving at Inference ──────────────────────────────────── import requests, json WORKSPACE_URL = "https://<workspace>.azuredatabricks.net" TOKEN = dbutils.secrets.get("prod-scope", "databricks-token") # Option 1: Model Serving with automatic feature lookup # When you logged the model with fe.log_model(), Databricks knows which # features to fetch. You only send the lookup key (user_id) at inference time. def predict_churn(user_ids: list) -> list: """ Send only user_id — the serving endpoint fetches features automatically from the online store and runs inference. """ payload = { "dataframe_records": [ {"user_id": uid} for uid in user_ids ] } resp = requests.post( f"{WORKSPACE_URL}/serving-endpoints/churn-predictor/invocations", headers={ "Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json", }, data=json.dumps(payload), timeout=5, ) resp.raise_for_status() return resp.json()["predictions"] # Example usage predictions = predict_churn(["u_123456", "u_789012", "u_345678"]) for uid, pred in zip(["u_123456", "u_789012", "u_345678"], predictions): print(f"{uid}: churn_probability = {pred:.4f}") # u_123456: churn_probability = 0.0821 # u_789012: churn_probability = 0.7643 # u_345678: churn_probability = 0.1209 # Option 2: Direct feature lookup via the Feature Serving endpoint # Useful when you want raw features without running inference def get_features(user_ids: list) -> dict: payload = { "dataframe_records": [{"user_id": uid} for uid in user_ids] } resp = requests.post( f"{WORKSPACE_URL}/serving-endpoints/user-features-serving/invocations", headers={ "Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json", }, data=json.dumps(payload), timeout=5, ) return resp.json() # Option 3: Batch scoring (offline) — uses Delta offline store # No online store needed; reads directly from the feature table with PIT lookup batch_labels = spark.table(f"{CATALOG}.gold.users_to_score_today") \ .select("user_id", F.current_timestamp().alias("event_timestamp")) batch_predictions = fe.score_batch( model_uri=f"models:/{CATALOG}.ml.user_churn_model@champion", df=batch_labels, result_type="double", ) batch_predictions.select("user_id", "prediction") \ .write.format("delta").mode("overwrite") \ .saveAsTable(f"{CATALOG}.gold.churn_scores_daily") Feature Table Reference A summary of the feature tables in our pipeline, their update cadence, and their role in the ML lifecycle: Feature TablePrimary KeyTimestamp KeyUpdate MethodLatencyUsed Inuser_activity_featuresuser_idfeature_tsSpark Structured Streaming~5 minReal-time churn, recommendationtransaction_featuresuser_idfeature_tsScheduled batch (hourly)~60 minChurn, LTV predictionuser_profile_featuresuser_idupdated_atCDC from OLTP (near real-time)~2 minAll modelsproduct_featuresproduct_idfeature_tsScheduled batch (daily)~24 hrRecommendation, search rankingsession_featuressession_idsession_end_tsStreaming (micro-batch)~1 minClick-through rate, abandon predictioncohort_featurescohort_idcomputed_atWeekly batch~7 daysSegmentation, A/B analysis Freshness vs cost tradeoff: Streaming features are ~10× more expensive to compute than batch features (continuous cluster vs scheduled job). Only promote a feature to streaming if your model's performance degrades meaningfully with stale data — validate this with an offline ablation study first. Key Takeaways Training-serving skew is the silent killer of production ML — the Feature Store eliminates it by encoding feature computation logic once and using it in both training and serving paths.Point-in-time correct joins via timestamp_lookup_key are non-negotiable for any model trained on time-series data. A missing event_timestamp in your label table is a data leakage bug waiting to happen.fe.log_model() is the right model logging call, not mlflow.sklearn.log_model(). It records feature lineage, enabling reproducible re-training and automatic feature lookup at serving time.Watermarks in Structured Streaming are critical for stateful aggregations — without them, Spark accumulates state indefinitely and the job eventually OOMs. Set them to the maximum tolerable late-data window.Online stores are only worth the operational cost when your SLA is under ~100ms. For batch scoring jobs or APIs with >500ms budgets, read directly from the offline Delta table.fe.score_batch() is the cleanest way to run periodic batch inference — it handles PIT feature lookups automatically, keeps inference logic DRY, and logs results to Delta for downstream consumers. References Databricks — Feature Engineering in Unity Catalog (Overview)Databricks — Create and Manage Online TablesDatabricks — Point-in-Time Feature LookupsApache Spark — Structured Streaming Programming GuideApache Spark — Streaming Watermarks for Late Data HandlingDatabricks — Feature Store Python API ReferenceDatabricks — Score Batch with Feature Store"Feature Stores for ML" — Feast Documentation (open-source reference)"Rethinking Feature Stores" — Chip Huyen (huyenchip.com)Databricks — Model Serving with Automatic Feature Lookup"Building Machine Learning Pipelines" — Hannes Hapke & Catherine Nelson (O'Reilly)

By Jubin Abhishek Soni

CORE

From Pilot to Production: The Six Agent Patterns That Determine Whether Your AI Program Scales or Stalls

We've been running AI agents in production across enterprise cloud support for several years now. I've watched the same pattern play out dozens of times across organizations of every size: a team builds a compelling pilot, leaders get excited, and then... it stalls. Not because the technology failed. Because the operating model was never designed for what agents actually do when they stop assisting humans and start executing work on their behalf. This isn't a failure of ambition. It's a failure of classification. Organizations treat all agent initiatives the same way, same governance, same ownership model, same success metrics — and then wonder why agents that draft emails scale easily while agents that process workflows create governance crises by agent fifty. The problem isn't building agents. The problem is that nobody designed an operating model for what agents do when they stop assisting and start executing. The Shift That Changes Everything There's a deceptively simple transition happening in enterprise AI that most architecture conversations skip over. AI agents are moving from assisting humans to executing work. On the surface, this sounds like an incremental capability improvement. In practice, it changes everything about how you govern, own, and operate them. In assist mode, the agent supports human decision-making. The human decides what to do. The human executes the action. The human is fully accountable. The governance model is familiar because it's essentially the same as any other software tool: set some usage policies, manage access, track adoption. Low risk. Familiar territory. In execute mode, the agent performs work across systems. The agent acts on decisions. The agent orchestrates multi-step workflows. The human oversees outcomes rather than approving each action. This creates four new demands that most organizations are completely unprepared for: Who is accountable for this agent? What happens when it goes wrong? Who maintains and improves it over time? What is it allowed to do and not do? These questions sound simple. In my experience, most organizations cannot answer even one of them clearly for their production agents. That's the gap. And it's why agents stall. Six Patterns, Six Operating Models The most useful insight I can share from production experience is this: not all agent initiatives are the same, and treating them the same is what breaks scale. An agent that drafts emails for individuals is a completely different organizational bet than an agent that processes support requests autonomously. They require different governance, different ownership models, different success metrics, and different levels of organizational maturity. In practice, I've found it useful to think about agent work in six distinct patterns, each with its own operating requirements. These are design choices, not stages; most organizations run two or three simultaneously. Pattern 1: Employee AI Enablement Every employee uses AI assistants for research, drafting, summarization, and personal workflow automation. The human retains full decision-making authority; the agent recommends, the human decides. This is the most accessible pattern and the right starting point for most organizations. What most teams get wrong here: they treat this as a technology deployment rather than a behavior change program. The technology is the easy part. Getting people to actually change how they work to build the habit of using agents rather than falling back to familiar processes requires visible leadership role-modeling, continuous enablement, and a community that celebrates and shares what works. Licenses do not become usage on their own. Pattern 2: Business Expert Empowerment An expert's knowledge — in compliance, engineering standards, risk assessment, regulatory interpretation — is captured and scaled across the organization through an agent. The expert shifts from answering every question to teaching the agent and auditing its output. The critical insight here: the agent's credibility IS the product. If the agent gives wrong expert advice, you damage the expert's reputation and potentially the business. I've seen this pattern fail repeatedly because teams focused on building the agent and ignored knowledge quality controls. The agent is only as good as its source documents. If you cannot guarantee those documents are authoritative, current, and complete, you should not deploy this pattern. Pattern 3: Workplace and IT Services Agents operate internal services end-to-end: IT helpdesk, HR, Finance, Facilities. These agents don't just answer questions; they execute service workflows: processing leave requests, provisioning access, validating expenses, routing procurement. The scale-breaker I see consistently: teams automate individual tasks without redesigning the service flow. You end up with islands of automation that don't connect to a faster intake process that feeds into the same manual triage queue. Design the service first. Then build the agents. Pattern 4: Core Business Process Transformation Agents run core enterprise processes end-to-end: claims processing, order-to-cash, financial close, supply chain coordination. These are business-critical workflows where agents make decisions — not just suggestions — with direct impact on revenue, cost, and customer experience. This is where I see the most governance failures. Organizations apply the same lightweight controls they used for productivity agents to business-critical autonomous workflows. The result is agents making consequential decisions without audit trails, escalation paths, or defined autonomy limits. This pattern demands depth everywhere — there's no capability driver you can shortcut. Pattern 5: External Engagement Agents interact directly with customers, partners, or ecosystem stakeholders — crossing the enterprise trust boundary. Every interaction affects brand, reputation, and customer trust. Errors are visible externally. The non-negotiable here: external agents need higher governance and security maturity than any internal pattern because one bad customer interaction from an unsupervised agent is a brand crisis. Disclosure, consent, identity isolation, and real-time monitoring are not optional. Neither is a 15-minute incident response plan. Pattern 6: AI-First Capabilities Net new capabilities designed with agents as the core building block things that weren't possible before AI. Agents operate in sense-decide-act loops: continuously monitoring signals, making autonomous decisions within boundaries, executing actions, and learning from outcomes. This pattern demands the highest maturity across all capability dimensions. There's no existing process to compare against, no baseline to measure improvement from. Everything must be built — including how you measure success. Your pattern determines WHERE you invest, not just how much. Starting with the wrong pattern for your maturity level is a primary reason agents stall. The Maturity Trap Here's the mistake I see most often: organizations pick an ambitious pattern — say, core business process transformation — without honestly assessing whether their organizational capabilities can support it. They have Level 1 maturity in business strategy and governance but Level 3 technology infrastructure, and they convince themselves the technology readiness compensates for the organizational gaps. It doesn't. Maturity in this context spans five dimensions: how deliberately you plan and invest in AI strategy; how deeply AI is integrated into business processes and outcome measurement; how well you manage risk, compliance, and responsible AI; how mature your platforms, architecture, and data quality are; and how effectively you enable adoption and build an AI-positive culture. The critical insight is that your weakest dimension becomes your ceiling, regardless of how strong the others are. I've watched organizations with world-class AI infrastructure fail to scale agents because they had no governance model and no named owners for production agents. The technical foundation was irrelevant; the agents couldn't be trusted in production because nobody knew who was accountable when something went wrong. The goal is not to reach maximum maturity everywhere. Different patterns require different maturity depths across different dimensions. Your job is to identify which pattern you're pursuing, assess where you are today, find the biggest gap, and fix that first. The biggest gap is your scale-breaker. Five Scale-Breakers I've Seen in Production After working across multiple AI agent deployments, these are the patterns I see breaking scale most consistently: 1. Many Pilots, No Portfolio Agents aren't tied to measurable business outcomes. Each team builds something interesting, but there's no portfolio view, no named business owners, no defined success metrics. The fix: pick one or two outcomes, pick one or two patterns, name an owner for each, and define what success looks like before you build. 2. One-Off Agents, No Reuse Every team reinvents the wheel because there's no shared reference architecture, no standardized integration approach, and no common telemetry baseline. Each agent is a bespoke build that can't share components with anything else. At agent fifty, your maintenance burden is fifty independent systems. 3. Great Demos, Low Adoption The AI experience isn't designed end-to-end. Users don't know when to use the agent, what it can do, or how to validate its outputs. The fix: define golden paths for your top scenarios, how users engage, what's automated versus human-approved, and how exceptions are handled. 4. Licenses Don't Equal Usage Enablement and change management aren't systematic. There's no community, no training program, no champions network, no incentives tied to new ways of working. You can deploy Copilot to 10,000 employees and have 200 active users if you don't build a sustained enablement motion. 5. Shadow Agents Appearing Governance isn't operational. Teams build agents outside official channels because the official path is too slow or unclear. The fix isn't more process; it's making the safe path the easy path. Implement a minimum baseline: named owner, audit trail, release gate, monitoring, escalation path. Make that baseline so easy to satisfy that going around it takes more effort than using it. The Operating Model That Actually Works The operating model question that matters most is not 'what technology should we use' but 'who owns this agent, what happens when it goes wrong, and how does it improve over time.' In my experience, the organizations that scale agents successfully share three operating model characteristics that struggling organizations consistently lack. First, they treat agents as products, not projects. A project ends when the agent is deployed. A product has an owner, a monitoring plan, a feedback loop, and a defined path to improvement or retirement. Every agent in production without monitoring and an improvement plan is accumulating risk — knowledge goes stale, integrations break, user patterns change. Agents don't fail dramatically; they slowly drift, giving increasingly wrong answers with full confidence. That's worse than a crash, because nobody notices. Second, they govern proportionately to risk. They don't apply the same controls to a personal productivity agent that they apply to an agent processing financial transactions. Low-risk agents get lightweight controls — named owner, basic monitoring, standard release checklist. High-risk agents get production-grade SLA monitoring, security reviews, responsible AI assessments, decision rights frameworks, and incident response plans. Over-governing low-risk agents kills adoption. Under-governing high-risk agents creates liability. Third, they centralize how scale works, not who builds everything. The central team sets standards, manages platforms, runs community programs, and provides governance guardrails. Domain teams build and own agents within those guardrails. The central team's primary job is enablement, not control. Make the safe path the easy path. Agents don't scale through technology. They scale through people, ownership, and operating discipline. You don't need a bigger model. You need a better operating model. What I'd Do Differently If I were starting an enterprise agent program from scratch today, here's what I would prioritize differently based on production experience: Name an owner before you build. Not a team, a person. The accountability gap is the single most common failure point I see. When something goes wrong with an agent that 'the team' owns, nobody fixes it promptly because everyone assumes someone else is handling it. Run your maturity diagnostic before picking your pattern. Be honest about where you actually are, not where you aspire to be. A realistic assessment of your weakest dimension will tell you more about what pattern you're ready for than any technology readiness assessment. Deploy monitoring on day one, not after adoption. I have seen too many teams treat monitoring as a phase-two concern. By the time phase two arrives, there are already production agents with no visibility into accuracy, drift, or escalation patterns. If you can't monitor it, you can't trust it. Build your first agent for reuse, not just for the use case. The architectural decisions you make in your first production agent — how you handle telemetry, how you structure knowledge sources, how you design escalation paths — become the template every subsequent agent follows. Get those decisions right early, and the fiftieth agent will be easier to build, deploy, and operate than the fifth. The Bottom Line The technical capability to build production-grade AI agents exists today. The constraint is organizational. Most enterprises are running a twenty-first-century technology capability on a twentieth-century operating model — and wondering why it keeps stalling. The organizations winning with agents are not necessarily the ones with the best models or the most compute. They're the ones that figured out ownership, governance, and lifecycle discipline before they scaled. They built operating models designed for agents that execute — not just agents that assist. That shift from assist to execute is the one that changes everything. And it's the one most organizations are still not prepared for.

By BALAJI BARMAVAT

Beyond Root Cause: Building Effective Blameless Postmortems for Cloud-Native Systems

Production incidents are inevitable. No matter how much testing, automation, observability, or resilience engineering an organization invests in, complex distributed systems will eventually fail in unexpected ways. The real differentiator between high-performing engineering organizations and everyone else is not whether incidents occur — it is how effectively organizations learn from them. Unfortunately, many root cause analysis (RCA) processes fail to achieve this objective. Instead of uncovering systemic weaknesses, they often focus on identifying a single mistake, a specific engineer, or a single technical failure. The resulting report may satisfy a compliance requirement, but it rarely produces meaningful improvements in reliability. As cloud-native architectures become increasingly distributed and interconnected, organizations must evolve beyond traditional RCA practices and adopt blameless postmortems that focus on organizational learning and continuous improvement. The Traditional RCA Trap Most incident investigations begin with a simple question: "What caused the outage?" At first glance, this seems reasonable. However, the question itself often leads teams toward finding a single root cause. Common conclusions include: An engineer deployed an incorrect configuration.A database migration introduced an error.An operator executed the wrong command.A monitoring alert was ignored.A service exceeded capacity limits. While these statements may be factually correct, they often represent only the final event in a much larger chain of failures. Consider a scenario where a configuration change causes a critical service outage. A traditional RCA might conclude: The outage occurred because an engineer deployed an invalid configuration file. While technically true, this explanation leaves many important questions unanswered: Why was the invalid configuration allowed into production?Why did automated validation fail to detect the issue?Why did monitoring not identify the problem immediately?Why was the blast radius so large?Why was rollback difficult?Why did recovery take longer than expected? These questions often reveal the real opportunities for improvement. Modern Incidents Rarely Have a Single Root Cause One of the most important lessons from operating distributed systems is that incidents are almost never caused by a single failure. Modern cloud environments contain thousands of interacting components: Microservices, APIs, Databases, Service meshes, Kubernetes clusters, CI/CD pipelines, Infrastructure automation, Third-party dependencies A seemingly simple outage often emerges from a combination of factors. For example: Contributing FactorImpactIncomplete testingAllowed faulty configurationMissing safeguardsFailed to block deploymentWeak observabilityDelayed detectionDocumentation gapsSlowed troubleshootingComplex architectureIncreased blast radiusManual recovery processExtended outage duration No single factor caused the outage. Rather, the outage occurred because multiple layers of defense failed simultaneously. This is why mature organizations increasingly focus on contributing causes rather than searching for a single root cause. What Does "Blameless" Actually Mean? One of the most misunderstood concepts in incident management is the idea of a blameless postmortem. Some teams incorrectly assume that blameless means avoiding accountability. It does not. Blameless means recognizing that engineers make decisions based on the information available to them at a given moment. During an active incident: Information is incomplete.Time pressure is high.Monitoring signals may be conflicting.Customer impact is increasing.Stress levels are elevated. The objective of a postmortem is therefore not to judge whether an individual made a perfect decision. The objective is to understand: Why the decision seemed reasonable at the time.What information was available.What information was missing.What systemic conditions contributed to the outcome. When teams focus on learning instead of blame, they become far more willing to share details openly and honestly. Anatomy of an Effective Postmortem High-quality postmortems typically follow a structured approach. 1. Incident Summary Begin with a concise overview: What happened?When did it occur?How long did it last?Who was affected?What was the business impact? Example: "On March 12, Service X experienced elevated latency following a configuration deployment. Approximately 15% of customer requests failed for 42 minutes before service was fully restored." 2. Timeline Reconstruction The timeline is often the most valuable section of a postmortem. Document key events chronologically: TimeEvent09:00Deployment initiated09:05Error rate increased09:08Customer complaints received09:12Incident declared09:18Rollback initiated09:25Error rate returned to normal09:42Incident resolved A detailed timeline helps teams understand exactly how events unfolded. 3. Contributing Factors Analysis Rather than searching for a single root cause, identify all meaningful contributors. Examples include: Technical Contributors Configuration validation gapsCapacity limitationsMonitoring deficienciesDependency failuresArchitectural constraints Process Contributors Incomplete deployment reviewsMissing runbooksEscalation delaysLack of disaster recovery testing Organizational Contributors Knowledge silosStaffing limitationsUnclear ownership boundariesTraining gaps The goal is to build a complete picture of the incident. 4. Recovery Assessment Analyze the effectiveness of the response. Questions worth asking: Was detection timely?Were alerts actionable?Was ownership clear?Did responders have the necessary tools?Were runbooks useful?Could recovery have been automated? Many organizations discover that recovery challenges contribute more customer impact than the original failure itself. The Five Whys: Useful But Limited Many organizations use the "Five Whys" technique. Example: 1. Why did the outage occur? Because a configuration was invalid. 2. Why was it invalid? Because validation checks were incomplete. 3. Why were validation checks incomplete? Because a new deployment framework was introduced. 4. Why was the framework deployed without complete validation? Because release deadlines prioritized delivery. 5. Why were deadlines prioritized? Because organizational risk was underestimated. The Five Whys can uncover valuable insights. However, distributed systems are rarely linear. Multiple parallel factors often contribute simultaneously. Treat them as one investigative tool, not the entire analysis framework. Turning Findings Into Action A postmortem without action items is merely documentation. Every significant finding should produce a measurable improvement initiative. Examples include: FindingActionConfiguration errors reach productionAdd automated validationDetection delayed by 10 minutesImprove alert coverageRollback requires manual interventionImplement automated rollbackTroubleshooting knowledge unavailableCreate operational runbooksRecovery depends on expertsExpand team training Action items should be: specific, assigned, prioritized, and trackable. Without ownership, lessons learned quickly become lessons forgotten. Measuring Postmortem Effectiveness Many organizations measure success by counting completed postmortems. A more meaningful approach is measuring operational improvement. Consider tracking: Mean time to detect (MTTD)Mean time to recover (MTTR)Repeat incident frequencyAutomated recovery rateManual intervention reductionCustomer impact reduction The ultimate goal is not producing better reports. The goal is producing more resilient systems. The Future: AI-Assisted Incident Learning As incident management platforms evolve, AI is beginning to transform postmortem creation. Modern systems can automatically: Build incident timelinesCorrelate alertsSummarize communication channelsExtract remediation actionsIdentify recurring failure patternsGenerate draft postmortems This allows responders to spend less time gathering information and more time analyzing systemic weaknesses. However, AI should augment human investigation — not replace it. Understanding organizational context, operational tradeoffs, and architectural decisions still requires human expertise. Final Thoughts The most valuable outcome of an incident is not service restoration. It is learning. Organizations that focus solely on identifying who made a mistake often repeat the same failures. Organizations that focus on understanding how their systems allowed failures to occur continuously improve their resilience. Blameless postmortems shift the conversation from: "Who caused this incident?" to "What can we learn from this incident, and how can we make the system stronger?" That mindset is ultimately what transforms incident management from a reactive operational function into a strategic capability that improves reliability, resilience, and engineering excellence over time.

By Akshay Pratinav

Multi-Agent Software Engineering: One Coding Agent Isn't Enough

Coding agents are good now. They can write a function, fix a failing test, or walk you through a chunk of legacy code you'd rather not read. That part is settled. The harder question is what happens when you hand one a real piece of delivery work, something that has to change the database and the API and the UI and the tests all together, and keeps running long after you've stepped away from your desk. That's usually where a single agent starts to struggle, and it isn't because the model isn't smart enough. The limit is human attention. A team might have fifty things sitting in its backlog that an agent could help with, but somebody still has to scope each one, keep an eye on it, review what comes back, and confirm it actually works. So you can generate code far faster than before and still ship at about the same pace. The slow part just moved. Long delivery work is a different animal from a quick coding task. It needs someone to hold the scope steady, keep the architecture consistent from one file to the next, make sure the tests check what the feature is meant to do rather than what the code happens to do, review the result, and hand off cleanly to whatever comes next. Ask one agent to carry all of that in a single context window across a long run, and it tends to drift. You've probably watched it happen: it loses the plot halfway through, writes tests that pass only because they were shaped around the code it just produced, uses one pattern here and a different one three files over, rebuilds something that already existed, and then can't quite tell you what it finished and what it didn't. So you read every diff yourself. The agent writes code, and you're still doing the planning, reviewing, QA, and firefighting. There's a limit to how far that stretches. From One Agent to a Team A more workable setup is to stop giving one agent the whole job and split it the way a functioning team already does. One agent plans the work, another builds it, another checks it. Three roles get you most of the way. RoleResponsibilityOrchestratorUnderstands the goal, asks the clarifying questions, writes the plan, sets milestones, and decides how the work is sequenced.WorkerImplements one feature from clean context and commits it in a controlled way.ValidatorChecks the implementation independently, runs the checks, verifies behavior, and flags follow-up work. Keeping the building and the checking in different hands matters for the same reason people review each other's code. Whoever wrote it is invested in it working, and that bias is hard to spot from the inside. A fresh agent that had no part in those decisions tends to catch what the author missed. How Agents Coordinate Underneath the roles, the agents end up talking to each other in a few recurring ways, and it helps to have names for them. Delegation is the obvious one, and usually the first that teams build. An agent hands a scoped task to another and waits for the result. Creator-verifier is the one that matters most for software. One agent writes the code and a separate one, working from its own context, checks it. That separation is what stops an agent from grading its own homework. Direct communication lets agents talk without a coordinator in the middle. It's tempting and it's fragile, since state scatters across separate conversations and sooner or later somebody acts on something out of date. Negotiation is what happens when agents share a resource, which for us usually means the codebase. Two agents about to edit the same file have to work out who does what before they overwrite each other. Broadcast is one agent telling the rest about something that changed, like a new constraint or a failure everyone needs to know about. It's the least exciting of the five, and the one that quietly keeps the long run from falling out of sync. Define "Done" Before Any Code Gets Written Settling what "correct" means before anyone writes code does more for reliability than any amount of prompt tuning. It heads off a specific and very common failure. An agent builds a feature, then writes tests that wrap neatly around the feature it just built. Everything passes, coverage looks healthy, and none of it tells you whether the feature does what was actually asked for. Tests written after the code mostly confirm whatever the code already does. They don't find the bugs. A validation contract flips that order. During planning, before there's any code, you write down what the feature has to do: the behavior that has to exist, the edge cases that matter, the flows that have to work, the regressions you can't allow. A small change might need a handful of those. A big feature can need hundreds, spread across the backend, the API, the front end, and the full end-to-end paths. Each one gets tied to a feature, and a feature isn't finished until it satisfies the ones assigned to it. The effect is that "done" gets defined separately from however the code happens to come out. Workers build against the contract, validators check against it, and you stop relying on whether the code looks right and start measuring whether it works. Passing Tests Aren't the Same as Working Software You still want lint, type checks, unit tests, and code review. The trouble is that once an agent is shipping whole features on its own, those checks stop being enough. Plenty of changes pass every unit test and are still broken where it counts. The form renders fine, but the submit button does nothing. The endpoint returns exactly the right shape, filled with stale data. A flow that worked in isolation falls apart once it sits behind a login. A migration runs clean on a laptop and chokes on production-scale data. So the better systems add a validator that works more like a QA engineer than a linter. It launches the app, clicks around, fills in forms, and confirms the whole path works end to end. That's slow, and on a long task it's where most of the wall-clock time goes: not generating tokens, but waiting on a live application to do something and watching what it does. The trade is worth it, since generating code quickly without really checking it only gets you to the wrong answer faster. In one production run an engineer at Factory described, building a clone of Slack, the project finished with about half its lines of code being tests, and roughly 90% coverage, and the validation step never passed on its first try. That last part is the whole reason the loop exists. Long Runs Can't Rely on Memory Run something for hours or days and context starts leaking between the agents. A bigger context window doesn't really fix it. What helps is not letting a worker close out a task by simply announcing it's done. Instead, each worker leaves a written handoff: what it built, which files it touched, which commands it ran and how they exited, what it assumed along the way, what it ran into, and what it left unfinished. That makes the run auditable. When validation fails, the orchestrator reads back through the handoffs, works out where things went sideways, scopes the fix, and pulls the run back on track at the next milestone instead of discovering the mess at the very end. The teams who make this work don't count on their agents remembering anything; they write enough down that the next agent can safely pick up where the last one stopped. Factory has reported runs lasting as long as sixteen days on this kind of setup. More Agents Isn't More Throughput The instinct is to run everything in parallel. Ten agents should mean ten times the work, right? For software, it usually doesn't play out that way. Agents running at the same time tend to edit the same files, redo work that's already done, and make architectural choices that don't line up with each other. The effort of untangling all that eats whatever speed you gained, and you pay for the conflict in tokens on top of it. What works better is to run the actual changes one at a time and save the parallelism for read-only work, like searching the codebase, reading docs, looking up an API, or reviewing code. On paper that's slower. Over a long task it comes out ahead, because you spend far less time cleaning up conflicts, the handoffs stay cleaner, and the whole thing behaves more predictably. Pile on more agents without coordinating them and you don't get speed so much as a codebase that disagrees with itself. The Right Model in Each Seat These systems also change how you pick models, because no single model is the right choice for every seat. Planning tends to go better with a model that reasons slowly and carefully. Writing code rewards speed and fluency instead. Checking the work rewards something closer to stubbornness: following the instructions exactly and giving nothing the benefit of the doubt. The model that writes the best code is often not the one you'd trust to grade it. There's even a case for running the validator on a different provider, so it doesn't carry the same blind spots as the model that wrote the code. That's the argument for staying model-agnostic. You want to put the right model in each role and swap it out as models get better at particular things, rather than getting stuck with one vendor's weakest area showing up everywhere. It works in the other direction too. A solid scaffolding of contracts, checkpoints, and independent validators can prop up a weaker or open-weight model and get more out of it than it would manage alone. Most of the orchestration in these systems lives in prompts and skills rather than hardcoded logic, which is the reason a new model release tends to make them better instead of obsolete. The Case for Fewer Agents Everything up to here makes the case for splitting work across agents, so it's only fair to take the strongest counterargument seriously. In 2025, the team behind Devin put out a post titled "Don't Build Multi-Agents," and the heart of it is hard to dismiss. They argue that most multi-agent failures come down to context getting fragmented. When you fan work out to parallel subagents, each one quietly makes its own assumptions, and those assumptions don't reconcile when the pieces come back together. One subagent picks a naming convention, another picks a different one, and you're left with something that reads as coherent but doesn't actually fit. Their advice is to keep one agent on a single thread and compress the context as it grows instead of spreading it across a crowd of workers. Anthropic landed somewhere close, though more conditional, when it wrote up its own multi-agent research system around the same time. Splitting work across agents paid off for broad, parallel tasks like searching many sources at once, but it struggled on anything that needed one shared context and tight coordination, which is most of what software work is. Both write-ups end up pointing at the same shape described here. Don't run agents in parallel on tightly coupled work. Split the work by role, and let the coupled parts happen in order. What the Failure Data Shows This isn't only field intuition, either. In 2025, a group at Berkeley published a study called "Why Do Multi-Agent LLM Systems Fail?" that went through failure traces from several well-known frameworks and grouped what went wrong. What stood out was where the failures landed. They mostly weren't about the model being too weak. They were about design, with agents given vague roles or ignoring the roles they had; about coordination, with one agent sitting on information another needed or a conversation getting reset partway through; and about verification, with work marked finished that nobody really checked, or a run quitting too early. Those are the same three places this whole architecture tries to shore up, with clear roles, written handoffs, and validators that don't simply take an agent at its word. There's also hard evidence that giving each worker fresh context is more than tidiness. The "lost in the middle" research found that models pay the most attention to the start and end of their context and the least to whatever sits in the middle. Later work on "context rot" found accuracy slipping as the input gets longer, even on simple lookups. A worker drowning in a long accumulated history is a real, measured liability, not a theoretical one, and handing each worker a clean slate keeps the model working in the range where it's actually reliable. The Bill Comes Due It's easy to underestimate what these systems cost. More agents running for longer means a lot more tokens. Anthropic reported that a single agent already burns through several times the tokens of an ordinary chat, and a multi-agent system can use roughly an order of magnitude more on top of that. That only pencils out on work that's worth the spend. Running a multi-agent system to fix a typo is just an expensive way to fix a typo. A couple of things keep it in check. One is prompt caching. A long run reads the same stable context over and over, the system prompt, the codebase, the plan, and caching that material so it isn't reprocessed every time cuts the bill sharply, which is why anyone running these in production leans on it. The other is the serial discipline from earlier: every conflict you don't create is a repair cycle you don't pay for, and repair cycles are where a lot of tokens quietly disappear. How much these systems cost is mostly a design question, not a billing one. A Bigger Attack Surface Security rarely shows up on the architecture diagram, and every agent you add is another door. Even a single agent has a well-known soft spot in prompt injection, where instructions tucked into a web page or a file or a tool's output get read as commands rather than data. Add more agents and the problem grows. A poisoned document that one worker reads can smuggle instructions through a handoff into another worker with more access, or one that touches production directly. The shared state and the messages agents pass around become a channel an attacker can aim at on purpose. This is the kind of thing you build in from the start, because it's painful to bolt on later. The same controls that keep these systems correct also keep them safer. Validators that won't take an agent's own word for it, handoffs that record exactly which commands ran and what came back, limits on what any single worker is allowed to reach, all of that doubles as containment, so one compromised step can't quietly become a compromised system. The audit trail that helps a run recover from its own mistakes is the same one you'll be glad to have when something goes wrong on purpose. Where This Leaves the Engineer None of this puts engineers out of work. It moves the work up a level. Instead of hand-driving every step of an implementation, you spend your time deciding what should get built, what the real constraints are, what counts as correct, which parts of the architecture are worth protecting, and when a human has to sign off. It feels more like running a delivery operation than like chatting with a bot. And the biggest gain usually isn't speed. It's keeping several streams of work moving at once without quality slipping, and often ending up with a codebase in better shape than when you started, since the tests and checks and handoffs all become part of what ships. The real skill is knowing when to reach for any of this. For a small, contained change, one good agent on a single thread is simpler and cheaper and less likely to wander off. For serious delivery at scale, you need the planning and checking and recovery that a team provides, and the only way agents can do that work is inside the same kind of structure a team uses: real roles, a shared definition of done agreed before anyone starts, honest handoffs, shared state, and execution kept under control rather than just turned up to full speed.

By Jithu Paulose

Dead Letter Queue Patterns in Apache Flink: Handling Poison Messages Without Stopping Your Stream

Streaming systems usually fail in one of two ways: Loudly, when infrastructure breaksQuietly, when one bad record keeps replaying until the pipeline is effectively dead The second failure mode is more dangerous because it often starts with something small: malformed JSON, an unexpected schema change, a missing required field, or a downstream timeout that was never handled correctly. In Apache Flink, one unhandled exception can trigger a restart. If the same poison message is still sitting in Kafka after recovery, the job reads it again, fails again, restarts again, and enters a loop. At that point, the pipeline is technically "recovering," but operationally it is down. This is exactly why production Flink jobs need a Dead Letter Queue (DLQ) strategy from day one. A proper DLQ pattern does three things: Isolates bad records so they do not stop good onesCaptures enough failure context to debug the issue laterPreserves replayability so quarantined records can be reprocessed after the root cause is fixed Anything less is not really a DLQ. It is either silent data loss or delayed outage. In this article, I will walk through the most practical DLQ patterns for Apache Flink 1.18: Side outputs as the core DLQ primitiveRetry with exponential backoff for transient failuresTiered DLQ routing by error classKafka and S3 sink patternsMetrics and alertingReplay with a dedicated reprocessing jobA PyFlink version of the side output pattern The goal is simple: a bad message should never silently disappear, and it should never silently stop the stream. Why Poison Messages Break Otherwise Healthy Pipelines A poison message is any record that consistently fails processing. Typical examples include: Malformed JSONIncompatible schema versionsMissing required fieldsInvalid business valuesRecords that trigger unexpected code pathsMessages that repeatedly fail downstream enrichment calls Without DLQ handling, the failure path usually looks like this: The record enters the pipelineDeserialization or validation throws an exceptionThe operator failsFlink restarts from the last checkpointThe same record is consumed againThe same exception happens again That loop can continue indefinitely. The result is predictable: Throughput drops to zeroDownstream consumers starveCheckpoint recovery does not helpOn-call engineers get paged for a problem caused by one record This is why DLQ handling is not just an error-handling convenience. It is a core reliability pattern. What a DLQ Should Look Like in Flink In a streaming architecture, a DLQ is a durable destination for records that could not be processed successfully. For Flink, that means the DLQ record should usually include: Raw payloadError typeError messageStack trace or summarized failure contextFailure timestampSource metadata such as topic, partition, or offset when available That information matters because a DLQ is only useful if someone can answer two questions later: Why did this record fail?How do I replay it safely once the issue is fixed? If you only log the exception, you lose replayability. If you only store the payload, you lose debugging context. If you drop the record entirely, you lose both. So the design target is not "catch exceptions." The design target is durable, observable, replayable failure handling. Pattern 1: Use Side Outputs as the Core DLQ Primitive The most natural DLQ mechanism in Flink is the side output. A side output allows one operator to emit records to multiple streams: The main stream for successful recordsOne or more side streams for failures, late data, or quarantined records That makes it the right primitive for DLQ routing. Define the DLQ Envelope and Output Tag Java import org.apache.flink.util.OutputTag; import org.apache.flink.streaming.api.functions.ProcessFunction; import org.apache.flink.util.Collector; public static final OutputTag<DeadLetterRecord> DLQ_TAG = new OutputTag<DeadLetterRecord>("dead-letter-queue") {}; public record DeadLetterRecord( String rawPayload, String errorType, String errorMessage, String stackTrace, long failedAtEpochMs, String sourceTopicPartition, long sourceOffset ) {} The important point here is that the DLQ record is not just the failed payload. It is an envelope that preserves enough context for triage and replay. Route Failures Inside a ProcessFunction Java public class EntityEventProcessor extends ProcessFunction<String, EntityEvent> { @Override public void processElement( String rawMessage, Context ctx, Collector<EntityEvent> out) { try { EntityEvent event = parseAndValidate(rawMessage); out.collect(event); } catch (JsonParseException e) { ctx.output(DLQ_TAG, new DeadLetterRecord( rawMessage, "JSON_PARSE_FAILURE", e.getMessage(), getStackTrace(e), System.currentTimeMillis(), ctx.element().toString(), -1L )); } catch (SchemaValidationException e) { ctx.output(DLQ_TAG, new DeadLetterRecord( rawMessage, "SCHEMA_VALIDATION_FAILURE", e.getMessage(), getStackTrace(e), System.currentTimeMillis(), ctx.element().toString(), -1L )); } catch (Exception e) { ctx.output(DLQ_TAG, new DeadLetterRecord( rawMessage, "UNKNOWN_FAILURE", e.getMessage(), getStackTrace(e), System.currentTimeMillis(), ctx.element().toString(), -1L )); } } private EntityEvent parseAndValidate(String raw) throws JsonParseException, SchemaValidationException { EntityEvent event = objectMapper.readValue(raw, EntityEvent.class); if (event.entityId() == null || event.entityId().isBlank()) { throw new SchemaValidationException("entityId is required"); } if (event.timestamp() <= 0) { throw new SchemaValidationException("timestamp must be positive"); } return event; } } This is the minimum viable DLQ pattern, and it already solves the most important operational problem: bad records no longer stop good ones. Wire the Main Stream and DLQ Stream Java StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); DataStream<String> kafkaSource = env .fromSource(buildKafkaSource(), WatermarkStrategy.noWatermarks(), "entity-events-source"); SingleOutputStreamOperator<EntityEvent> processed = kafkaSource.process(new EntityEventProcessor()); DataStream<EntityEvent> goodEvents = processed; DataStream<DeadLetterRecord> deadLetters = processed.getSideOutput(DLQ_TAG); goodEvents.sinkTo(buildDownstreamKafkaSink()); deadLetters.sinkTo(buildDlqKafkaSink()); env.execute("Entity Resolution Pipeline"); If you do nothing else, do this. Side outputs should be the default DLQ foundation in Flink. Pattern 2: Retry Transient Failures Before Escalating to DLQ Not every failure belongs in the DLQ immediately. Some failures are transient: A downstream service is temporarily unavailableA database call times outAn external API is rate-limitedA network dependency is briefly unstable If you send all of those directly to the DLQ, you create noise and bury the truly bad records. The better pattern is: Retry transient failures a limited number of timesUse exponential backoffEscalate to DLQ only after retries are exhausted Retry With KeyedProcessFunction and Timers Java public class RetryingEnrichmentProcessor extends KeyedProcessFunction<String, EntityEvent, EnrichedEvent> { private static final int MAX_RETRIES = 3; private static final long BASE_BACKOFF_MS = 500L; private transient ValueState<Integer> retryCountState; private transient ValueState<EntityEvent> pendingEventState; @Override public void open(Configuration parameters) { retryCountState = getRuntimeContext().getState( new ValueStateDescriptor<>("retry-count", Integer.class)); pendingEventState = getRuntimeContext().getState( new ValueStateDescriptor<>("pending-event", EntityEvent.class)); } @Override public void processElement( EntityEvent event, Context ctx, Collector<EnrichedEvent> out) throws Exception { try { EnrichedEvent enriched = callEnrichmentService(event); retryCountState.clear(); pendingEventState.clear(); out.collect(enriched); } catch (TransientServiceException e) { int retries = retryCountState.value() == null ? 0 : retryCountState.value(); if (retries >= MAX_RETRIES) { retryCountState.clear(); pendingEventState.clear(); ctx.output(DLQ_TAG, new DeadLetterRecord( event.toString(), "MAX_RETRIES_EXCEEDED", "Failed after " + MAX_RETRIES + " retries: " + e.getMessage(), getStackTrace(e), System.currentTimeMillis(), ctx.getCurrentKey(), -1L )); } else { retryCountState.update(retries + 1); pendingEventState.update(event); long backoffMs = BASE_BACKOFF_MS * (long) Math.pow(2, retries); ctx.timerService().registerProcessingTimeTimer( System.currentTimeMillis() + backoffMs ); } } catch (PoisonMessageException e) { ctx.output(DLQ_TAG, new DeadLetterRecord( event.toString(), "POISON_MESSAGE", e.getMessage(), getStackTrace(e), System.currentTimeMillis(), ctx.getCurrentKey(), -1L )); } } @Override public void onTimer( long timestamp, OnTimerContext ctx, Collector<EnrichedEvent> out) throws Exception { EntityEvent pending = pendingEventState.value(); if (pending == null) return; try { EnrichedEvent enriched = callEnrichmentService(pending); retryCountState.clear(); pendingEventState.clear(); out.collect(enriched); } catch (TransientServiceException e) { int retries = retryCountState.value(); if (retries >= MAX_RETRIES) { retryCountState.clear(); pendingEventState.clear(); ctx.output(DLQ_TAG, new DeadLetterRecord( pending.toString(), "MAX_RETRIES_EXCEEDED", "Timer retry exhausted: " + e.getMessage(), getStackTrace(e), System.currentTimeMillis(), ctx.getCurrentKey(), -1L )); } else { retryCountState.update(retries + 1); long backoffMs = BASE_BACKOFF_MS * (long) Math.pow(2, retries); ctx.timerService().registerProcessingTimeTimer( timestamp + backoffMs ); } } } } Why This Works Especially Well in Flink This pattern is stronger in Flink than in many other stream processors because timers and state are checkpointed. That means: Retry counters survive restartsPending events survive restartsScheduled retries resume after recovery In other words, the retry workflow itself is fault-tolerant. That is exactly what you want when handling transient failures in a long-running stream. Pattern 3: Split the DLQ by Failure Type Once a pipeline matures, a single DLQ topic usually becomes too coarse. Schema failures, business validation failures, exhausted retries, and unknown exceptions all end up mixed together. That makes triage slower and replay harder. A better pattern is to classify failures and route them to separate DLQ streams. Define Failure Tiers Java public enum DlqTier { TRANSIENT_EXHAUSTED, SCHEMA_INVALID, BUSINESS_RULE, UNKNOWN } Route by Exception Class Java public class TieredDlqRouter extends ProcessFunction<String, EntityEvent> { @Override public void processElement( String raw, Context ctx, Collector<EntityEvent> out) { try { EntityEvent event = parse(raw); validate(event); out.collect(event); } catch (JsonParseException | MappingException e) { route(ctx, raw, DlqTier.SCHEMA_INVALID, e); } catch (BusinessValidationException e) { route(ctx, raw, DlqTier.BUSINESS_RULE, e); } catch (Exception e) { route(ctx, raw, DlqTier.UNKNOWN, e); } } private void route(Context ctx, String raw, DlqTier tier, Exception e) { OutputTag<DeadLetterRecord> tag = getTierTag(tier); ctx.output(tag, new DeadLetterRecord( raw, tier.name(), e.getMessage(), getStackTrace(e), System.currentTimeMillis(), "", -1L )); } } Define One Output Tag Per Tier Java public static final OutputTag<DeadLetterRecord> DLQ_SCHEMA = new OutputTag<>("dlq-schema-invalid") {}; public static final OutputTag<DeadLetterRecord> DLQ_BUSINESS = new OutputTag<>("dlq-business-rule") {}; public static final OutputTag<DeadLetterRecord> DLQ_UNKNOWN = new OutputTag<>("dlq-unknown") {}; Sink Each Tier Independently Java SingleOutputStreamOperator<EntityEvent> processed = kafkaSource.process(new TieredDlqRouter()); processed.getSideOutput(DLQ_SCHEMA) .sinkTo(buildKafkaSink("dlq.schema-invalid")); processed.getSideOutput(DLQ_BUSINESS) .sinkTo(buildKafkaSink("dlq.business-rule")); processed.getSideOutput(DLQ_UNKNOWN) .sinkTo(buildKafkaSink("dlq.unknown")); This makes the DLQ operationally useful instead of just technically correct. For example: Schema failures can be routed to the producer teamBusiness rule failures can feed data quality workflowsUnknown failures can trigger higher-severity alerting Pattern 4: Choose DLQ Sinks Based on How You Plan To Recover Once records are routed to a DLQ stream, they need a durable destination. In practice, the two most common choices are Kafka and object storage. Kafka DLQ Sink Kafka is the right choice when you want: Near-real-time inspectionStreaming replayOperational integration with existing consumers Java private static KafkaSink<DeadLetterRecord> buildDlqKafkaSink( String topicName) { return KafkaSink.<DeadLetterRecord>builder() .setBootstrapServers("kafka-broker:9092") .setRecordSerializer( KafkaRecordSerializationSchema.builder() .setTopic(topicName) .setValueSerializationSchema( new JsonSerializationSchema<>(DeadLetterRecord.class)) .setKeySerializationSchema( record -> record.errorType().getBytes()) .build() ) .setDeliveryGuarantee(DeliveryGuarantee.AT_LEAST_ONCE) .build(); } S3 DLQ Sink Object storage is the better choice when you want: Long retentionLow-cost quarantineBatch replay with Spark or AthenaPartitioned storage by date or error type Java private static FileSink<DeadLetterRecord> buildS3DlqSink() { return FileSink .forRowFormat( new Path("s3://your-bucket/dlq/entity-resolution/"), new JsonRowEncoder<>(DeadLetterRecord.class) ) .withRollingPolicy( DefaultRollingPolicy.builder() .withRolloverInterval(Duration.ofMinutes(15)) .withInactivityInterval(Duration.ofMinutes(5)) .withMaxPartSize(MemorySize.ofMebiBytes(128)) .build() ) .withBucketAssigner( new DateTimeBucketAssigner<>( "error-type='unknown'/year=yyyy/month=MM/day=dd/hour=HH") ) .build(); } A practical production pattern is to use: Kafka for short-term operational handlingS3 for long-term quarantine and replay That gives you both fast response and durable history. Pattern 5: Monitor DLQ Rate, Not Just Job Uptime A DLQ that nobody watches is just a backlog with better branding. Job uptime alone is not enough. A Flink job can stay green while quietly routing 10% of traffic to the DLQ. That is still a production incident. Add Metrics Inside the Operator Java public class MonitoredEntityEventProcessor extends ProcessFunction<String, EntityEvent> { private transient Counter dlqCounter; private transient Counter successCounter; private transient Histogram processingLatency; @Override public void open(Configuration parameters) { MetricGroup metrics = getRuntimeContext() .getMetricGroup() .addGroup("entity_resolution"); dlqCounter = metrics.counter("dlq_routed_total"); successCounter = metrics.counter("processed_success_total"); processingLatency = metrics.histogram( "processing_latency_ms", new DescriptiveStatisticsHistogram(1000) ); } @Override public void processElement( String raw, Context ctx, Collector<EntityEvent> out) { long start = System.currentTimeMillis(); try { EntityEvent event = parseAndValidate(raw); successCounter.inc(); out.collect(event); } catch (Exception e) { dlqCounter.inc(); ctx.output(DLQ_TAG, buildDeadLetter(raw, e)); } finally { processingLatency.update(System.currentTimeMillis() - start); } } } Alert on DLQ Rate A useful alert is DLQ throughput relative to successful throughput: YAML - alert: FlinkDlqRateHigh expr: | rate(flink_entity_resolution_dlq_routed_total[5m]) / rate(flink_entity_resolution_processed_success_total[5m]) > 0.01 for: 2m labels: severity: warning annotations: summary: "DLQ rate exceeds 1% of total throughput" description: "Check dlq.unknown Kafka topic for upstream schema changes" As a rule of thumb: above 1% often indicates schema drift or producer issuesabove 5% usually indicates a broader systemic problem The exact thresholds depend on the pipeline, but the principle does not: monitor DLQ rate as a first-class health signal. Pattern 6: Replay With a Dedicated Reprocessing Job A DLQ is only complete when replay is possible. The cleanest design is a separate Flink job that reads from the DLQ topic and routes records back through the main processing logic. Example Replay Job Java public class DlqReprocessingJob { public static void main(String[] args) throws Exception { StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); DataStream<DeadLetterRecord> dlqStream = env .fromSource( buildKafkaSource("dlq.schema-invalid"), WatermarkStrategy.noWatermarks(), "dlq-source" ); DataStream<String> replayStream = dlqStream .filter(r -> r.failedAtEpochMs() >= START_EPOCH && r.failedAtEpochMs() <= END_EPOCH) .map(DeadLetterRecord::rawPayload); SingleOutputStreamOperator<EntityEvent> reprocessed = replayStream.process(new EntityEventProcessor()); reprocessed.sinkTo(buildDownstreamKafkaSink()); reprocessed.getSideOutput(DLQ_TAG) .sinkTo(buildKafkaSink("dlq.permanent-quarantine")); env.execute("DLQ Reprocessing Job"); } } Why Replay Should Be a Separate Job Keeping replay separate from the main pipeline gives you: Independent scalingIndependent schedulingCleaner checkpoint behaviorSafer operational control It also lets you drain backlogs on your own terms: Off-peak hoursReduced parallelismOr maximum parallelism when you need to catch up quickly That separation keeps the main pipeline stable while still making recovery practical. PyFlink Version: Same Pattern, Same Principle If your team uses PyFlink, the same side output pattern applies. Python from pyflink.datastream import StreamExecutionEnvironment from pyflink.datastream.functions import ProcessFunction from pyflink.common.typeinfo import Types from pyflink.datastream.output_tag import OutputTag DLQ_TAG = OutputTag( "dead-letter-queue", Types.ROW_NAMED( ["raw_payload", "error_type", "error_message", "failed_at_ms"], [Types.STRING(), Types.STRING(), Types.STRING(), Types.LONG()] ) ) class EntityEventProcessor(ProcessFunction): def process_element(self, value, ctx): try: event = parse_and_validate(value) yield event except Exception as e: from pyflink.common import Row yield DLQ_TAG, Row( raw_payload=str(value), error_type=type(e).__name__, error_message=str(e), failed_at_ms=int(time.time() * 1000) ) env = StreamExecutionEnvironment.get_execution_environment() source_stream = env.from_source(...) processed = source_stream.process( EntityEventProcessor(), output_type=Types.STRING() ) good_events = processed dead_letters = processed.get_side_output(DLQ_TAG) good_events.sink_to(build_downstream_sink()) dead_letters.sink_to(build_dlq_sink()) env.execute("Entity Resolution Pipeline") The syntax changes, but the design principle stays the same: good records continue, bad records are isolated and persisted. Production Checklist Before shipping a Flink pipeline, verify the following: RequirementWhy It MattersRisky operators wrapped in try/catchPrevents restart loops from unhandled exceptionsDLQ output tags use explicit typingAvoids runtime serialization failuresDLQ sink is durableFailed records must survive restartsDLQ metrics are exportedSilent DLQ growth is otherwise invisibleReplay path exists and is testedA DLQ without replay is just storageDLQ retention is long enoughTeams need time to diagnose and replayPermanent quarantine existsPrevents infinite replay loopsAlerting is based on DLQ rateJob health alone is not enough This checklist is worth automating in code review or deployment readiness checks. DLQ handling is too important to leave to convention. Key Takeaways If you are building Flink pipelines in production, the safest default is: Use side outputs for DLQ routingRetry transient failures before escalationClassify failures into separate DLQ streamsSink DLQ records durablyExport DLQ metricsReplay through a dedicated job The core rule is simple: A bad message should never silently disappear, and it should never silently stop the stream. That is what turns DLQ handling from a defensive coding trick into a real reliability pattern. Environment Notes The examples in this article target: Apache Flink 1.18Java 17PyFlink 1.18 A few implementation notes: The retry timer pattern requires a keyed stream before KeyedProcessFunctionRocksDB is usually the safer state backend for larger retry stateHashMap state backend can work well for smaller, latency-sensitive workloadsAT_LEAST_ONCE is usually sufficient for DLQ sinks Final Thoughts Poison messages are not rare in streaming systems. They are inevitable. The real question is whether one bad record can take down an otherwise healthy pipeline. With the right DLQ design in Flink, the answer becomes no. The stream keeps moving. Good records continue. Bad records are quarantined. Alerts fire. Replay remains possible. And the pipeline stays operational while the root cause is fixed. That is the difference between a stream that works in staging and one that survives production.

By Rohit Muthyala

Apache Spark Query Optimization on Databricks: Catalyst, AQE, and Photon Engine

Why Query Optimization Matters A Spark query written by a human and a Spark query executed by the engine are often very different things. The gap between them — the optimization — is what separates a job that runs in 3 minutes from one that runs in 3 hours on identical hardware. Databricks compounds Spark's native Catalyst optimizer with two additional layers: Adaptive Query Execution (AQE) – re-optimizes the query at runtime using actual statistics collected mid-jobPhoton – a C++ vectorized execution engine that replaces the JVM-based Spark executor for eligible operators Understanding all three lets you write queries that cooperate with the engine rather than fight it. The Catalyst Optimizer Pipeline Catalyst is Spark's rule-based and cost-based query optimizer. Every query — whether written in SQL, DataFrame API, or Dataset API — passes through the same four-stage pipeline before a single byte of data is read. Stage 1: Parsing — From SQL to Unresolved Logical Plan Python # ── Catalyst Stage 1: Parsing ───────────────────────────────────────────────── # Spark uses ANTLR4 to parse SQL into an Abstract Syntax Tree (AST). # At this point column names are NOT validated — the plan is "unresolved". from pyspark.sql import SparkSession spark = SparkSession.builder.appName("catalyst-demo").getOrCreate() # Both of these produce identical internal representations df_api = ( spark.table("prod.silver.events_clean") .filter("event_type = 'purchase'") .groupBy("platform") .agg({"revenue": "sum"}) ) sql_api = spark.sql(""" SELECT platform, SUM(revenue) AS total_revenue FROM prod.silver.events_clean WHERE event_type = 'purchase' GROUP BY platform """) # Inspect the unresolved logical plan (before analysis) df_api.explain(mode="formatted") # Output includes: # == Parsed Logical Plan == # 'Aggregate ['platform], ['platform, unresolvedAlias('sum('revenue), None)] # +- 'Filter ('event_type = 'purchase) # +- 'UnresolvedRelation [prod, silver, events_clean] The key insight here: UnresolvedRelation and unresolvedAlias mean Spark hasn't touched the catalog yet. Column names could be typos at this point and Catalyst doesn't know. Stage 2: Analysis — Binding to the Catalog The Analyzer walks the unresolved AST and looks up every relation and attribute against the Catalog (in Databricks, this is Unity Catalog). It resolves column names, infers data types, validates references, and binds functions. Python # ── Catalyst Stage 2: Analysis ──────────────────────────────────────────────── # After analysis, every column is resolved to a specific attribute with a type. # AnalysisException is thrown HERE if a column doesn't exist. from pyspark.sql import functions as F from pyspark.sql.utils import AnalysisException # Example of what Analysis catches: try: spark.table("prod.silver.events_clean") \ .select("nonexistent_column") \ .show() except AnalysisException as e: print(f"Analysis failed: {e}") # → AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] # A column or function parameter with name `nonexistent_column` cannot be resolved. # After successful analysis, inspect the resolved plan df = ( spark.table("prod.silver.events_clean") .filter(F.col("event_type") == "purchase") .select("platform", "revenue", "user_id") ) # The analyzed plan shows fully qualified attribute IDs like: # == Analyzed Logical Plan == # platform: string, revenue: double, user_id: string # Project [platform#42, revenue#67, user_id#31] # +- Filter (event_type#39 = purchase) # +- Relation prod.silver.events_clean[...] parquet print(df._jdf.queryExecution().analyzed()) Stage 3: Logical Optimization — Rule-Based Rewrites This is where Catalyst applies its ~100+ built-in rules to produce an equivalent but cheaper logical plan. Rules fire repeatedly in fixed-point iteration until the plan stabilises. Python # ── Catalyst Stage 3: Key Optimization Rules ────────────────────────────────── # RULE 1: Predicate Pushdown # Catalyst moves filters as close to the data source as possible, # so Spark reads fewer rows from Parquet. df_before = ( spark.table("prod.silver.events_clean") .join( spark.table("prod.silver.users_clean"), on="user_id" ) .filter(F.col("event_type") == "purchase") # ← filter AFTER join ) # Catalyst rewrites this internally as if you wrote: df_after_equivalent = ( spark.table("prod.silver.events_clean") .filter(F.col("event_type") == "purchase") # ← filter BEFORE join .join( spark.table("prod.silver.users_clean"), on="user_id" ) ) # Result: potentially millions fewer rows shuffled during the join # RULE 2: Column Pruning # Catalyst removes columns not needed by downstream operators. # Even if you select(*), Spark will only read the columns it needs. df_pruned = ( spark.table("prod.silver.events_clean") .select("*") .filter(F.col("event_type") == "purchase") .groupBy("platform") .agg(F.sum("revenue").alias("total_revenue")) ) # Internally, Catalyst prunes all columns except: event_type, platform, revenue # RULE 3: Constant Folding # Expressions with only literals are evaluated at plan time, not per-row. df_constants = spark.range(1000).select( F.lit(2 + 3 * 4).alias("always_14"), # folded to Literal(14) at plan time F.col("id") * F.lit(1).alias("same_id"), # simplified to just col("id") ) # RULE 4: Boolean Simplification # AND/OR chains with tautologies or contradictions are collapsed df_simplified = spark.range(100).filter( (F.col("id") > 10) & F.lit(True) # simplified to just (col("id") > 10) ) # See all optimizations applied: print(df_pruned._jdf.queryExecution().optimizedPlan()) Stage 4: Physical Planning — Strategies and Cost Models The physical planner maps each logical operator to one or more physical implementations and selects the best one using a cost model. The most impactful decision here is join strategy selection. Python # ── Catalyst Stage 4: Physical Planning & Join Strategies ──────────────────── # JOIN STRATEGY 1: Broadcast Hash Join (BHJ) # Best when one side is small enough to fit in executor memory. # No shuffle — the small table is broadcast to all workers. spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10mb") # default large_df = spark.table("prod.silver.events_clean") # 500GB small_df = spark.table("prod.gold.product_catalog") # 8MB ← will be broadcast result_bhj = large_df.join(small_df, on="product_id") # BHJ auto-selected # Force BHJ with a broadcast hint (overrides threshold check): from pyspark.sql.functions import broadcast result_forced = large_df.join(broadcast(small_df), on="product_id") # JOIN STRATEGY 2: Sort Merge Join (SMJ) # Default for large-large joins. Both sides are sorted and merged. # Requires a full shuffle — expensive but handles any size. spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1") # disable BHJ large_df2 = spark.table("prod.silver.transactions_clean") # 200GB result_smj = large_df.join(large_df2, on="user_id") # SMJ selected # JOIN STRATEGY 3: Shuffle Hash Join (SHJ) # Hash-based, no sort. Chosen by AQE when one side is much smaller # than the other but still above the broadcast threshold. spark.conf.set("spark.sql.join.preferSortMergeJoin", "false") # WHOLE-STAGE CODEGEN: Spark fuses multiple operators into a single # Java function to avoid virtual dispatch overhead and intermediate objects. # Verify it's active in your plan: spark.conf.set("spark.sql.codegen.wholeStage", "true") # default result_bhj.explain(mode="formatted") # Look for: *(1) BroadcastHashJoin — the *(N) prefix = WholeStageCodegen stage N Adaptive Query Execution (AQE) AQE is Databricks' most impactful runtime optimization layer. It materializes shuffle map output statistics at shuffle boundaries and uses them to make three key decisions after data has been partially processed. Python # ── AQE Configuration ───────────────────────────────────────────────────────── # AQE is ON by default in Databricks Runtime 7.3+ spark.conf.set("spark.sql.adaptive.enabled", "true") # 1. Dynamic Partition Coalescing # Merges small post-shuffle partitions to avoid thousands of tiny tasks spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true") spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128mb") spark.conf.set("spark.sql.adaptive.coalescePartitions.minPartitionNum", "1") # 2. Dynamic Join Strategy Switching # Allows AQE to downgrade SMJ → BHJ at runtime if a side turns out small spark.conf.set("spark.sql.adaptive.localShuffleReader.enabled", "true") # AQE broadcast threshold (can be higher than static threshold since # we now KNOW the actual size) spark.conf.set("spark.sql.adaptive.autoBroadcastJoinThreshold", "30mb") # 3. Skew Join Optimization # Splits oversized partitions and replicates the non-skewed side spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true") spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5") # 5x median spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256mb") # Verify AQE decisions in the query plan: df = ( spark.table("prod.silver.events_clean") .join(spark.table("prod.silver.users_clean"), on="user_id") .groupBy("platform") .agg(F.sum("revenue").alias("total")) ) df.explain(mode="formatted") # Look for: AdaptiveSparkPlan isFinalPlan=true # and: == Final Physical Plan == (shows post-AQE decisions) The Photon Engine Photon is Databricks' native vectorized query engine written in C++. It replaces the JVM-based Spark executor for eligible operations, processing data in column-oriented batches (vectors) rather than row-by-row. Python # ── Photon Configuration & Verification ─────────────────────────────────────── # Photon is available on Databricks Runtime 9.1+ with Photon-enabled clusters. # Enable it at the cluster level (UI: Cluster > Configuration > Enable Photon) # or via config: spark.conf.set("spark.databricks.photon.enabled", "true") # Photon-accelerated operators (as of DBR 13.x): # ✅ Scan (Parquet, Delta) ✅ Filter / Project # ✅ Hash Aggregate ✅ Sort # ✅ Broadcast Hash Join ✅ Sort Merge Join # ✅ Window functions ✅ Union / Expand # ✅ String functions ✅ Math functions # ❌ UDFs (Python/Scala) ❌ Some complex types # ❌ Streaming (partial) ❌ RDD-based operations # Verify Photon is executing your query: df = spark.sql(""" SELECT platform, DATE_TRUNC('month', event_ts) AS month, SUM(revenue) AS total_revenue, COUNT(DISTINCT user_id) AS unique_buyers, AVG(revenue) AS avg_order_value FROM prod.silver.events_clean WHERE event_type = 'purchase' AND event_ts >= '2024-01-01' GROUP BY platform, DATE_TRUNC('month', event_ts) ORDER BY month DESC, total_revenue DESC """) df.explain(mode="formatted") # Look for operators prefixed with "Photon" in the physical plan: # == Physical Plan == # PhotonResultStage # +- PhotonSort [month DESC NULLS LAST, total_revenue DESC NULLS LAST] # +- PhotonShuffleExchangeSink hashpartitioning(platform, month) # +- PhotonGroupingAgg [platform, month], [sum(revenue), count(user_id), avg(revenue)] # +- PhotonFilter (event_type = purchase AND event_ts >= 2024-01-01) # +- PhotonScan parquet prod.silver.events_clean # Photon performance metrics appear in Spark UI under "Photon Metrics": # - Photon scan time # - Photon total compute time # - Rows processed by Photon vs fallback JVM Reading Explain Plans The explain(mode="formatted") output is your primary debugging tool. Here's how to read it efficiently: Python # ── Explain Plan Modes ──────────────────────────────────────────────────────── df = ( spark.table("prod.silver.events_clean") .filter(F.col("event_type") == "purchase") .join(broadcast(spark.table("prod.gold.product_catalog")), on="product_id") .groupBy("platform", "category") .agg( F.sum("revenue").alias("total_revenue"), F.count("*").alias("transaction_count") ) ) # Mode 1: simple (default) — compact tree df.explain() # Mode 2: extended — all 4 plan stages side by side df.explain(mode="extended") # Mode 3: formatted — human-readable with operator details (RECOMMENDED) df.explain(mode="formatted") # Mode 4: cost — includes estimated row counts and sizes (requires ANALYZE TABLE) df.explain(mode="cost") # Mode 5: codegen — shows generated Java code for WholeStageCodegen df.explain(mode="codegen") # ── Key Signals to Look For ─────────────────────────────────────────────────── # ✅ GOOD signs: # *(N) prefix → WholeStageCodegen active (operators fused) # BroadcastHashJoin → small table correctly broadcast, no shuffle # PhotonXxx → Photon accelerating this operator # AdaptiveSparkPlan → AQE is engaged # PartitionFilters → Delta/Parquet file skipping active # PushedFilters → filters pushed to Parquet reader # ❌ WARNING signs: # Exchange (shuffle) → unexpected shuffle (missing broadcast hint?) # SortMergeJoin → large-large join (may need Z-ORDER or AQE tuning) # HashAggregate x2 → partial + final agg = shuffle involved # CartesianProduct → missing join condition! Will OOM on large tables # ObjectHashAggregate → non-codegen path, JVM overhead # GenerateXxx → explode() or similar, can't be fused # ── ANALYZE TABLE: feed statistics to CBO ───────────────────────────────────── # Without stats, Catalyst uses default estimates (1M rows, 8 bytes/col). # Run ANALYZE to give the Cost-Based Optimizer real numbers. spark.sql("ANALYZE TABLE prod.silver.events_clean COMPUTE STATISTICS") spark.sql(""" ANALYZE TABLE prod.silver.events_clean COMPUTE STATISTICS FOR COLUMNS user_id, event_type, platform, revenue """) # Now explain(mode="cost") shows real row counts and sizes Tuning Reference Table A quick-reference guide for the most impactful Spark/Databricks configs, what they control, and when to change them: Config KeyDefaultWhat It ControlsWhen to Tunespark.sql.adaptive.enabledtrueMaster AQE switchKeep on; only disable for debuggingspark.sql.adaptive.advisoryPartitionSizeInBytes64mbTarget post-coalesce partition sizeIncrease to 128mb–256mb for large shufflesspark.sql.adaptive.skewJoin.enabledtrueAQE skew splitKeep on; tune skewedPartitionFactor if neededspark.sql.autoBroadcastJoinThreshold10mbStatic BHJ thresholdIncrease to 50mb–100mb if executor memory allowsspark.sql.adaptive.autoBroadcastJoinThreshold30mbAQE runtime BHJ thresholdIncrease if AQE isn't catching small tablesspark.sql.shuffle.partitions200Default shuffle partition countSet to 8 × num_cores for your clusterspark.sql.files.maxPartitionBytes128mbMax bytes per Parquet read partitionReduce for high-parallelism scansspark.databricks.photon.enabledtruePhoton vectorized engineKeep on; disable only for UDF-heavy jobsspark.sql.codegen.wholeStagetrueWhole-Stage CodeGen fusionKeep on; disable only for debuggingspark.sql.statistics.histogram.enabledfalseColumn histograms for CBOEnable after running ANALYZE TABLEspark.sql.cbo.enabledtrueCost-Based OptimizerKeep on; requires ANALYZE TABLE to be usefulspark.databricks.delta.optimizeWrite.enabledtrueAuto bin-pack write filesKeep on for all Delta writes Key Takeaways Catalyst has four stages: Parse → Analyze → Optimize → Plan. Each stage has a distinct job, and understanding them tells you exactly where to look when a query misbehaves.Predicate pushdown and column pruning are the two most impactful automatic optimizations — they reduce the data volume Spark has to move before any aggregation or join.AQE is not a set-and-forget feature: tune advisoryPartitionSizeInBytes to your actual data sizes, and verify its decisions with explain(mode="formatted") — look for AdaptiveSparkPlan isFinalPlan=true.Photon drops in transparently for most SQL and DataFrame operations. The exceptions are Python UDFs, RDD operations, and some complex types — refactor these away from hot paths.Run ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS on your most-joined tables. The CBO's join ordering and strategy decisions improve dramatically with real statistics vs. default estimates.explain(mode="formatted") is your most important debugging tool — learn to read it before reaching for cluster config changes. References Apache Spark — Catalyst Optimizer (Deep Dive Paper, Armbrust et al., SIGMOD 2015)Databricks — Adaptive Query ExecutionApache Spark Docs — Adaptive Query ExecutionDatabricks — Photon RuntimeDatabricks Blog — Photon: A Fast Query Engine for Lakehouse SystemsDatabricks — Cost-Based OptimizerApache Spark — Performance Tuning GuideDatabricks — Broadcast Join Hints"Photon: A Fast Query Engine for Lakehouse Systems" (Behm et al., SIGMOD 2022)Spark by Examples — Explain Plan Modes

By Jubin Abhishek Soni

CORE

The Inter-Agent Protocol Problem

Every major agent framework now has a story for multi-agent systems. Most of them are incompatible with each other. An agent built in AutoGen cannot natively receive a task from a deepagents orchestrator. An OpenAI Agents SDK cannot talk to a LangGraph subgraph. A CrewAI crew cannot delegate to a Pydantic AI team without custom glue code. This is the inter-agent protocol problem: We have multi-agent frameworks, but no agreed-upon protocol for agents to communicate across frameworks. This post breaks down the four main approaches, compares them, and examines the need for standardization. Why Is Inter-Agent Communication Hard? Single-agent systems have a clean interface: you send a prompt, you get a response. Multi-agent systems need to express: Task delegation – what work is being handed off and to whomContext transfer – what is the background the receiving agent needsState propagation – what the delegating agent needs backError and cancellation – what happens when the receiving agent failsStreaming – how partial results flow back during long tasks Most frameworks solve some of these, but with different schemas, different transport assumptions, and no interoperability. The 4 Current Approaches 1. ACP (Agent Communication Protocol): deepagents Installation: Shell pip install deepagents-acp # Spec: https://github.com/langchain-ai/deepagents/tree/main/libs/acp # deepagents-acp version: 0.0.6 (requires agent-client-protocol>=0.8.0) ACP is an open protocol: a published JSON schema for agent-to-agent communication. Langchain's deepagents ships deepagents-acp, a Python client and server implementation, as a separate package, so any framework can implement it. ACP Message Flow A deepagents orchestrator routing a task to an ACP-compatible worker: Python from deepagents import create_deep_agent from deepagents.middleware import AsyncSubAgentMiddleware, AsyncSubAgent # Declare an ACP-compatible worker agent research_worker = AsyncSubAgent( name="research", description="Deep research agent — use for tasks requiring web search and synthesis", url="http://research-agent:8080", # ACP server endpoint ) # The orchestrator routes tasks to workers via the ACP protocol automatically orchestrator = create_deep_agent( model="anthropic:claude-sonnet-4-6", middleware=[ AsyncSubAgentMiddleware(subagents=[research_worker]), ], ) Serving a deepagents agent as an ACP endpoint: Python import asyncio from acp import run_agent as run_acp_agent from deepagents import create_deep_agent from deepagents_acp.server import AgentServerACP agent = create_deep_agent(model="anthropic:claude-haiku-4-5") # Wrap the compiled graph and expose it over ACP acp_agent = AgentServerACP(agent=agent) asyncio.run(run_acp_agent(acp_agent)) What makes ACP different: It's an open schema, not a framework-internal call. Any framework can implement an ACP server or client, which means a CrewAI crew could delegate to a deepagents worker over ACP without any shared code. Current limitation: Adoption is early. As of v0.9.0, the primary implementations are deepagents-native. Support for other frameworks requires each to implement the server interface independently. 2. Handoffs: OpenAI Agents SDK Installation: Shell pip install openai-agents # Docs: https://openai.github.io/openai-agents-python/handoffs/ OpenAI's SDK handles delegation through handoffs: an agent declares which other agents it can hand off to. At runtime, the orchestrator agent decides when to delegate and execution transfers. Python from agents import Agent, Runner, handoff research_agent = Agent( name="ResearchAgent", instructions="You are an expert at web research. Answer research questions thoroughly.", tools=[web_search_tool], ) code_agent = Agent( name="CodeAgent", instructions="You write and review Python code.", tools=[run_code_tool], ) orchestrator = Agent( name="Orchestrator", instructions=( "Route tasks to the right specialist. " "Use ResearchAgent for questions requiring web search. " "Use CodeAgent for programming tasks." ), handoffs=[ handoff(research_agent), handoff(code_agent, tool_name_override="delegate_to_coder"), ], ) result = await Runner.run(orchestrator, "Write a Python script to fetch weather data") You can also add context and filters to handoffs: Python from agents import handoff, RunContextWrapper def on_handoff_to_research(ctx: RunContextWrapper, input_data: str) -> None: print(f"Handing off to research agent with: {input_data}") research_handoff = handoff( research_agent, on_handoff=on_handoff_to_research, input_filter=lambda inp: inp, # transform input before handoff ) Limitation: Handoffs are framework-internal. A handoff target must be an Agent instance from the same SDK. There's no published schema, no HTTP transport, no cross-framework interoperability. If your orchestrator is a LangGraph graph and your specialist is an OpenAI agent, handoffs can't bridge them. 3. Agent Delegation via Tools: Pydantic AI Installation: Shell pip install pydantic-ai # Docs: https://pydantic.dev/docs/ai/guides/multi-agent-applications/ Pydantic AI does not have a dedicated inter-agent messaging primitive. The idiomatic multi-agent pattern is agent delegation via tools: a specialist agent is called from within a tool of the orchestrator agent, and the result is returned to the orchestrator like any other tool return value. Python from pydantic_ai import Agent, RunContext research_agent = Agent( "anthropic:claude-haiku-4-5", system_prompt="You are a research specialist. Answer questions with citations.", ) code_agent = Agent( "anthropic:claude-sonnet-4-6", system_prompt="You are a Python expert. Write clean, tested code.", ) orchestrator = Agent( "anthropic:claude-sonnet-4-6", system_prompt="Route tasks to the right specialist tool.", ) @orchestrator.tool async def research(ctx: RunContext[None], query: str) -> str: """Delegate a research question to the research specialist.""" result = await research_agent.run(query, usage=ctx.usage) return result.output @orchestrator.tool async def write_code(ctx: RunContext[None], task: str) -> str: """Delegate a coding task to the code specialist.""" result = await code_agent.run(task, usage=ctx.usage) return result.output result = await orchestrator.run( "Research quantum computing and write a Python simulation" ) Passing usage=ctx.usage propagates token accounting from the delegate run back to the parent, so result.usage() that the orchestrator covers all sub-agent calls. Limitation: Delegation is framework-internal: all agents must be Pydantic AI Agent instances. There is no pub-sub bus, no HTTP transport, and no cross-framework protocol. For true cross-framework delegation, a pattern like ACP is required. 4. AgentProtocol Backbone: Agno Installation: Shell pip install agno # Docs: https://docs.agno.com/introduction Agno takes the most ambitious approach: an AgentProtocol backbone that acts as a multi-framework adapter. The goal is to let agents from different frameworks (LangGraph, DSPy, Claude's SDK) plug into the same Agno team: Python from agno.agent import Agent from agno.team import Team from agno.models.anthropic import Claude # Native Agno agent research_agent = Agent( name="Researcher", model=Claude(id="claude-haiku-4-5"), tools=[web_search_tool], description="Specializes in web research and synthesis.", ) # Native Agno agent code_agent = Agent( name="Coder", model=Claude(id="claude-sonnet-4-6"), tools=[python_repl_tool], description="Specializes in Python development.", ) team = Team( name="FullStackTeam", mode="coordinate", members=[research_agent, code_agent], model=Claude(id="claude-sonnet-4-6"), db=agno_db_storage, # sessions + memory persistence enable_agentic_state=True, ) team.print_response( "Research quantum computing trends and prototype a simulation in Python" ) Limitation: The multi-framework adapter is still maturing. Most production Agno deployments use native Agno agents. The cross-framework vision is real, but the published adapter surface for LangGraph and DSPy is sparse at the time of writing. Side-by-Side Comparison DimensionACP (deepagents)Handoffs (OpenAI)Agent delegation (Pydantic AI)AgentProtocol (Agno)Published open schema✓✗✗PartialHTTP transport✓✗✗✓Streaming support✓ (SSE)✓ (run-level)✗✓Cross-framework workers✓ (any ACP server)✗✗PartialContext/metadata passing✓✓ (input_filter)✓ (usage propagation)✓Error/cancellation schema✓Partial✗✓Built-in state persistence✗ (LangGraph handles it)✗✗✓ (Agno DB)Production deploymentsEarlyGrowingMatureGrowing The Fragmentation Cost in Practice Consider this real scenario: you have a research pipeline where: The orchestrator is a deepagents agent (LangGraph-backed)The research worker is a CrewAI crew (good at parallel research tasks)The code worker is an OpenAI agent (good at code + sandboxed execution) Today, wiring this up requires: Python # Option 1: Wrap every non-deepagents agent as a plain function tool # Loses: streaming, cancellation, structured error handling @tool def run_crewai_research(query: str) -> str: crew = ResearchCrew() result = crew.kickoff(inputs={"query": query}) return str(result) # no streaming, no structured output # Option 2: Host each agent as an HTTP service and call it manually # Loses: shared context, standard error handling, progress tracking import httpx @tool async def call_openai_agent(task: str) -> str: async with httpx.AsyncClient() as client: response = await client.post( "http://openai-agent-service/run", json={"task": task}, ) return response.json()["result"] With ACP, the same cross-framework delegation looks like this: Python from deepagents.middleware import AsyncSubAgentMiddleware, AsyncSubAgent # Any ACP-compliant server — regardless of what framework runs inside orchestrator = create_deep_agent( model="anthropic:claude-sonnet-4-6", middleware=[ AsyncSubAgentMiddleware(subagents=[ AsyncSubAgent(name="research", url="http://crewai-acp-server:8080"), AsyncSubAgent(name="coder", url="http://openai-acp-server:8081"), ]), ], ) The wrapping frameworks are invisible. The protocol standardizes streaming, cancellation, and structured error handling. Where Is This Heading? The inter-agent space is actively consolidating around a few patterns: ACP is the most explicit attempt at standardization. It bets that an open schema survives framework churn. You can swap the worker implementation without changing the orchestrator.Handoffs are winning on simplicity within the OpenAI ecosystem. For teams already on the OpenAI SDK, they're ergonomic and production-proven. The cross-framework limitation only matters if you leave the ecosystem.The agent-delegation model (Pydantic AI via agent-as-tool, AutoGen's event-driven redesign) is a better fit for peer networks where no single orchestrator coordinates everything.Framework-native protocols will likely remain dominant for the near term. Cross-framework standardization requires enough pain from fragmentation to motivate all players -> we're getting there, but not quite yet. If you're building a multi-agent system today and expect to stay within one framework, pick that framework's native protocol. If you're building infrastructure that needs to orchestrate agents across frameworks or you expect your team to evaluate multiple frameworks, investing in ACP-compatible interfaces from the start gives you the most flexibility.

By Ninaad Rao

Why Push-Based Systems Fail at Scale — and How Hybrid Fan-Out Fixes It

Real-time systems look simple on architecture diagrams. A user posts content, the backend publishes an event, and connected users instantly receive notifications through persistent WebSocket connections. At small scale, the model works beautifully. At large scale, it becomes one of the fastest ways to melt distributed infrastructure. Most push-based architectures fail for one reason: they assume traffic is evenly distributed. Production traffic never is. One user may have 50 followers. Another may have 10 million. Designing both scenarios using the same fan-out strategy creates massive operational problems during peak traffic. That is why large-scale platforms evolved from naive push delivery into hybrid push/pull systems optimized around uneven load distribution. The Naive Push Architecture The first design most engineers create is straightforward: A user publishes a postThe backend sends the event to a brokerWebSocket servers receive the eventNotifications are pushed to all connected followers On paper, the architecture looks clean. The system appears scalable because: WebSockets provide real-time deliveryBrokers decouple servicesHorizontal scaling seems possible But hidden underneath the simplicity is a dangerous scaling assumption: every user generates similar traffic patterns. That assumption collapses the moment a celebrity account posts. The Celebrity Fan-Out Problem Imagine a user with 10 million followers posting a new update. The system now attempts to: Generate millions of delivery events,Route them through brokers,Maintain millions of active socket writes,Deliver updates almost simultaneously. The bottleneck is no longer application logic. The bottleneck becomes: Broker throughputConnection managementQueue depthNetwork bandwidthRetry amplification This is where many real-time systems fail in production. As delivery pressure increases: Queues begin backing upConsumers lag behindWebSocket nodes become saturatedLatency grows from milliseconds into seconds or minutes Then retries begin. Clients retry because acknowledgments are delayed. Servers retry because deliveries fail. Load balancers redistribute unstable traffic. The system begins amplifying the overload condition itself. This behavior is common in distributed systems: Reliability mechanisms designed to recover from failure end up accelerating collapse under overload. The architecture appears stable during normal traffic. It fails at the exact moment traffic matters most. Why Pure Push Architectures Break The real issue is fan-out-on-write. Every post immediately creates work proportional to follower count. For small accounts, this is inexpensive. For celebrity-scale accounts, a single write operation generates massive downstream pressure: Enormous queue pressureHigh-volume socket deliveryEnormous broker traffic The system becomes optimized around worst-case fan-out instead of average workload. That is operationally expensive and difficult to stabilize. This is why most large-scale feed systems avoid pure push delivery for all users. The Hybrid Push/Pull Model Modern systems solve the problem differently. Instead of treating every account identically, they dynamically switch between: Push-on-writePull-on-read The decision is usually based on follower thresholds. Push-on-Write for Small Accounts For smaller accounts: Updates are immediately pushed,Queue workers fan out notifications,Followers receive low-latency real-time updates. This keeps the user experience fast while infrastructure costs remain manageable. Pull-on-Read for Large Accounts For celebrity-scale accounts: Posts are stored normallyFan-out is avoidedFeeds are assembled when users open the app Instead of generating millions of writes immediately, the workload shifts to read time. This dramatically reduces broker pressure and prevents large fan-out storms from destabilizing the platform. Twitter/X publicly discussed similar strategies years ago because global push fan-out becomes prohibitively expensive at scale. The important engineering insight is: Push and pull are not competing architectures. They are complementary scaling strategies selected dynamically based on traffic patterns. Feed Assembly Introduces New Complexity Once systems adopt pull-on-read, another problem appears: feed assembly. Now the platform must dynamically build personalized feeds using: Follower relationshipsRanking algorithmsMuted usersBlocked accountsRecent activityRecommendation signals This shifts complexity from writes to reads. To reduce repeated database work, systems commonly introduce: Redis timeline cachesMaterialized feed viewsAsynchronous feed buildersHot-feed caching layers The challenge becomes balancing: FreshnessLatencyConsistencyInfrastructure costCache invalidation The architecture is no longer just “real-time delivery.” It becomes distributed workload management. WebSockets Make Infrastructure Stateful Many system design discussions stop once WebSockets are introduced. Production systems become significantly harder after that point. WebSockets create stateful infrastructure. Now the platform must know: Which user is connectedWhich server owns the connectionHow to recover missed events after reconnects This changes routing behavior completely. Requests can no longer be routed blindly across stateless servers. Most systems introduce: Sticky sessions,Session affinity,Distributed connection registries,Redis pub/sub coordination. Then mobile networks create another challenge: temporary disconnects. A user loses connectivity for three seconds. What happened during that gap? Without replay recovery, notifications disappear permanently. Replay Buffers and Recovery Logic Reliable real-time systems usually implement: Sequence IDsReplay buffersReconnect checkpointsGap recovery logic When the client reconnects: It sends the last processed sequence IDThe server identifies missing eventsReplay buffers resend missed messagesLive streaming resumes This is where systems move beyond interview-level architecture. The challenge is no longer simply delivering events. The challenge is maintaining continuity during instability. Real-world distributed systems spend enormous engineering effort handling: Partial failuresReconnect stormsDuplicate deliveryInconsistent network conditions Operational Tradeoffs Teams Often Underestimate One of the biggest mistakes in real-time architectures is optimizing only for delivery speed while ignoring operational cost. Push-heavy systems keep large numbers of persistent connections open simultaneously. At global scale, this introduces pressure across multiple infrastructure layers: Connection memory usageBroker throughputNetwork egressHeartbeat trafficReconnect storms during outages Even healthy systems can become unstable during regional network disruptions. For example, if thousands of mobile clients reconnect at the same time after a temporary outage, WebSocket gateways may suddenly experience authentication spikes, replay requests, and connection churn simultaneously. This often creates secondary overload events long after the original incident is resolved. This is why mature systems introduce additional controls such as: Connection rate limitingReplay window expirationBackpressure handlingCircuit breakersAdaptive retry strategies Another overlooked problem is message ordering. In distributed fan-out systems, messages may arrive out of order because events are processed asynchronously across multiple workers or partitions. Without sequence tracking, users may briefly see inconsistent timelines or duplicate notifications. Production-grade systems therefore prioritize the following instead of assuming perfect real-time synchronization: Idempotent delivery,Sequence-aware replay,Eventual consistency handling The engineering challenge is not simply pushing events quickly. The challenge is maintaining stability while millions of users interact with the platform under unpredictable traffic conditions. Final Thoughts Most distributed systems look elegant until traffic becomes uneven. That is the hidden reality behind large-scale architecture. The difficult part is not handling average load. The difficult part is surviving pathological load without collapsing the platform. Real systems evolve through operational pain: Broker saturationRetry stormsReplay failuresQueue buildupCascading latency amplification The best architectures are rarely the simplest ones. They are the ones that continue functioning when the system is under maximum stress. In distributed systems, every design is ultimately a negotiation between: LatencyThroughputDurabilityAvailabilityCost Those forces shape every scalable platform on the internet. The systems that survive at scale are not the ones with the cleanest diagrams. They are the ones designed to absorb failure without collapsing under pressure. References Apache Kafka DocumentationRedis Pub/Sub DocumentationWebSocket Protocol RFC 6455Twitter Scalability Architecture DiscussionDesigning Data-Intensive Applications by Martin KleppmannGoogle SRE Book — Handling Overload

By Jayapragash Dakshnamurthy

One Stolen Key, One Stolen Token: Why Machine Identity Is Cloud-Native's Quietest Crisis — and the Only Fix That Actually Holds

On December 2, 2024, a security vendor called BeyondTrust noticed something wrong inside its own AWS account. By the time the investigation closed, the story that emerged was almost absurdly simple for something with this much fallout: an attacker — later attributed to the Chinese state-sponsored group Silk Typhoon — had used a software flaw to reach into a BeyondTrust cloud account and pull out an API key. Not a password. Not a phishing victim's login. A string of characters that a piece of software used to talk to another piece of software. With that one key, the attacker walked straight into the U.S. Department of the Treasury, reset internal passwords, accessed workstations inside the Office of Foreign Assets Control, and read unclassified documents before anyone noticed. The Treasury disclosed it to Congress on December 30. The Department of Justice indicted the alleged operators in March 2025. If you've never worked in security, here's the plain-English version of what happened: somewhere inside the machinery that runs modern software, there's almost always a "key" — a credential one computer program shows another to prove it's allowed to be there. Humans log in with passwords and, increasingly, a second factor on their phone. Software mostly doesn't. It just holds a key, often for months or years at a time, and whoever holds that key gets treated as trustworthy, no questions asked. The Treasury breach happened because one of those keys ended up in the wrong hands and nothing else stood between that key and a federal agency's internal documents. Two months later, a different flavor of the same problem produced the largest theft of digital assets in history. $1.5 Billion, One Developer's Laptop In February 2025, the cryptocurrency exchange Bybit lost approximately $1.5 billion in Ethereum in a single operation. Palo Alto Networks' Unit 42 threat research team later tied the attack to Slow Pisces, a North Korean state-linked group also known as Lazarus or TraderTraitor, and traced the entry point back to a developer at a third-party vendor that managed Bybit's multi-signature wallet infrastructure. The attackers didn't break Ethereum's cryptography. They stole that developer's AWS session tokens — another form of machine credential — and used them to gain administrative access to cloud infrastructure that could authorize transactions, then quietly altered what a routine-looking transaction actually did before it executed. Unit 42 then found the same pattern at a second cryptocurrency exchange later in 2025, this time running through Kubernetes, the orchestration system that now runs much of the cloud-native world. The attackers phished a developer, used the access on the developer's machine to drop a malicious workload directly into the exchange's production Kubernetes cluster, and had that workload expose its own service account token — a credential Kubernetes automatically hands to every running pod so it can talk to the cluster's control plane. The stolen token happened to belong to a CI/CD management identity with sweeping permissions. From there, the intruders queried secrets across namespaces, planted a backdoor, and pivoted into the exchange's cloud-hosted backend, reaching the financial systems behind it. Unit 42's broader research found suspicious activity consistent with service-account-token theft in 22 percent of cloud environments analyzed in 2025, and recorded a 282 percent year-over-year jump in Kubernetes-directed attacks overall. Different industries, different attackers, same root cause: a non-human credential that was both long-lived and broader in scope than the task in front of it ever needed. Why This Keeps Happening Identity and access management, as a discipline, was built for people. People have managers, onboarding dates, performance reviews, and an HR system that flags them the day they leave. A workload has none of that. A microservice can spin up, do its job, and disappear thousands of times a day; a service account, by contrast, often gets created once and never revisited again. CyberArk's research has been blunt about the resulting imbalance: machine identities now outnumber human ones by more than 80 to 1 in the average enterprise, and the security architecture protecting most of them still assumes the old, human-shaped world — an org chart, not a fleet of ephemeral containers. That mismatch is exactly why static secrets sprawl the way they do. A developer hardcodes a key during a deadline crunch, intending to externalize it "later." A Terraform state file ends up holding plaintext cloud credentials because nobody flagged it in review. A default Kubernetes service account token, more permissive than anyone realized, gets mounted into a pod by default because turning that off requires deliberate configuration most teams never get around to. None of these are exotic mistakes. They're the ordinary residue of moving fast, and they accumulate the way unpaid debt does — quietly, until the day someone calls it in. The structural fix has a name by now, even if adoption is uneven: frameworks like SPIFFE and its production runtime SPIRE replace the static key with a short-lived, cryptographically attested identity — something closer to a backstage pass that's reissued before every single show rather than a master key cut once and handed out forever. A workload proves what it actually is — which Kubernetes service account launched it, which container image it's running — and receives an identity document valid for minutes, not months. Steal that, and an attacker is racing a clock that resets automatically rather than one that only resets when a human notices something is wrong. Cloud providers offer narrower versions of the same idea for their own platforms — AWS's IAM Roles for Service Accounts, Google's Workload Identity Federation — letting a workload trade a short-lived token for cloud access instead of carrying a standing key in the first place. But identity alone doesn't close the loop, and this is the part most "zero trust" conversations skip past. None of it matters if nothing in your pipeline actually enforces it. Security By Design Is a Promise. CI/CD Is Where You Find Out If It's Kept. Plenty of organizations will tell you, with complete sincerity, that they practice "security by design." Most of them mean it stopped at an architecture review months before the first line of code shipped. That's not a fix, it's a memory of one. Code that deploys daily — sometimes hourly — doesn't wait for an annual audit to catch a misconfigured token or an over-privileged service account, and by the time a quarterly review would have caught the BeyondTrust-style key or the Bybit-style session token, the damage in both real cases was already done. The only version of "security by design" that survives contact with a real production pipeline is the one written as code and enforced automatically, at every stage, by something that can actually say no. Picture the pipeline this way: Plain Text Developer commits code | v CI build triggers | +--> SAST (code flaws) + SCA (dependency CVEs) + secrets scan | | | fail? -----> build blocked, developer notified | | | pass v Generate SBOM + sign artifact (Cosign) + build provenance (SLSA) | v Policy-as-code gate (OPA / Kyverno) | +--> checks: image from approved registry? running as non-root? | signature valid? provenance matches expected builder? | service account scoped to least privilege? | | fail? -----> deployment rejected, logged, alert raised | pass v Deploy to production | v Runtime monitoring + short-lived workload identity (SPIFFE/SPIRE, IRSA) | v Continuous re-verification — nothing trusted indefinitely Every box in that chain is a place where the Treasury breach or the Bybit breach could have stopped instead of escalating. A policy-as-code rule using Open Policy Agent's Rego language, or Kyverno's Kubernetes-native YAML equivalent, can flatly refuse to schedule a pod requesting broader RBAC permissions than its declared task needs — which would have directly undercut the over-privileged CI/CD identity that the crypto-exchange attackers rode into the cluster. A signing and attestation step using Cosign, tied to SLSA provenance, means a deployed artifact has to prove which build system actually produced it before it runs at all — closing exactly the kind of trust gap that let a single compromised AWS asset cascade into a stolen infrastructure API key at BeyondTrust. None of this is theoretical tooling. Red Hat's own Enterprise Contract documentation describes signing as tying an image to a specific builder identity precisely so an attacker can't substitute a malicious binary without the signature itself breaking and announcing the tampering. The Uncomfortable Bottom Line I don't think either of this year's headline breaches happened because anyone involved was careless in some obvious, fireable way. They happened because the credential — not the firewall, not the encryption, not the cleverness of the malware — was the actual asset under attack the entire time, and almost nothing downstream of "the key worked" was built to ask a second question. Gartner named non-human identity management a top strategic security trend for exactly this reason in 2025, and OWASP followed with a dedicated Non-Human Identity Top 10 the same year, an overdue acknowledgment that the tooling built for human logins was never going to be enough. My honest prediction, watching this pattern repeat across a federal agency and two of the largest crypto exchanges on earth within twelve months of each other: the organizations that treat policy-as-code enforcement and short-lived machine identity as default infrastructure — not optional hardening bolted on after an incident — are the ones that won't end up writing the next version of this story. Everyone else is currently running on borrowed time, secured by a key that, statistically, is already older than it should be.

By Igboanugo David Ugochukwu

CORE

Why AI-Generated Code Is Making Regression Testing More Important, Not Less

There is a widespread assumption circulating in engineering teams right now that goes something like this: if AI can write code faster, it probably makes testing less of a bottleneck too. The logic seems reasonable on the surface. Faster code, faster tests, faster everything. This assumption is wrong, and teams that act on it are going to find out the hard way. AI-generated code does not reduce the need for regression testing. It amplifies it. And the teams that understand this early will have a significant quality advantage over those that do not. The Fundamental Misunderstanding When developers use AI coding assistants to generate functions, services, or entire modules, they are not producing code that has been verified against the real behavior of their system. They are producing code that is syntactically correct and structurally plausible, written by a model that has no knowledge of how their specific application actually runs in production. This is a critically important distinction. A human developer who has worked on a codebase for months carries implicit knowledge about which edge cases matter, which downstream services are flaky, and which data patterns appear in production that were never anticipated in the original requirements. An AI model has none of this context. It produces code that looks right and often is right for the happy path, but it has no way of knowing what the code needs to handle in the real world. The result is a class of defects that regression testing is uniquely positioned to catch: behaviors that work in isolation but break in the context of the full system. The Velocity Trap Here is where teams get into trouble. AI coding tools are genuinely fast. Developers using them can produce working code at a rate that was not possible before, and the productivity gains are real. But velocity without verification is just a faster path to production failures. The pattern plays out predictably. A team adopts AI coding assistance, development speed increases, the engineering leadership is happy, and everyone agrees to keep moving fast. What nobody adjusts is the regression testing strategy. The test suite that was sized for the previous pace of development is now covering a larger surface area of code, generated at higher volume, by a process that has no awareness of production context. Coverage gaps compound quietly. Nobody sees them until something breaks in production in a way that takes two days to trace back to a function that an AI wrote last sprint and nobody fully read. What AI-Generated Code Actually Gets Wrong The failures that emerge from inadequate regression coverage of AI-generated code tend to cluster in specific areas. Integration points are the most common failure zone. AI generates code based on interfaces and contracts. It looks at API signatures, function definitions, and data schemas. What it cannot see is how those contracts actually behave when real traffic flows through them. Consider a realistic scenario: an AI-generated service calls a downstream payment processor using the documented API specification. The code is technically correct. But the payment processor returns a slightly different response shape when a transaction is declined due to insufficient funds versus when it is declined due to a card expiry. The specification documents neither distinction. The AI has no way to know they exist. A regression suite built from real production traffic would catch this within the first test run. A regression suite built from the same specification the AI used to write the code will not catch it until a customer sees a wrong error message in production. Mock drift compounds the problem. When tests for AI-generated code are written using mocked dependencies, those mocks represent what the developer or AI thought the dependency would do. Over time, the real dependency changes and the mocks do not. Tests keep passing, the real behavior keeps drifting, and the regression suite provides false confidence rather than real coverage. AI-generated code optimizes for the stated requirement. It handles the case described in the prompt competently. It does not handle the cases that were not in the prompt: the empty array that should return a specific error, the timestamp that crosses a timezone boundary, the concurrent request that triggers a race condition. These are edge cases that only emerge from real usage patterns, and they are precisely what a regression suite built from real traffic catches where tests written from requirements do not. The Regression Testing Response Understanding these failure modes points directly to what needs to change in regression testing strategy when AI-generated code becomes part of the development process. Test generation needs to be grounded in real behavior, not assumed behavior. The traditional model of writing tests based on requirements becomes increasingly insufficient when the code being tested was generated by a model that had access only to those same requirements. The regression suite ends up testing exactly what the AI thought the code should do. Tests need to be grounded in what the system actually does when real requests flow through it. Integration test coverage becomes more important than unit test coverage. AI-generated code can usually pass unit tests because it generates syntactically correct implementations of isolated functions. The failures emerge at integration points. Regression testing that focuses on the integration layer, verifying that services interact correctly under realistic conditions, catches the class of failures that AI-generated code is most likely to introduce. Regression coverage should update continuously rather than incrementally. The pace of development with AI assistance creates a situation where code is being added to the codebase faster than manual test authoring can keep up. If the regression suite is maintained manually, it will always be behind. Coverage needs to grow with the codebase automatically, derived from real usage rather than added by developers who are already stretched by higher output demands. Production behavior should feed back into test validation. Closing the loop between how the system behaves in production and what the regression suite is testing is one of the most important shifts a team can make. When tests are derived from actual production traffic rather than written specifications, the mock drift problem largely disappears because the tests reflect what services actually do, not what developers assumed they would do. The Counter-Intuitive Conclusion There is a temptation to see AI-generated code and automated testing as solving the same problem from different angles. If AI can generate both the code and the tests, the reasoning goes, maybe the coverage problem solves itself. It does not. An AI that generates code and then generates tests for that code is essentially testing its own assumptions about how the code should behave. It will consistently produce tests that pass against the code it wrote, and those tests will systematically miss the gap between what the AI thought the code should do and what the system actually needs to do under production conditions. The gap between AI intent and production reality is exactly where regression testing has always been most valuable. AI-generated code makes that gap wider, not narrower, because the code is being written by something with no production experience at all. The teams that treat AI coding assistance as a reason to invest less in regression testing will eventually face production incidents that trace directly to this decision. The teams that treat it as a reason to invest more, particularly in coverage grounded in real system behavior rather than written specifications, will find that AI assistance genuinely accelerates development without accumulating the hidden quality debt that comes with uncovered integration failures. The Bottom Line Regression testing was never just a safety net. It is the mechanism by which a team validates that their understanding of the system matches how the system actually behaves. When AI is generating the code, that validation matters more than ever, because the code is now written by something that has never seen your system run. Invest accordingly.

By Sancharini Panda