Testing, Deployment, and Maintenance Resources

DZone's Featured Testing, Deployment, and Maintenance Resources

On-Device Debugging and JUnit 5

By Shai Almog

This is the first follow-up to Friday's release post, and it covers the two changes from this release that affect how you iterate on a Codename One app rather than what the app itself does: on-device debugging that treats Java as Java on a real iPhone or a real Android device, and standard JUnit 5 against the JavaSE simulator. The first is the one we have been wanting for a long time and the one that takes the most explaining, so most of the post is about it.

On-Device Debugging That Treats Java as Java

Codename One has always supported on-device debugging in the strict technical sense. You could attach Xcode to a .ipa, you could attach Android Studio to a running APK, you could read the native call stack, you could step through Objective-C or the C that ParparVM emits. What you could not do was set a breakpoint in MyForm.java, hit it on a real iPhone, and inspect a Java field on a Java object as a Java object. You also could not debug an iOS app without a Mac in the loop somewhere, because the only debugger that understood the binary was Xcode. The translation step between the Java you wrote and the C that ParparVM produces left no way back across the gap on the device.

PR #4999 (iOS) and PR #5012 (Android) close that gap. As of this week, any JDWP-speaking debugger (IntelliJ IDEA, jdb, VS Code's Java Debugger, Eclipse, NetBeans) can attach to a Codename One app and treat the running process as a JVM.

Supported targets:

iOS

- The iOS Simulator (requires a Mac, because the iOS Simulator only runs on a Mac)
- A real iPhone reached over Wi-Fi from the developer machine on the same network

You do not need a local Mac to debug on a real iPhone. The Codename One build cloud runs the iOS build for you and produces a signed .ipa; install it on your iPhone the usual way (TestFlight, ad-hoc, or the standard Build Cloud install link), and the JDWP attach over Wi-Fi works from a Linux or Windows IDE just as well as from a Mac.
The Mac is only required for the local Xcode build path and for running the iOS Simulator.

Android

- The Android emulator
- A real Android phone over USB
- A real Android phone over wireless adb

The Android attach uses standard adb, so you need the Android SDK platform tools installed on the developer machine. Those are available on macOS, Linux, and Windows, so any of the three is fine for Android debugging.

What It Looks Like

A breakpoint inside an iOS app, hit on the iOS Simulator next to IntelliJ IDEA, opens the same Debug tool window you use for any other Java project. The frames panel on the left has the full Java call stack. The Variables panel shows this and the locals as Java values, with the same drill-down you would get on a regular JVM. The simulator on the right is the real iOS app, paused at the breakpoint, waiting for the next step.

How the Pieces Fit Together

On iOS, the IDE never talks to the device directly. The CN1 Debug Proxy is a small Java process you run on your developer machine. It binds two TCP ports: one for the iOS app to dial into using the CN1 wire protocol, and one that speaks standard JDWP for the IDE. The IDE sees a normal remote JVM. The iOS app sees a debug proxy. The proxy translates between the two and walks the ParparVM struct layout so Java fields, method calls, and values round-trip cleanly in both directions.

On Android, the proxy is unnecessary. Dalvik and ART implement JDWP themselves, so IntelliJ attaches directly to the device through adb's built-in JDWP forwarder. The Maven plugin's new cn1:android-on-device-debugging goal does the adb orchestration and the port forwarding for you.

A capability difference between the two platforms is worth knowing up front: on Android, a native interface's Impl class is regular Java, so the JDWP attach steps through it the same way it steps through any other class in your project. On iOS the Impl is Objective-C, which JDWP does not speak, so you cannot step through it from the IDE.
You can still step through the Codename One framework code and your own Java up to and through the native-interface call, and you can inspect the value the call returns; the body of the Objective-C method is the only thing that is opaque from the JDWP side. Attach Xcode in parallel if you need to step through the Objective-C as well.

Tutorial: IntelliJ + iOS

The Codename One archetype now generates two run configurations under an On-Device Debug folder in the IntelliJ run-config dropdown: CN1 Debug Proxy and CN1 Attach iOS. The tutorial below assumes a project generated from the Initializr recently enough to have those. If you have an older project, generate a new project with the Initializr and copy over the .idea directory and the Maven pom.xml files.

1. Enable the Build Hints

Open common/codenameone_settings.properties and uncomment the four lines the archetype generated:

    ios.onDeviceDebug=true
    ios.onDeviceDebug.proxyHost=127.0.0.1
    ios.onDeviceDebug.proxyPort=55333
    ios.onDeviceDebug.waitForAttach=true

ios.onDeviceDebug=true flips the iOS build into the instrumented variant. The proxyHost and proxyPort hints configure the proxy connection. The fourth hint, ios.onDeviceDebug.waitForAttach=true, is the block-on-load option, and we recommend leaving it on. With it enabled, the iOS app shows a "Waiting for debugger" overlay at launch and does not progress past Display.init until the proxy issues its first resume.

The recommendation is mostly about making the on-device-debug variant visible. Without the overlay it is easy to launch an on-device-debug build expecting the debugger to attach and not realize it is silently waiting for a proxy that is not running, and it is also easy to mistake an on-device-debug build for a regular build and then be surprised when it does not perform as smoothly as the release variant. The overlay rules out both of those.

For a physical iPhone the proxyHost value should be the laptop's LAN IP (run ifconfig | grep "inet " to find it) rather than 127.0.0.1.
The iOS Simulator can always use 127.0.0.1.

2. Build the iOS App

Either path works:

- Local Xcode build (mvn cn1:buildIosXcodeProject) and then run from Xcode.
- Cloud build for a real device (mvn cn1:buildIosOnDeviceDebug) and install the resulting .ipa.

Both produce an iOS binary instrumented for on-device debugging because the build hint is set.

3. Start the Proxy

In IntelliJ, pick CN1 Debug Proxy from the run-config dropdown and click the green Run button (not the bug icon; Debug on this config would attach IntelliJ to the proxy itself, which is not what you want). The Run tool window shows:

    On-device-debug proxy starting:
      symbols : .../cn1-symbols.txt
      device  : listening on tcp://0.0.0.0:55333
      jdwp    : listening on tcp://0.0.0.0:8000
    [device] listening on port 55333 for ParparVM app to dial in

When the [jdwp] line appears, the proxy is ready.

4. Attach the Debugger

Switch the run-config dropdown to CN1 Attach iOS and click the Debug button. IntelliJ connects to localhost:8000 and opens its standard Debug tool window. You can now set breakpoints anywhere in your Java code or in the framework.

5. Launch the App

Launch the iOS app under the iOS Simulator (from Xcode) or on the tethered device. With waitForAttach=true it pauses at the "Waiting for debugger" overlay until the proxy issues its first resume. Hit Resume on the IntelliJ Debug toolbar; the app proceeds, and your breakpoints fire as the app exercises them.

The proxy's Run window is also your device console. Anything the app writes to System.out, Log.p, printf, or NSLog from native code is forwarded to the proxy and printed in the CN1 Debug Proxy Run window with a [device] prefix. This is genuinely useful and is one fewer thing you need Xcode for. The caveat is that the forwarding starts when the proxy connection is established, so output written during the very first millisecond of process launch (before Display.init) is not always captured.
If you need every byte from t=0, attach Xcode's console for that specific run.

Tutorial: IntelliJ + Android

Android is simpler because the proxy is not needed. The archetype generates two run configurations under the same On-Device Debug folder: CN1 Android On-Device Debug (Maven, builds and installs the APK and forwards JDWP) and CN1 Attach Android (Remote JVM Debug at localhost:5005).

1. Enable the Build Hint

In common/codenameone_settings.properties:

    android.onDeviceDebug=true

This single hint flips the manifest to debuggable="true" and turns R8 / ProGuard off for this build. Release builds without the hint are unaffected.

2. Run CN1 Android On-Device Debug

This configuration picks up the hint, builds the APK, installs it on the connected device or emulator, sets the debug-app for wait-for-attach, launches the Activity, forwards JDWP to localhost:5005, and streams logcat --pid=<pid> into the Run window with a [device] prefix.

For wireless adb, pass -Dcn1.android.onDeviceDebug.wireless=<ip:port> and the goal will adb connect before installing. Both the Android 11+ adb pair flow and the legacy adb tcpip flow work.

3. Attach the Debugger

Switch to CN1 Attach Android and click Debug. IntelliJ connects to localhost:5005. Set breakpoints anywhere; they fire when exercised. Source resolution covers both the codenameone-core and codenameone-android sources jars, so breakpoints inside the framework or inside the Android port resolve to the right files.

On Android, native interfaces are themselves Java, so a breakpoint inside the Impl class of your own native interface fires just like a breakpoint anywhere else in your code; you can step through the implementation, inspect locals, and evaluate expressions the same way.

The dev guide has the full reference, including the wireless-pairing flows, the VS Code and Eclipse equivalents, and a troubleshooting section: iOS on-device debugging and Android on-device debugging.
When to Use It (and When Not To)

For most bugs, the JavaSE simulator is still, by a large margin, the fastest loop. Reach for on-device debugging when the bug is platform-specific: ParparVM-specific threading, an iOS-only layout glitch under the modern native theme, a real-radio Bluetooth interaction, a Touch ID gate, an Android-only manifest interaction, anything that only reproduces under iOS background memory pressure. The kind of bug that previously sent you reaching for Log.p and a rebuild loop. That bug now has a debugger pointed at it.

JUnit 5 Against the Simulator

The other change in this release is the new JUnit 5 integration in the JavaSE port (PR #5032). To be clear about what this is: it is standard JUnit 5. There is no fork of JUnit in com.codename1.testing.junit. That package holds a small set of annotations and a CodenameOneExtension that plugs into the regular JUnit Jupiter lifecycle. You write @Test methods using org.junit.jupiter.api.Test, you assert with org.junit.jupiter.api.Assertions, and your IDE's native test runner picks them up the way it does on any other Java project.

Why a separate integration at all? The legacy com.codename1.testing.AbstractTest framework, driven by the cn1:test Maven goal, still exists and is still the only way to run tests on a real iOS or Android device (JUnit Jupiter is not available on ParparVM). The trade-off is that AbstractTest tests have to compile under the Codename One device subset, with no reflection, no java.net.http, no java.nio.file, no Mockito, no AssertJ, no assertThrows. JUnit-style tests run only on the JavaSE simulator JVM, but the JVM is a regular JVM, so reflection, Mockito, AssertJ, and parameterized tests are all available.

Both styles coexist in the same project under common/src/test/java. You pick per test class. The runners discover disjoint sets (cn1:test looks for UnitTest implementers; Surefire looks for @Test methods), so a mvn install runs both passes in the same phase without overlap.
A Minimal Test

Tests live in common/src/test/java. The shape most apps want is one that boots the project's app class through the same init / start sequence the simulator uses, then asserts against the form the app actually opens:

    package com.example.myapp;

    import com.codename1.testing.junit.CodenameOneTest;
    import com.codename1.testing.junit.RunOnEdt;
    import com.codename1.ui.CN;
    import com.codename1.ui.Display;
    import com.codename1.ui.Form;
    import org.junit.jupiter.api.Test;

    import static org.junit.jupiter.api.Assertions.assertEquals;
    import static org.junit.jupiter.api.Assertions.assertTrue;

    @CodenameOneTest
    class GreetingFormTest {

        @Test
        @RunOnEdt
        void formShowsExpectedTitle() {
            MyAppName app = new MyAppName();
            app.init(null);
            app.start();
            assertEquals("Hi World", Display.getInstance().getCurrent().getTitle());
            assertTrue(CN.isEdt(), "@RunOnEdt method runs on the Codename One EDT");
        }
    }

That is more useful than constructing a Form directly in the test because it exercises the same startup path the simulator runs. The assertions check the form your app opens, not a form the test wrote.

The natural way to run it is from the IntelliJ gutter. Click the green icon next to the class declaration, and the results land in the standard Run tool window. Click the green icon next to a specific @Test method to run just that method. The same flow works in VS Code's Test Explorer and in Eclipse's JUnit view.

If you prefer the command line:

    mvn -Ptest test                                # run the JUnit suite
    mvn -Ptest test -Dtest=GreetingFormTest        # one class
    mvn -Ptest test -Dtest=GreetingFormTest#formShowsExpectedTitle

@CodenameOneTest is the class-level entry point. It wires the simulator extension into the JUnit Jupiter lifecycle, boots Display.init(null) once per JVM (idempotent, so subsequent classes share the same Display), and skips the class with a TestAbortedException if the JVM is genuinely headless (so CI runners that have no display do not poison the rest of the run).
@RunOnEdt dispatches the test body through CN.callSerially, which is what you want any time the body touches UI state. It rethrows the body's exceptions on the JUnit thread so the stack trace stays clickable in the IDE. Place it on a method to cover one test, or on the class to apply it to every test.

A Couple More Common Cases

A test that exercises a plain validator, with no UI involved at all:

    @CodenameOneTest
    class EmailValidatorTest {

        @Test
        void rejectsEmptyString() {
            assertFalse(new EmailValidator().isValid(""));
        }

        @Test
        void acceptsCommonAddress() {
            assertTrue(new EmailValidator().isValid("[email protected]"));
        }
    }

This is the "pure model code" shape. No @RunOnEdt, no UI, runs on the JUnit worker thread, fast.

A test of a form under a specific visual configuration:

    @CodenameOneTest
    class GreetingFormVisualTest {

        @Test
        @RunOnEdt
        @DarkMode
        @LargerText(scale = 1.6f)
        void titleStillFitsInDarkModeAtAccessibilityScale() {
            new GreetingForm().show();
            Form current = Display.getInstance().getCurrent();
            assertEquals("Hello", current.getTitle());
            assertTrue(current.getPreferredW() <= Display.getInstance().getDisplayWidth());
        }
    }

The visual-config annotations (@Theme, @DarkMode, @LargerText, @Orientation, @RTL) apply on the EDT in one batch, followed by a single theme refresh, so the test body sees the simulator in the exact configuration you asked for without flicker.

A test that injects a custom property for the duration of one method:

    @Test
    @RunOnEdt
    @SimulatorProperty(name = "feature.flag", value = "on")
    void newCodePathRunsWhenFlagIsOn() {
        // Display.getProperty("feature.flag", "off") returns "on" here
        runFeature();
        assertEquals("expected", Display.getInstance().getCurrent().getTitle());
    }

Class-level @SimulatorProperty applies to every method in the class. Method-level overrides class-level. Use the container @SimulatorProperties for more than one (the package source level rules out @Repeatable).
The full reference, including the dependency-block XML for common/pom.xml and javase/pom.xml and the @Theme / @Orientation / @RTL details, is at Testing with JUnit 5 in the developer guide.

Wrapping Up

That is the workflow half of this release. Tomorrow's post covers the new platform APIs that moved into the core this week: AI and OAuth/OIDC are the headline pieces, with wifi/connectivity and a few smaller items alongside them.

Back to the weekly index.

Building an Agentic Incident Resolution System for Developers

By Pavan Belagatti

Agentic engineering gets really interesting when it moves beyond dashboards and alerts and starts taking action. One of the clearest places to apply it is incident response. Instead of waking someone up at 2:00 a.m. just to answer basic questions, I can build a system that understands what broke, who owns it, what changed recently, what the dependencies are, and whether the problem can be healed automatically.

That is exactly what I set up with Port as the context layer and Datadog as the monitoring and tracing layer. Datadog tells me something is wrong. Port tells me what that thing means inside the organization. Once those two are wired together with automation, I get a practical example of agentic engineering in action: incidents can be investigated, enriched with context, auto-resolved when possible, or escalated to the right team with the right details.

The Problem That Normal Alerting Does Not Solve

Traditional monitoring is good at detection. It is not always good at decision-making. An alert can tell me that the error rate crossed 5 percent, latency spiked, or a service is unhealthy. But when an incident happens, that is rarely the only thing I need to know. I still need answers to questions like these:

- Who owns this service?
- Was there a recent deployment?
- Is this service tier one or lower priority?
- What other systems depend on it?
- Is there a runbook for this exact issue?
- Should I retry, roll back, restart, or escalate?

That gap is where a lot of incident time gets wasted. The first several minutes are often spent gathering organizational context instead of resolving the issue. In many teams, that delay is the difference between a minor blip and a major operational headache.

This is why agentic engineering matters. The goal is not only to detect events but to make systems context-aware enough to respond intelligently. In this setup, Datadog is excellent at telemetry, monitoring, tracing, and alerting.
Port fills in the missing layer: ownership, dependencies, runbooks, service criticality, and engineering context.

Why Port Is the Missing Context Layer

Think of Datadog as the system that knows what is happening technically, and Port as the system that knows what it means operationally. Datadog can tell me a service has a 15 percent error rate or that latency is high. Port can tell me that the affected service belongs to the platform team, is business critical, depends on another backend, and has a known runbook for recovery.

That combination is what turns monitoring into agentic engineering. Without context, automation is shallow. It can only react to thresholds. With context, automation can choose the right path.

In my setup, Port stores and exposes things like:

- Service ownership
- Deployment history
- Dependencies
- Criticality or tier
- Runbooks
- Tags and metadata

Once that data is available to workflows, the response system can behave more like an on-call engineer with memory instead of a blind script reacting to numbers.

How I Built the Self-Healing Incident Flow

The architecture is simple enough to understand, but powerful enough to be useful. I used a small Flask application as the demo app. Then I connected Datadog Agent so it could collect telemetry from the app and forward traces and incident signals to Datadog APM. From there, I created monitors for specific services and error conditions.

When those monitors breach a threshold, the flow kicks off through automation. I wired that into Port and also used GitHub Actions as part of the response loop. So the chain looks like this:

1. A service emits telemetry.
2. Datadog detects abnormal behavior.
3. A monitor fires when the threshold is breached.
4. Port enriches the incident with organizational context.
5. A workflow investigates and checks live signals.
6. If the issue matches a known recovery path, it gets auto-resolved.
7. If not, it gets escalated to the correct team with full context.

That is the loop I wanted. Not just alerting.
Not just dashboards. A closed-loop incident handling system. This is a practical version of agentic engineering applied to observability and incident management.

The Ingredients in the Stack

Here is what I used in the demo:

- Flask application to simulate a real service environment
- Datadog Agent for collecting telemetry
- Datadog APM and monitors for alerting and service visibility
- Port as the context and operations layer
- GitHub Actions for automated investigation and remediation steps

I also intentionally induced failures to create realistic incidents. That way, I could show the full path from alert to response, not just a static architecture slide.

What the Demo Application and Monitoring Setup Looked Like

The app itself was a straightforward shop-style Flask app with a products page. Nothing fancy there. The important part was that failures were being created behind the scenes so Datadog would pick up real service issues.

Once the Datadog agent was connected, infrastructure, logs, and services were visible inside the platform. I created monitors for multiple services such as payment, auth, shipping, and the main app. Each monitor was configured around conditions like elevated error rate or failure rate. Some services were healthy at a given moment, while others stayed in alert because the recent signal history was still above the configured threshold.

This is one reason I like using Datadog here. It gives a solid visual understanding of service behavior, traces, and incidents. But again, that is only one half of agentic engineering. The second half is turning that telemetry into informed action.

Where GitHub Actions Fits Into Agentic Engineering

GitHub Actions became the execution layer for the response workflow. Every time an incident crossed the condition I had configured, the workflow ran.
In the run logs, I could see the process authenticate with Port, fetch the entity data from Port, perform investigation steps, and then finish with either a resolution path or an escalation path.

That detail is important. The workflow was not operating blindly. It first asked Port for context. Then it checked live metrics and determined whether remediation was appropriate. This is a clean example of agentic engineering because the automation is using context, reasoning steps, and current state before acting.

The workflow outcome had two possible endings:

- Auto-resolve if the incident was recoverable through a known safe action
- Escalate with full context if the issue needed human intervention

Even when escalation happened, it was still a big improvement. The receiving team did not get a vague alert. They got ownership, summary, severity, recommended action, and the relevant runbook context. That dramatically reduces triage time.

What Auto-Resolved Incidents Actually Mean

Auto-resolution is easy to oversell, so I prefer being specific. It does not mean every incident magically disappears. It means the system can detect a known scenario, apply a predefined runbook or remediation path, confirm the signals are back to normal, and then mark the incident resolved without involving a human.

In the setup I showed, some incidents were auto-resolved, and others remained open or under investigation. That is exactly how it should work. A trustworthy agentic engineering system does not try to automate everything. It automates the safe, repeatable, low-ambiguity cases and escalates the rest.

In the incident dashboard, I could see:

- Total incidents
- Open incidents
- Datadog-sourced incidents
- Auto-resolved incidents
- Severity breakdown
- Status breakdown

One example involved a payment service problem related to a slow database query caused by a missing index. The incident showed up with high severity, the runbook was executed, and the final status was resolved through the self-healing workflow.
Other incidents stayed open because they required deeper investigation.

What I Could Inspect Inside Port Incidents

One thing I really liked in this setup was how clearly incidents could be represented inside Port. I could open a resolved incident and see the relevant details, such as severity, status, summary, and whether a runbook had already executed. I could also open an active incident and see that it was still being investigated, rather than pretending the system knew more than it did.

For unresolved issues, the incident records made it obvious that the system had preserved context for the next step. That is a big operational win. Instead of forcing the owning team to reconstruct the entire chain of events, the incident entity already captures what service is affected, what happened, what the likely area is, and what actions have already been attempted.

This is where agentic engineering becomes more than automation. It becomes operational memory.

Why This Reduces Time to Resolution

Better context shortens every stage of response:

- Detection stays with Datadog
- Context gathering comes from Port
- Execution runs through workflows
- Escalation goes to the correct team with less confusion

The result is less time spent asking basic questions and more time spent resolving the actual technical problem. That is the real productivity gain.

Agentic Observability Is a Practical Form of Agentic Engineering

I think a lot of people hear the word “agentic” and imagine something overly futuristic. But this is already a grounded, useful implementation. When I combine observability data, organizational context, and workflow execution, I get what I would call agentic observability or agentic incident response.

In other words, agentic engineering here is not about a chatbot replacing engineers. It is about giving the system enough structured knowledge and automation capability to do the boring, repetitive, high-confidence parts of operational work.
Port also supports integrations beyond Datadog. In the demo environment, I showed that I could connect various developer tools through Port data sources, including tools in the observability and alerting ecosystem such as Dynatrace, New Relic, and Sentry. That opens the door to broader engineering automation, not just incident workflows.

Using Port Chat as a Natural Language Layer

Another interesting piece was Port chat. Because the platform already had context from connected systems, I could ask a natural language question, such as requesting the auto-resolved incidents sourced from Datadog. The system then used the available tools and context to return a structured answer.

The result included the incidents that had been auto-resolved, along with useful details such as service name, severity, and resolution summary. The big takeaway was not that chat exists. Plenty of products have chat now. The useful part is that the chat interface sits on top of real engineering context.

That again is what makes it part of agentic engineering. The language layer is connected to data, workflows, and operational entities. It is not guessing. It is querying the same system of record used by the incident flow.

What This Setup Is Good For and What It Is Not

I would absolutely use this pattern for recurring operational issues where the remediation path is known and safe. Examples inside a real engineering organization might include restarting a component, rolling back a fresh deployment, running a validation step, or escalating with a complete context package. Those are ideal candidates for agentic engineering.

I would not treat it as a license to automate every production issue. High-risk actions still need guardrails. Unknown failure modes still need humans. Complex outages across multiple dependencies still need engineering judgment. The point is not to remove people from the process. The point is to remove waste from the process.
A healthy implementation usually follows this rule:

- Automate what is repetitive and reversible
- Enrich what is ambiguous
- Escalate what is risky or novel

If I stick to that, agentic engineering becomes reliable instead of reckless.

How I Think About the Value of This System

The biggest value is not that six incidents got auto-resolved in a demo. The bigger value is that the entire incident loop became structured. Datadog provided the signal. Port provided the meaning. Automation provided the action. That is the pattern I want to replicate across engineering systems.

Even when no incident is auto-resolved, the system still improves outcomes because it captures context and routes work intelligently. If an issue does get resolved automatically, even better. That means nobody has to be paged for a problem the system already knows how to fix safely.

That is the promise of agentic engineering when done properly. Not flashy demos for their own sake, but operational systems that reduce toil, improve response quality, and preserve engineering attention for harder problems.
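The automate / enrich / escalate rule from this article can be sketched as a small routing function. This is only an illustration under stated assumptions: the `Incident` fields, the `Action` names, and the `route` and `escalation_summary` helpers are hypothetical and are not part of Port, Datadog, or GitHub Actions. In a real workflow the field values would come from the context layer and live metrics.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    AUTO_RESOLVE = "auto_resolve"
    ENRICH = "enrich"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class Incident:
    # All fields are hypothetical; a real system would fill them from its catalog.
    service: str
    owner_team: str
    severity: str
    has_runbook: bool  # repetitive: a known recovery path exists
    reversible: bool   # the remediation can be undone safely
    risky: bool        # high-risk action or wide blast radius


def route(incident: Incident) -> Action:
    """Apply the rule: automate repetitive+reversible, enrich ambiguous, escalate risky/novel."""
    # Escalate what is risky or novel (no known runbook counts as novel here).
    if incident.risky or not incident.has_runbook:
        return Action.ESCALATE
    # Automate what is repetitive and reversible.
    if incident.reversible:
        return Action.AUTO_RESOLVE
    # Known scenario, but not clearly safe to undo: attach context for a human.
    return Action.ENRICH


def escalation_summary(incident: Incident) -> dict:
    """Build the context package a receiving team would see."""
    return {
        "service": incident.service,
        "owner": incident.owner_team,
        "severity": incident.severity,
        "recommended_action": route(incident).value,
    }
```

Putting the escalation check first means an incident is only auto-resolved when nothing about it is flagged as risky, which matches the article's point that automation should cover the safe, low-ambiguity cases and hand off the rest.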

Testing Is Not About Finding Bugs

By Abhinav Garg

Cutting Data Pipeline Costs and Data Freshness Issues With Netflix Maestro and Apache Iceberg: A Practical Tutorial

By Intiaz Shaik

Getting Started With GitHub Copilot CLI for Coding Tasks

By Gunter Rotsaert

Conversational Risk Accumulation: Stateful Guardrails Beyond Single-Turn LLM Checks

Why Long Chats Need Session-Level Guardrails (CRA)

Who this is for: Anyone building chat features, support bots, internal Q&A, coaching tools, RAG assistants.

The Usual Setup (and What It Misses)

A typical flow:

1. User sends a message.
2. You run moderation, rules, or a small model on that message (sometimes the reply too).
3. If it passes, the big model answers.

That is per message. It does not really “remember” the story of the chat.

In a long chat:

- Message 5 looks normal.
- Message 12 still passes your keyword list.
- By message 20, something is wrong only if you compare it to how the chat started.

So you can pass every single check and still end up with a bad session. That gap is what we call CRA: risk that adds up across turns, not in one obvious line.

Figure 1: Each turn can look “green” while the overall thread is not.

CRA in Plain English

CRA = Conversational Risk Accumulation

Idea: Each turn might look okay on its own, but together they break the purpose of the chat or what your company is okay with.

What to build: Keep a little session memory (not the full transcript in logs; think IDs, hashes, and scores). After each assistant reply, update a few numbers that describe “how this session feels right now.” Those numbers are hints for dashboards, alerts, and gentle UI, not a courtroom verdict.

Three Simple Scores + One Total (Example)

We use a small, fixed set of scores and one combined score. Version tag in code: cra_telemetry_v1.

Figure 2: Three inputs, one combined CRA score.

Score | Plain meaning | How you might compute it (conceptually)
S1 | Topic drift | Compare the user’s recent text to how the chat started (or a stated goal). If they wander far from that, S1 goes up.
S2 | Sensitive-looking replies | The assistant’s answer looks like it contains patterns you care about (fake email shapes, “API key” wording, etc.). This means “flag for review,” not “we proved a leak.”
S3 | Refusal tone shifting | Track refusal-style phrases in the assistant’s answers over time. If refusals seem to soften late in the thread, S3 captures that shape.
CRA | Overall session risk | A weighted sum of S1, S2, and S3, plus a small extra bump if the user or assistant text looks like prompt injection playbooks.

Example weights we used: 35% S1, 45% S2, 20% S3.

Rule of thumb: If you cannot explain a score in one short sentence to a product manager, do not use it to auto-block users.

Hard Guardrails = Simple, Fast, “No”

Hard guardrails are rules, not vibes. They should be cheap and run before you waste tokens.

Examples:

- Max request size: reject giant payloads (HTTP 413).
- Rate limits: cap requests per IP so one client cannot drain your budget (429).
- Known-bad phrases: block obvious “ignore all previous instructions” junk (400).
- “Don’t paste secrets”: block prompts that look like “here is my SSN” (400) with a clear error.
- Lock down outputs: if your product only allows certain actions, check model output and tool calls against an allowlist before anything runs.

These are not CRA. They are basics. CRA sits beside them.

Figure 3: Hard = block or validate. Soft = warn, log, nudge.

Soft Guardrails = CRA-Friendly, “Heads Up”

Soft means: warn, log, maybe show a banner, not silent blocking.

After a response, the API can add fields such as:

- cra_soft_notices: short text for humans (“high drift”, “sensitive-looking wording”, …).
- cra_signals: numbers for debugging: S1, S2, S3, CRA, turn count.

Why start soft: Rules and heuristics misfire. A user might ask for fake email examples for a demo; S2 might spike on purpose. That is why the score is a signal, not proof.

Bonus: Cache Duplicate Questions (Save Money)

If someone double-clicks Send or retries the same text, do not call the model twice.

Cache key idea:

    normalize(question) + mode + endpoint

Cache the JSON answer for a few minutes.
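A minimal sketch of that cache key and a short-lived answer cache follows. It assumes an in-memory dictionary and a five-minute default TTL; the `cache_key` and `AnswerCache` names, the whitespace and case normalization, and the SHA-256 hashing step are illustrative choices, not something the article specifies.

```python
import hashlib
import re
import time


def normalize(question: str) -> str:
    """Lowercase, trim, and collapse whitespace so trivial retries match."""
    return re.sub(r"\s+", " ", question.strip().lower())


def cache_key(question: str, mode: str, endpoint: str) -> str:
    """Hash normalize(question) + mode + endpoint into a fixed-size key."""
    raw = "\x1f".join([normalize(question), mode, endpoint])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class AnswerCache:
    """Tiny in-memory TTL cache (assumption: one process, no eviction policy)."""

    def __init__(self, ttl_seconds: float = 300.0, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, answer dict)

    def get(self, key: str):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, answer = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return {**answer, "cached": True}  # mark hits so the UI can say so

    def put(self, key: str, answer: dict) -> None:
        self._entries[key] = (self._clock() + self.ttl_seconds, dict(answer))
```

Hashing the key means the raw question never has to appear in cache logs, which lines up with the advice to prefer hashes over raw text.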
Mark responses with something like cached: true so the UI can say “from cache.” Browser Tip: Don’t Mix Up “New Chat” and Old Intent If S1 uses “first message of this session” as the anchor, browser storage can fool you: a new tab can look like a new thread while an old “first message” is still stored. Fixes: Store the anchor per session_id, not one global value.Expire or rotate the browser session after idle time so deploys and stale tabs do not reuse the wrong anchor. Telemetry vs. Guardrails (Two Different Jobs) TelemetryGuardrailJobMeasure and learnBlock or change behaviorWhen it hurts youToo many logs, privacyFalse positives, angry usersCRAGood fitUse soft first; hard only after review In logs, avoid raw secrets. Prefer hashes, lengths, and labels (channel, product area). Three Lines for Your Security Reviewer CRA is about conversation behavior over time, not a replacement for database security or tool-permission design.Labels for “bad session” are rare in the real world — use CRA to prioritize review, not as automatic guilt.If weights are public, people might game them — keep basic hard rules and spot checks anyway. Rollout Order (Keep It Boring) Ship hard limits (size, rate, obvious injection, output checks).Add session logging with safe IDs.Show soft notices only inside internal tools first.Tune thresholds on real traffic.Only then add hard session actions (pause tools, re-auth, etc.). Takeaway One-message checks are not enough for long chats. CRA gives you a simple story and a small set of session scores. Hard rules stop obvious abuse; soft CRA helps you see drift before it becomes an incident. Start with telemetry. Add blocking only when you understand the false positives. About the author: Sanjay Mishra is author of two books, The SQL Universe and Oracle Database Performance Tuning: A Checklist Approach. His research spans RAG architectures, NL2SQL, LLM safety, and enterprise AI governance, with work published in IEEE Access, Springer LNNS, and SSRN. 
He speaks regularly at universities and industry events on applied AI and data engineering. Tags / topics: #LLM #Security #Guardrails #Observability #OpenAI #Architecture #Chatbots
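The combined score from "Three Simple Scores + One Total" and the soft fields from "Soft Guardrails" can be sketched in a few lines. This is a minimal illustration, not the author's implementation: the weights are the ones quoted above, but the size of the injection bump, the notice threshold, and the names `cra_score`, `SessionTelemetry`, and `record_turn` are assumptions made for the example. How S1, S2, and S3 are computed is left out; they are passed in as numbers between 0 and 1.

```python
from dataclasses import dataclass, field

# Weights quoted in the article: 35% S1, 45% S2, 20% S3.
WEIGHTS = {"s1": 0.35, "s2": 0.45, "s3": 0.20}
INJECTION_BUMP = 0.10    # assumption: the article only says "a small extra bump"
NOTICE_THRESHOLD = 0.60  # assumption: a placeholder to tune on real traffic


def cra_score(s1, s2, s3, injection_like=False):
    """Weighted sum of S1, S2, S3 (each expected in [0, 1]), capped at 1.0."""
    for name, value in (("s1", s1), ("s2", s2), ("s3", s3)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    total = WEIGHTS["s1"] * s1 + WEIGHTS["s2"] * s2 + WEIGHTS["s3"] * s3
    if injection_like:
        total += INJECTION_BUMP
    return min(total, 1.0)


@dataclass
class SessionTelemetry:
    """Per-session memory: scores and counters only, no raw transcript."""
    session_id: str
    turn_count: int = 0
    cra_history: list = field(default_factory=list)

    def record_turn(self, s1, s2, s3, injection_like=False):
        """Update the session after an assistant reply and return the soft fields."""
        cra = cra_score(s1, s2, s3, injection_like)
        self.turn_count += 1
        self.cra_history.append(cra)
        notices = []
        if s1 >= NOTICE_THRESHOLD:
            notices.append("high drift")
        if s2 >= NOTICE_THRESHOLD:
            notices.append("sensitive-looking wording")
        return {
            "cra_soft_notices": notices,
            "cra_signals": {
                "S1": s1, "S2": s2, "S3": s3,
                "CRA": round(cra, 3),
                "turn_count": self.turn_count,
                "version": "cra_telemetry_v1",
            },
        }


session = SessionTelemetry(session_id="demo-session")
early = session.record_turn(s1=0.1, s2=0.0, s3=0.0)
late = session.record_turn(s1=0.8, s2=0.7, s3=0.5, injection_like=True)
print(early["cra_signals"]["CRA"], late["cra_signals"]["CRA"], late["cra_soft_notices"])
```

The returned dictionary only warns and reports; nothing here blocks a request, which matches the "start soft" advice.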

By Sanjay Mishra

From ETL to Lakeflow: Shifting to a Declarative Data Paradigm

If you've worked on a data platform for more than a few years, you've almost certainly built the same pipeline twice. First, the way the team wrote pipelines in 2019: a notebook here, a Python script there, an Airflow DAG to glue it all together, and a long document explaining the order things had to run in. Then the rewrite, two years later, when somebody quit, and nobody could remember why a particular task had a sleep(180) in it.

Lakeflow is Databricks' answer to that pattern, and the shift it's pushing for is bigger than the marketing makes it sound. It isn't a new orchestrator. It's a move from imperative pipelines, where you write the steps, to declarative pipelines, where you write the destination and let the engine figure out the steps. What follows is the practical version of that shift: what's actually different, where the gains are real, and how to migrate without ending up with a half-converted lakehouse.

1. The Imperative ETL Trap: Why Traditional Pipelines Are Hitting a Wall

Imperative ETL is a fancy name for the way most pipelines are still written: a sequence of steps, hand-ordered, run on a schedule. It works fine until it doesn't, and the failure modes are remarkably consistent across teams I've worked with:

- The DAG outgrows its author. The person who wrote the original 30-task Airflow DAG moves teams. The next engineer is afraid to delete anything because they can't tell which tasks are still needed.
- Backfills are surgical operations. Re-running yesterday means manually figuring out which downstream tables are stale, in what order. Half the team's tribal knowledge lives in Slack threads about backfills.
- Quality checks are bolted on. Data quality lives in a separate framework, often a separate codebase, often run by a separate team. By the time a check fails, the bad data is already in the warehouse.
- Lineage is a slide in a deck. Whatever lineage exists was drawn by hand for a quarterly review and was out of date the day after.

None of these are bugs in the imperative model. They're features of it. When you write the steps, you own the steps, including all the cross-task assumptions the engine doesn't know about.

2. What "Declarative" Actually Means in Lakeflow

Declarative is one of those words that gets used loosely. In Lakeflow Pipelines, it has a specific, narrow meaning: you describe each table's logical definition (its source query, its expected schema, its quality rules), and the engine determines execution. It picks the order. It decides which tables are streaming and which are batch. It scales the cluster. It figures out incremental processing. It produces lineage automatically because lineage is now a derived property of the dependency graph it built for you.

What it isn't:

- It isn't "low-code." You're still writing SQL or PySpark. The thing that's gone is the orchestration boilerplate around it.
- It isn't a magic upgrade for any pipeline. Pipelines that genuinely need procedural logic (multi-step API calls with branching, complex pre/post-processing) still belong in Lakeflow Jobs (the orchestrator) or even external code, called from the pipeline.
- It isn't free. There's a learning curve in stopping yourself from writing the steps you used to write. The first month, most teams over-specify.

The mental shift: stop describing how the data should flow. Describe what each table is. Lakeflow figures out the flow.

3. The Lakeflow Architecture: Connect, Pipelines, Jobs

Lakeflow is three components that share one governance layer (Unity Catalog). They map roughly onto the three traditional layers of a pipeline (ingestion, transformation, orchestration) but with the imperative wiring removed.

Figure 1. Lakeflow's three components on top of Unity Catalog. Pipelines is the declarative core; Connect feeds it, Jobs schedules it.

A few practical points about this picture.
Lakeflow Connect is where managed connectors live (Salesforce, Workday, Postgres CDC, and a steadily growing list); it's the part you reach for instead of writing yet another ingestion script. Lakeflow Pipelines is where the declarative paradigm actually lives; every other component is conventional. And Lakeflow Jobs is the part that looks most like Airflow: task graphs, retries, alerts. The trick is that the things inside a Pipelines task aren't tasks themselves. They're table definitions, and the engine builds the internal DAG from their dependencies.

4. Translating an Imperative Pipeline to a Declarative One

The clearest way to feel the difference is to look at the same logic written both ways. Imagine a small bronze→silver→gold pipeline for transactions: ingest raw files, deduplicate, then aggregate to daily totals.

4a. The imperative version (notebook + Airflow style)

```python
# bronze.py
df = spark.read.json("s3://landing/txns/")
df.write.format("delta").mode("append").saveAsTable("bronze.txns")

# silver.py -- runs after bronze finishes
raw = spark.table("bronze.txns")
clean = (raw.dropDuplicates(["txn_id"])
            .filter("amount IS NOT NULL"))
clean.write.format("delta").mode("overwrite").saveAsTable("silver.txns")

# gold.py -- runs after silver finishes
agg = (spark.table("silver.txns")
         .groupBy("ingest_date", "account_id")
         .sum("amount")
         .withColumnRenamed("sum(amount)", "daily_total"))
agg.write.format("delta").mode("overwrite").saveAsTable("gold.daily_totals")

# airflow_dag.py -- the part that actually controls execution
bronze_task >> silver_task >> gold_task
```

4b. The same logic, declared in a Lakeflow Pipeline

```python
import dlt
from pyspark.sql.functions import sum as _sum


@dlt.table(
    name="bronze_txns",
    comment="Raw transactions landed from S3.",
)
def bronze_txns():
    # Auto Loader picks up new JSON files incrementally.
    return (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("s3://landing/txns/"))


@dlt.table(name="silver_txns", comment="Deduplicated, validated transactions.")
@dlt.expect_or_drop("non_null_amount", "amount IS NOT NULL")
# Warn-only check that the key is present; uniqueness comes from dropDuplicates below.
@dlt.expect("non_null_txn_id", "txn_id IS NOT NULL")
def silver_txns():
    return (dlt.read_stream("bronze_txns")
            .dropDuplicates(["txn_id"]))


@dlt.table(name="gold_daily_totals")
def gold_daily_totals():
    return (dlt.read("silver_txns")
            .groupBy("ingest_date", "account_id")
            .agg(_sum("amount").alias("daily_total")))
```

Two things vanished in the rewrite. There is no DAG file, because the dependencies are inferred from dlt.read / dlt.read_stream calls. There is no separate data quality framework; quality lives next to the table definition, where it belongs. The engine decides what's streaming and what's batch from the calls themselves; bronze is a stream, silver is a stream of the bronze stream, gold is a batch over silver. None of that ordering is in the code I wrote.

5. Quality, Lineage, and Operational Visibility for Free

The expectations decorators above (@dlt.expect, @dlt.expect_or_drop, and the stricter @dlt.expect_or_fail) are not just convenience syntax; they become first-class objects in the pipeline.
Every run produces a per-expectation pass/fail count, queryable directly:

```sql
-- Passed and failed rows per expectation, per update, last 7 days.
-- Expectation metrics are nested in the JSON `details` column of
-- flow_progress events, so they are unpacked before aggregating.
SELECT
  origin.update_id        AS pipeline_run_id,
  e.dataset               AS dataset,
  e.name                  AS expectation_name,
  SUM(e.passed_records)   AS passed_records,
  SUM(e.failed_records)   AS failed_records
FROM (
  SELECT
    origin,
    explode(from_json(
      details:flow_progress.data_quality.expectations,
      'array<struct<name: string, dataset: string, passed_records: int, failed_records: int>>'
    )) AS e
  FROM event_log("<pipeline-id>")
  WHERE event_type = 'flow_progress'
    AND timestamp >= current_timestamp() - INTERVAL 7 DAYS
)
GROUP BY origin.update_id, e.dataset, e.name;
```

Lineage shows up automatically in Unity Catalog, both the table-level edges (gold_daily_totals depends on silver_txns) and column-level edges (gold's daily_total derives from silver's amount). Operationally, this is the change that has the largest day-to-day impact: when somebody asks "what does this column mean and where did it come from," you stop having to guess.

What this replaces: Great Expectations runs scheduled separately, OpenLineage stitched together by hand, and a homegrown observability dashboard reading task logs. All three of those projects either go away or shrink dramatically.

6. Migration Strategy: How Teams Actually Move Off Imperative Pipelines

I've not seen a successful big-bang migration. The pattern that works is layered:

Phase 1: New pipelines only

Make Lakeflow Pipelines the default for any new pipeline. This sounds obvious; the discipline is in saying no when somebody wants to add "just one more" Airflow DAG to the imperative side because it's faster this week.

Phase 2: Convert the painful ones

Pick the existing pipelines that hurt the most: the ones with the longest backfill stories, the most ad-hoc quality checks, the worst lineage gaps. Those are the ones where the declarative model pays for the rewrite cost fastest. Don't start with the easy ones; their owners won't thank you for the disruption.

Phase 3: Retire the orchestration boilerplate

Once a critical mass of pipelines has moved over, you can shrink (or in many cases delete) Airflow setups, custom dependency-tracking tools, and the side projects that grew up around imperative ETL.
This is the phase where the cost savings actually show up in headcount and infrastructure bills.

| Migration step | Effort | Watch out for |
|---|---|---|
| New pipelines on Lakeflow | Low | Team momentum: easy to revert to old patterns. |
| Convert the top 3 painful pipelines | Medium | Different streaming/batch semantics in expressed dependencies. |
| Move expectations off external DQ tools | Medium | Existing alerting wired to the old framework. |
| Retire imperative orchestrator | High | External callers (BI tools, ML jobs) that triggered DAGs directly. |

7. Where Declarative Still Hurts: Honest Limitations

I'd be lying if I said this was free. The places where the declarative model still bites:

- Procedural logic doesn't fit. If your "pipeline" is really a sequence of API calls with branching error handling, that's a Lakeflow Job (or external code), not a declarative table.
- Cross-pipeline orchestration is its own thing. Lakeflow Pipelines builds the DAG inside a pipeline. If you need pipeline A to wait for pipeline B, you still need Lakeflow Jobs above them.
- Debugging shifts from steps to definitions. When something is wrong, you're not stepping through a script; you're reading the event log and figuring out which expectation or upstream table caused it. The tooling is good; the muscle memory is different.
- Cost can surprise you. Auto-scaling on a misbehaving streaming source has the same risk it always has. Set max workers thoughtfully on day one; don't leave it to defaults.

Conclusion

The shift to declarative pipelines isn't really about syntax. It's about who owns the boring parts. In an imperative pipeline, the team owns the order, the retries, the lineage, the quality checks, and the cluster scaling, and pays in headcount when any of those break. In a declarative pipeline, those become properties of the engine, and the team owns the part that's actually interesting: the table definitions and the business logic.

Lakeflow is the cleanest implementation of that idea I've used in production, and the teams I've watched migrate haven't asked to go back.
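To make the idea of an inferred DAG from section 4 concrete, here is a toy sketch in plain Python. It is not how Lakeflow works internally and uses no Databricks API: the `declared_reads` mapping is written by hand to mirror the three tables in the example, and `execution_order` and `stale_after` are hypothetical helper names. It only shows that once each table states what it reads, both the run order and the backfill question ("which downstream tables are stale?") fall out of the graph. It assumes Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter

# Hand-written stand-in for what the engine infers from dlt.read / dlt.read_stream:
# table name -> set of tables it reads from.
declared_reads = {
    "bronze_txns": set(),
    "silver_txns": {"bronze_txns"},
    "gold_daily_totals": {"silver_txns"},
}


def execution_order(reads):
    """Return the tables in an order where every table comes after its inputs."""
    return list(TopologicalSorter(reads).static_order())


def stale_after(changed, reads):
    """Return every table downstream of `changed`, i.e. what a backfill must recompute."""
    stale = set()
    frontier = [changed]
    while frontier:
        current = frontier.pop()
        for table, inputs in reads.items():
            if current in inputs and table not in stale:
                stale.add(table)
                frontier.append(table)
    return stale


print(execution_order(declared_reads))
print(sorted(stale_after("bronze_txns", declared_reads)))
```

In the imperative version, both answers live in the Airflow DAG file and in people's heads; here they are derived from the definitions alone.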

By Seshendranath Balla Venkata

How to Build a Local LLM Agent to Automate Work List Generation from Monthly Reports (With Jira Integration)

XB Software's management team spent hours manually extracting work items (“bug fix”, “released version 1”, etc.) from dozens of developer reports. The task was repetitive, error-prone, and a security risk when using cloud-based AI tools, since it means exposing internal activity to external servers. To solve this, we built a local LLM-powered agent that runs entirely on our own servers, normalizes chaotic report data, filters out useless noise, enriches descriptions from Jira, and generates a clean list of actual accomplishments. In this article, we break down the architecture and explain why a CPU-only, on-premise approach is practical for enterprise clients who prioritize data privacy.

The Problem: Manual Work List Generation Is Slow, Inconsistent, and Insecure

Usually, our managers followed the same routine: collect a month’s worth of developer reports, manually scan through hundreds of entries, and pick out the items that actually represented completed work. This process was straightforward but flawed.

The first issue was data quality. Developers write reports in wildly different formats. Some include detailed Jira ticket IDs and descriptions; others are cryptic one-liners like “fixed issue”. When a manager who wasn’t deeply involved in the project later reviews these reports, the meaning is often lost. What does “adjusted header” refer to? Which feature did “refactored code” touch? What we really needed was an AI-powered task management approach that could process this unstructured data automatically.

The second issue was duplicate work. Managers would occasionally include tasks that had already been declared in previous months, creating overlaps. Another example is a task that spans several days. In this case, the same activity could be logged repeatedly, producing many near-identical entries. There was no automated way to compare new reports against historical data.

The third issue was security.
Initially, we experimented with feeding entire monthly reports into ChatGPT, asking it to clean up the data and suggest a final list. It worked reasonably well, but we were handing over a full month of internal project activity to a cloud service. For many enterprise businesses, especially those in finance or healthcare, that level of exposure is unacceptable.

The Solution: A Secure, On-Premise AI Agent for Task Extraction from Reports

Our approach was to implement a console-based application that converts reports into tasks automatically. It runs on our internal server, triggered by a cron job (or an optional API call) at the end of each monthly reporting cycle. The AI agent processes raw reports for each active project, applies a series of transformations, and outputs a polished list of work items.

The entire pipeline runs on a CPU-only server using Ollama to serve a local instance of the Gemma 4 E2B model. For embedding generation (used in duplicate detection), we use the small nomic-embed-text model, which is only a few hundred megabytes in size.

Here’s a high-level view of the process flow:

Let’s walk through each stage in detail.

1. Normalization: Making Chaos Readable

A single project might receive 80+ individual reports per month with varying levels of detail. The first task for our AI agent was to normalize these disparate inputs into a consistent, machine-readable format. This step alone turns a jumble of free-form text into structured data that the rest of the pipeline can reliably process.

2. Chunking: Working Within Token Limits

This is where we hit our first major technical constraint. Running on CPU via Ollama, our Gemma 4 model is limited to a context window of 4,096 tokens. That’s not a lot. A single month of reports from a busy project can easily exceed that.

We solved this by chunking. The AI system calculates the approximate token count of the combined report text and splits it into batches of about 20 reports each.
This ensures that the LLM never runs out of context space and that each chunk receives full attention. Within each chunk, we also further split entries that contain multiple tasks in a single line (e.g., “Did A, did B, did C”). After this splitting, 22 raw reports became 94 individual work items in one of our test runs.

3. Jira Enrichment: Adding Missing Context

One of the most valuable features of our AI agent is its ability to automatically fetch additional context from Jira. When the system detects a Jira ticket ID in a report, it calls the Jira API to retrieve the ticket description.

Developers often write terse reports assuming the ticket ID is enough. But when that report later appears as “AAA-123 – done”, it tells the reader nothing. By pulling the full, manager-written description from Jira, our AI agent replaces the vague entry with a clear, professional summary of what was actually accomplished.

4. Filtering Out the Noise

Not every report entry is worth including. Generic statements like “working on…” or “following up” don’t convey meaningful work. We built a bad-word filter, one of the key components of our intelligent document processing (IDP) pipeline. It flags entries containing these vague phrases. The LLM processes each chunk and identifies entries that match our exclusion list.

In our test, this filter removed 69.1% of entries, and only 29 items out of 94 survived the cut. What remained were concrete, specific descriptions of completed tasks.

5. Selecting the Best Candidates

Once we have a clean set of candidates, we need to choose the top N entries to present. The number N varies by project and is stored in our internal reporting database. To account for further filtering in the next step, we typically select a larger pool, say, 80 items.

6. Vector Duplicate Detection: Ensuring We Never Repeat Ourselves

This is the secret sauce that prevents duplicate entries.
Before finalizing the list, the AI agent compares each candidate against a historical database of all work items we’ve ever submitted for that project. Here’s how it works:

1. Embedding generation. Each work item is converted into a vector (a list of numbers) using the nomic-embed-text model. This vector captures the semantic meaning of the text.
2. Similarity calculation. The system compares the new candidate’s vector against the vectors of all previously stored data for that project.
3. Threshold decision. If the similarity score exceeds 0.85 (85%), the candidate is flagged as a duplicate and removed. This threshold catches not just exact matches but also near-duplicates where the phrasing or word order has changed while the underlying idea remains the same.

The historical data is stored in a lightweight PostgreSQL table with just a few fields: project_id, text (the final description), embedding (the vector), and created_at (date of creation).

After duplicate removal, we’re left with a set of truly unique, high-quality work items. These are then formatted for final delivery to the project manager.

Real-World Performance: What a Test Run Tells Us

Let’s walk through an actual test run to see the numbers in action. These test run results demonstrate how an AI report analysis tool can summarize reports into tasks even with noisy, inconsistent input.

| Stage | Items in | Items out | Reduction |
|---|---|---|---|
| Raw reports | 22 | - | - |
| After line splitting | - | 94 | - |
| Bad-word filter | 94 | 29 | 69.1% removed |
| Duplicate detection | 29 | 16 | 44.8% removed |

Technical Deep Dive: Why CPU-Only Deployment Works

One of the most common objections to running local LLMs is the perceived need for expensive GPU hardware. We deliberately chose a CPU-only deployment to keep costs manageable and to prove that on-premise AI doesn’t require significant infrastructure investments.

Model Selection: Gemma 4 E2B

We evaluated several local models and settled on Gemma 4 E2B. Here’s why:

- Size: At 5 billion parameters, it fits comfortably in RAM without needing a GPU. Our server has extra memory allocated specifically for the model.
- Performance: It’s fast enough for batch processing.
- Quality: The model handles JSON output reliably and follows detailed prompts with minimal hallucination.

NOTE: If you work with a multilingual team, make sure that the model you use understands the target languages natively.

Proper Model Settings and Prompt Engineering for Consistency

Each pipeline stage has its own carefully crafted prompt that includes:

- A clear role definition (e.g., “You are a specialized Data Parsing Engine”).
- Good examples and bad examples of expected output.
- Explicit formatting rules (JSON structure, field names).
- Instructions to avoid creativity (temperature set to 0).

For the bad-word filter, we provide a list of prohibited terms and their synonyms: “working on,” “following up,” “in progress,” “discussed,” etc. The LLM simply acts as a pattern matcher with semantic understanding. It can recognize that “still working on the header” is conceptually similar to “in progress” and flag it accordingly.

Also, for data-processing tasks like this, we always disable “thinking” or “chain-of-thought” modes. Those are useful for complex reasoning but introduce unnecessary variability and output length in structured extraction tasks.

Extra Challenges We Overcame

Challenge 1: LLM unpredictability. Even with the temperature set to 0, LLMs can occasionally produce unexpected output. We added timeout limits to prevent the model from getting stuck in a loop, and we structured our prompts to request strictly formatted JSON that is easy to validate programmatically.

Challenge 2: CPU processing speed. Processing 94 items across multiple LLM calls takes time. We solved this by running the AI agent as an overnight cron job, so speed is never a bottleneck. The manager arrives in the morning to a ready-to-review list.

Why This Approach Matters for Enterprise Clients

1. Complete Data Sovereignty

When you use on-premise Artificial Intelligence solutions, no data ever leaves your infrastructure. The LLM runs locally, the embedding model runs locally, and the historical database resides on your own PostgreSQL server.

2. No Vendor Lock-In

Cloud AI services change their pricing, deprecate models, or alter their APIs without notice. By using local AI agents and Ollama, you retain full control over the entire stack. Need to switch to a different model tomorrow? Just pull a new one and update the configuration.

3. Predictable Costs

The only ongoing cost is the electricity to run the server. There are no per-token API fees, no monthly subscriptions, and no surprise bills after a particularly busy month of processing. For organizations that process thousands of reports annually, the savings are substantial.

4. Customizable to Your Workflow

Because we own the code, we can adapt the pipeline to fit your specific reporting format, integrate with your existing project management tools, and fine-tune the prompts to match your industry’s terminology. This enables using AI for business process automation across diverse sectors, from construction to healthcare.

From Manual Chore to Automated Precision

Before, turning chaotic developer notes into clean reports meant choosing between tedious manual work and exposing sensitive data to cloud AI. Our private AI agent for document analysis offers a third way: secure, on-premise automation. By combining Gemma 4 on standard CPU hardware with vector-based duplicate detection and direct Jira enrichment, we’ve turned hours of monthly review into a hands-off process. The system normalizes vague entries, filters out noise, and guarantees you never repeat a task description.
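The duplicate-detection stage (embedding, similarity, threshold) can be sketched as follows. This is a minimal illustration under stated assumptions, not the production code: the article does not name the similarity measure, so cosine similarity is assumed here; the three-dimensional vectors are made-up stand-ins for real nomic-embed-text embeddings; and `cosine_similarity` and `filter_duplicates` are hypothetical helper names. Only the 0.85 threshold comes from the article.

```python
import math

SIMILARITY_THRESHOLD = 0.85  # threshold quoted in the article


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (assumed metric)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_duplicates(candidates, history, threshold=SIMILARITY_THRESHOLD):
    """Return the texts whose vectors are not too close to anything seen before.

    candidates: list of (text, vector) pairs for this month.
    history:    list of vectors already stored for the project.
    """
    seen = list(history)
    kept = []
    for text, vector in candidates:
        if any(cosine_similarity(vector, old) > threshold for old in seen):
            continue  # flagged as a duplicate
        kept.append(text)
        # Assumption: accepted items also count as "seen" within the same batch.
        seen.append(vector)
    return kept


# Toy vectors standing in for real embeddings.
history = [[1.0, 0.0, 0.0]]
candidates = [
    ("Fixed login redirect bug", [0.99, 0.10, 0.0]),    # near a stored item
    ("Released export to CSV", [0.0, 1.0, 0.0]),        # new
    ("Shipped CSV export feature", [0.05, 0.98, 0.0]),  # near the previous candidate
]
print(filter_duplicates(candidates, history))
```

With a few thousand stored items per project, a linear scan like this is usually fast enough for an overnight job; a vector index would only matter at much larger scale.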

By Sergey Laptick

Can We Build Elite Search Agents Without Massive Industrial RL Pipelines?

Search agents have become essential infrastructure for frontier language models, yet their development remains locked behind corporate walls. These systems need to handle a fundamentally difficult problem: given access to tools and a knowledge base, explore systematically, make smart decisions about which paths to pursue, and know when to pivot strategies. Unlike a human researcher who can draw on intuition and common sense, an LLM agent works from what it's learned during training, which means it needs explicit instruction in how to search well.

The practical stakes are high. Search agents power research tools, web-based reasoning systems, and complex information retrieval. But most breakthroughs happen inside companies with unlimited budgets. Academic researchers hit a wall: the techniques that work are proprietary, the datasets are private, and the computational resources required seem astronomical. This creates a frustrating bottleneck where innovation clusters around industrial research labs, leaving the broader research community unable to experiment, iterate, or contribute meaningfully to the field.

Why Industrial Pipelines Felt Inevitable

The prevailing wisdom emerged naturally from how major AI labs approached agent training. They borrowed techniques from large language model development: start with massive pre-training to build foundational knowledge, apply continuous pre-training to adapt that foundation to new domains, fine-tune on supervised examples to teach specific behaviors, then polish everything with reinforcement learning to optimize against reward signals. Each stage supposedly unlocks something the previous stage couldn't reach.

The logic seemed bulletproof. If you want frontier-level capabilities, you need frontier-level methods and resources. Pre-training builds knowledge. Continuous pre-training specializes it. Supervised fine-tuning teaches specific skills. Reinforcement learning optimizes for actual performance. Remove any link in this chain, and you'd expect degradation.

This assumption led to a clear conclusion: building state-of-the-art search agents required industrial-scale infrastructure. Tongyi DeepResearch, for example, achieved strong performance through exactly this pipeline, spending enormous computational resources across all four optimization stages. For any academic team or resource-constrained organization, this seemed like an insurmountable barrier.

The Dataset Design Revolution

Then came a simpler observation: what if the bottleneck wasn't the algorithm, but what data you fed it?

The researchers behind OpenSeeker-v2 noticed something crucial. Most work on agent training focused on optimization techniques, assuming the data was a fixed quantity. But what if the data itself could be fundamentally restructured? What if you could take the same training paradigm (simple supervised fine-tuning) and make it substantially more powerful just by changing which trajectories you used as examples?

This insight reframes the entire problem. Instead of asking "how do we squeeze more signal out of expensive optimization," ask "what makes a trajectory worth learning from?" Some trajectories teach the agent to think strategically. Others are lucky guesses that teach nothing. Some expose the agent to decision points where multiple tools could apply. Others are straightforward execution of a predetermined path.

The team introduced three modifications to their data synthesis process, each targeting a specific dimension of training data quality.

Scaling the knowledge graph means agents encounter richer search spaces during training. Instead of a small, constrained domain, they face larger graphs with more branches and exploration options. This prevents agents from memorizing solutions and forces them to develop genuine decision-making principles. A larger knowledge graph means each training trajectory involves more meaningful choices.
Expanding the tool set requires agents to learn judgment. When an agent has only a few tools, it can succeed through trial and error on the same limited options. With a larger toolkit, the agent must actually reason about which tool fits which problem. This teaches generalization rather than reflexes. The agent learns principles of tool selection instead of pattern-matching to familiar scenarios.

Strict low-step filtering focuses on trajectories that require careful planning rather than lucky guesses. A trajectory solving a problem in two steps teaches little about strategic reasoning. A trajectory requiring eight thoughtful steps teaches the agent to think systematically. By filtering strictly for solutions requiring multiple steps, researchers ensured every example was a lesson in strategic thinking, not an accident.

Figure 1: OpenSeeker-v2 achieves state-of-the-art performance within its model scale and paradigm on four representative benchmarks, remarkably accomplishing this via simple SFT and outperforming Tongyi DeepResearch, which is trained via extensive optimization pipelines.

The result was deceptively small: 10.6k training examples. This number matters precisely because it seems impossible. A pre-trained language model might use billions of tokens. Industrial fine-tuning typically involves hundreds of thousands of examples. Yet 10.6k examples, when carefully structured around these three principles, proved sufficient to outperform systems trained with vastly more data and computational resources.

Figure 2: Comparison of average tool call counts across search-agent training data, showing how OpenSeeker-v2 training forced more extensive exploration than baseline datasets.

Testing Against the Real Competition

Theory means nothing without empirical validation. The team tested OpenSeeker-v2 against standardized benchmarks where it faced comparison with systems trained using industrial pipelines.
On BrowseComp, a benchmark testing web search and reasoning about real-time information, OpenSeeker-v2 achieved 46.0% accuracy compared to Tongyi DeepResearch's 43.4%. On BrowseComp-ZH, the same benchmark in Chinese, the gap widened to 58.1% versus 46.7%, demonstrating superior generalization across languages. On Humanity's Last Exam, a genuinely difficult benchmark requiring deep reasoning, OpenSeeker-v2 scored 34.6% to Tongyi's 32.9%. On xbench, a comprehensive benchmark of search capabilities, the difference was 78.0% versus 75.0%.

These aren't marginal victories achieved through luck or benchmark overfitting. They're consistent wins across diverse evaluation metrics, with particularly striking results on the multilingual benchmark. A 30B model trained only with supervised fine-tuning on 10.6k examples outperformed a system built with "heavy CPT+SFT+RL pipeline," to quote the paper's own comparison.

The significance of this finding inverts the conventional hierarchy. In AI development, more resources usually beat fewer resources. Better optimization techniques usually beat simpler ones. Yet here, a system constrained by deliberate dataset curation beat a system built with computational abundance. This suggests the constraint wasn't actually computational or algorithmic at all. It was conceptual: understanding what makes training data actually teach something valuable.

Why This Opens Doors for Everyone

The deeper implication extends beyond OpenSeeker-v2's specific numbers. The research demonstrates that a different path to frontier capabilities exists. Industrial teams with unlimited budgets can always outspend competitors. But a discovery that "data curation beats computational resources" shifts the entire economic structure of AI development. If you're thoughtful about which 10,000 examples you use, you don't need billion-dollar infrastructure. You need domain expertise, careful thinking, and clear principles about dataset design. This is something accessible to academic teams, startups, and researchers in resource-constrained regions.

The work also sits in a broader context. Earlier approaches like OpenResearcher explored fully open pipelines for agent research, while Points Seeker examined multimodal search agents. OpenSeeker-v2's contribution is orthogonal: it shows that even within simpler architectures and paradigms, strategic dataset design enables frontier performance. This connects to broader observations about deep information seeking, suggesting that search capability improvements come from better data and clearer reasoning structures, not just more compute.

Accessibility matters here because it enables reproducibility. Unlike industrial systems trained with proprietary methods on private data, OpenSeeker-v2 is open-sourced with transparent methodology. The community can examine it, build on it, and improve the dataset design principles. This creates a feedback loop where the field collectively discovers what makes training data valuable.

The research also opens new questions. Can these curation principles apply to other domains beyond search agents? Does data quality multiply the efficiency of any LLM training task? Could other research groups develop improved versions of OpenSeeker-v2 by applying fresh insights about trajectory design? These questions now seem answerable rather than theoretical.

Most importantly, the work reshapes how the field thinks about scaling. Sometimes the bottleneck in AI development isn't algorithmic innovation or computational power. It's understanding what signal matters most. OpenSeeker-v2 teaches that lesson in a way the broader research community can actually apply, not as a one-off engineering achievement but as a principle about how to think about training data.

By mike labs

The Repo Tracker: Automating My Daily GitHub Catch-Up

We all have that daily routine: opening a dozen browser tabs to check the health and progress of our favorite open-source projects. For me, it’s keeping a close eye on rapidly evolving ecosystems like Docling and the watsonx Agent Development Kit (ADK). Eventually, the manual refreshing had to stop. I decided to build a custom application to automate this workflow, or more accurately, a dedicated Agent.

Before you write off “Agent” as just another industry buzzword, consider this: true agency isn’t just about complex LLM reasoning; it’s about autonomous execution. An agent bridges the gap between manual human effort and automated consistency, stepping in to handle what used to require our click-by-click attention.

Here is how I built an automated companion to keep my finger on the pulse of the tech stacks that matter: by taking over the repetitive task of repository tracking, this tool operates as a functional agent in my development ecosystem. In this post, I’ll break down how it works and how you can implement it.

Implementation

In the following section, I’ll walk through the building blocks of the agent.

Building Blocks: The Tech Stack

To keep the footprint light, local, and efficient, the tool is built on a streamlined, minimal-dependency stack:

Python 3: Handles the core application logic, parsing repository data, and orchestrating updates.
SQLite: Acts as a lightweight, serverless database engine to persist repository states and track changes between runs.
Bash: Bridges the application and the operating system, wrapping the execution logic into a clean, reproducible script.
macOS & cron: Leverages native system utilities to handle automation and schedule regular execution intervals without relying on heavy third-party orchestrators.
The Core Application

Markdown
github-check/
├── github_monitor.py          # Main monitoring application
├── web_viewer.py              # Web dashboard application (Flask)
├── github_monitor.db          # SQLite database (auto-created)
├── requirements.txt           # Python dependencies (requests, flask)
├── .gitignore                 # Git ignore rules (filters .env, _* folders)
├── .gitattributes             # Git attributes configuration
├── LICENSE                    # Project license
├── README.md                  # User documentation with diagrams
│
├── Docs/
│   ├── Architecture.md        # Technical architecture
│   └── WebViewer.md           # Web dashboard documentation
│
├── scripts/
│   ├── schedule_monitor.sh    # Cron scheduler script
│   ├── github-push.sh         # Git push automation script
│   ├── killer-port.sh         # Port management utility
│   └── hard-killer-port.sh    # Force kill port utility
│
├── input/
│   └── repositories.txt       # Repository list (owner/repo format)
│
├── output/
│   ├── logs/                  # Execution logs (from cron)
│   │   └── YYYYMMDD_HHMMSS_monitor.log
│   └── YYYYMMDD_HHMMSS_report.txt   # Generated reports
│
├── templates/
│   └── index.html             # Web dashboard HTML template
│
└── static/
    ├── css/
    │   └── style.css          # Dashboard styles (dark theme)
    └── js/
        └── app.js             # Dashboard JavaScript (Chart.js, API calls)

Core Initialization and State Management

The application uses an object-oriented approach via the GitHubMonitor class. Upon instantiation, it initializes its own database using sqlite3. It creates two core tables, repositories and updates, with indexes on frequently queried fields (repo_name and update_timestamp) to keep lookups quick as your monitored list grows.

Python
def _init_database(self):
    """Initialize SQLite database with required schema."""
    conn = sqlite3.connect(self.db_path)
    cursor = conn.cursor()

    cursor.execute('''
        CREATE TABLE IF NOT EXISTS repositories (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            repo_name TEXT UNIQUE NOT NULL,
            first_checked_at TEXT NOT NULL,
            last_checked_at TEXT NOT NULL
        )
    ''')

    # ... updates table creation omitted for brevity ...

    cursor.execute('''
        CREATE INDEX IF NOT EXISTS idx_repo_name
        ON repositories(repo_name)
    ''')

    conn.commit()
    conn.close()

Resilient API Communication

To talk to GitHub, the application uses a persistent requests.Session(). It works with unauthenticated requests, and if a personal access token (GITHUB_TOKEN) is present in the environment, it attaches the token to every request to get the higher authenticated API rate limit. It also handles specific HTTP status codes explicitly (such as 403 for rate limits and 404 for missing repos) and guards against network hangs with a timeout.

Python
self.github_token = os.getenv('GITHUB_TOKEN')  # Optional: for higher rate limits
self.session = requests.Session()
if self.github_token:
    self.session.headers.update({'Authorization': f'token {self.github_token}'})

# ... Inside _get_repo_info ...
response = self.session.get(url, timeout=10)
if response.status_code == 200:
    return response.json()
elif response.status_code == 403:
    print("✗ Rate limit exceeded. Consider using GITHUB_TOKEN environment variable.")
    return None

Delta Detection Logic

The core engine reads target repositories from a flat file (ignoring comments and whitespace) and loops through them. For each repository, it extracts the API’s pushed_at timestamp. It then checks the database to determine whether the repository is brand new or whether the remote timestamp differs from the last_checked state stored in the DB, and it only counts the change if it falls inside a configurable sliding time window (check_days).
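The sliding-window check described above is not shown in the post, so here is a minimal sketch of how it might work. The function names has_recent_update and is_new_delta are hypothetical stand-ins (the article only names a _has_recent_update method), and the sketch assumes timestamps arrive in GitHub's ISO 8601 form with a trailing Z.

```python
from datetime import datetime, timedelta, timezone


def has_recent_update(pushed_at, check_days=1, now=None):
    """Return True if pushed_at falls inside the last check_days days.

    Hypothetical stand-in for the monitor's _has_recent_update helper.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    # Assumed input format: "2024-05-01T12:00:00Z"
    pushed = datetime.fromisoformat(pushed_at.replace("Z", "+00:00"))
    return now - pushed <= timedelta(days=check_days)


def is_new_delta(pushed_at, last_checked, check_days=1, now=None):
    """A change counts only if it is recent and differs from the stored state."""
    return has_recent_update(pushed_at, check_days, now) and pushed_at != last_checked
```

Passing now explicitly keeps the check deterministic, which makes it easy to test without waiting for real time to pass.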
Python
# Check if repo is in database
exists, repo_id, last_checked = self._is_repo_in_db(repo_name)

if not exists:
    # First time seeing this repo
    repo_id = self._add_repository(repo_name, pushed_at)
    self._log_update(repo_id, repo_name, pushed_at, is_first_run=True)
else:
    # Check if there's a recent update and if it's a new update since last check
    if self._has_recent_update(pushed_at):
        if pushed_at != last_checked:
            self._log_update(repo_id, repo_name, pushed_at, is_first_run=False)
            print(" UPDATE DETECTED!")

Automated Auditing and Reporting

Beyond the real-time stdout logs, the application aggregates its tracked state into a clean, markdown-style historical report. It runs a SQL join to count updates per repository and lists the ten most recent changes across all repositories. The system automatically creates a dedicated output/ directory and writes time-stamped files so that snapshots are preserved for long-term auditing.

Python
# Get all repositories with aggregated update counts
cursor.execute('''
    SELECT r.repo_name, r.first_checked_at, r.last_checked_at,
           COUNT(u.id) as update_count
    FROM repositories r
    LEFT JOIN updates u ON r.id = u.repo_id
    GROUP BY r.id
    ORDER BY r.repo_name
''')

# ... Report file generation ...
if output_file:
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    output_path = f"output/{timestamp}_{output_file}"
    os.makedirs("output", exist_ok=True)
    with open(output_path, 'w') as f:
        f.write(report)

The Bash Script

The schedule_monitor.sh Bash script prepares, executes, and maintains the automated tracking application.

Dynamic Path Resolution

Instead of relying on rigid, hardcoded absolute paths, the script begins by resolving its own location on the filesystem. By using dirname and the BASH_SOURCE shell variable, it anchors itself to the project layout.
This ensures that no matter where the cron daemon triggers the script from, it can always accurately find the target Python application (github_monitor.py) and establish a consistent execution working directory. Automated Logging and Diagnostics Because a background cron job runs without a visual terminal (stdout), tracking down execution errors requires an audit trail. The script handles this by isolating a dedicated logs directory (output/logs) and utilizing a date-and-time string (date +"%Y%m%d_%H%M%S") to generate a unique file for every single runtime iteration. It appends clear timestamp banners marking exactly when a check started and concluded. Environment Validation and Execution Before attempting to launch the monitor, the script safely checks the host machine’s environment for valid runtimes. It runs a quiet check (command -v) to see if python3 or a fallback python command is accessible. If a Python binary is found, it triggers the underlying script, passing down the configurable time-window argument (--days 1) while explicitly routing both standard output and potential error stack traces (2>&1) straight into the active log file. Self-Cleaning Log Retention Running automated tasks indefinitely carries the risk of slowly cluttering local storage with thousands of historical text files. To enforce clean housekeeping, the script concludes its run with an automated garbage-collection routine. It uses the native Unix find command to scan the log directory, isolates any tracking logs older than 30 days (-mtime +30), and automatically purges them from the system. 
Shell
#!/bin/bash
# GitHub Repository Monitor Scheduler
# This script can be used with cron to schedule regular checks

# Configuration
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_DIR="$(dirname "$SCRIPT_DIR")"
PYTHON_SCRIPT="$PROJECT_DIR/github_monitor.py"
LOG_DIR="$PROJECT_DIR/output/logs"
CHECK_DAYS=1

# Create log directory if it doesn't exist
mkdir -p "$LOG_DIR"

# Generate timestamp for log file
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
LOG_FILE="$LOG_DIR/${TIMESTAMP}_monitor.log"

# Run the monitor and log output
echo "=== GitHub Monitor Run: $(date) ===" >> "$LOG_FILE"
cd "$PROJECT_DIR" || exit 1

# Check if Python 3 is available
if command -v python3 &> /dev/null; then
    PYTHON_CMD="python3"
elif command -v python &> /dev/null; then
    PYTHON_CMD="python"
else
    echo "Error: Python not found" >> "$LOG_FILE"
    exit 1
fi

# Run the monitor
$PYTHON_CMD "$PYTHON_SCRIPT" --days "$CHECK_DAYS" >> "$LOG_FILE" 2>&1

# Log completion
echo "=== Completed: $(date) ===" >> "$LOG_FILE"
echo "" >> "$LOG_FILE"

# Optional: Keep only last 30 days of logs
find "$LOG_DIR" -name "*.log" -type f -mtime +30 -delete

exit 0
# Made with Bob

TL;DR: How to Make a Cron Job on a macOS Machine?

There are several ways to do this on macOS (my machine).

The Modern macOS Way (launchd)

launchd uses .plist (XML) files to manage schedules. It is wordier than cron, but it’s the most reliable method on a Mac.

Create a .plist file: open your terminal or a text editor and create a file in ~/Library/LaunchAgents/. Let's call it com.user.myjob.plist.
Add the configuration: paste the following XML into the file. This example is set to run a script every day at 10:30 PM (22:30).
XML
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.user.myjob</string>
    <key>ProgramArguments</key>
    <array>
        <string>/Users/yourusername/scripts/myscript.sh</string>
    </array>
    <key>StartCalendarInterval</key>
    <dict>
        <key>Hour</key>
        <integer>22</integer>
        <key>Minute</key>
        <integer>30</integer>
    </dict>
    <key>StandardOutPath</key>
    <string>/tmp/myjob.out</string>
    <key>StandardErrorPath</key>
    <string>/tmp/myjob.err</string>
</dict>
</plist>

Load and start the job: in the Terminal, tell macOS to look at the new file and start scheduling it:

Shell
launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/com.user.myjob.plist

If you need to stop, unload, or cancel the job, run:

Shell
launchctl bootout gui/$(id -u) ~/Library/LaunchAgents/com.user.myjob.plist

The Classic Way (cron)

If you prefer the classic Linux/Unix crontab style because you already know the syntax, macOS can still do it.

Open the crontab editor (in the terminal; you’ll get an editor such as vim):

Shell
crontab -e

Add your cron syntax: add the job using the standard five-field cron format. For example, to run a script every day at midnight:

Shell
0 0 * * * /Users/yourusername/scripts/myscript.sh

Save and exit!

The Crucial macOS Step for Cron

Because of macOS security restrictions, cron will often fail silently because it doesn’t have permission to access your files. You have to grant it access:

Open System Settings > Privacy & Security > Full Disk Access.
Click the + icon.
Press Cmd + Shift + G and type /usr/sbin/cron, then hit enter.
Toggle the switch to On for cron.

Which one should you choose?

Use launchd if you want your job to run reliably even when your MacBook was asleep or had its lid closed at the scheduled minute; launchd runs a missed calendar job once the machine wakes up.
Use cron if you just need something quick and familiar for a desktop Mac that is always awake.
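If you would rather generate the launchd file than hand-edit XML, Python's standard plistlib module can write it. The sketch below is only an illustration: the helper name build_launchd_plist is made up for this example, and it derives the log paths from the job label (an assumption that differs slightly from the /tmp/myjob.* paths used above).

```python
import plistlib


def build_launchd_plist(label, script_path, hour, minute):
    """Return launchd job XML (bytes) that runs script_path daily at hour:minute."""
    job = {
        "Label": label,
        "ProgramArguments": [script_path],
        "StartCalendarInterval": {"Hour": hour, "Minute": minute},
        # Assumption: log files are named after the label
        "StandardOutPath": f"/tmp/{label}.out",
        "StandardErrorPath": f"/tmp/{label}.err",
    }
    return plistlib.dumps(job)  # XML is plistlib's default output format


plist_bytes = build_launchd_plist(
    "com.user.myjob", "/Users/yourusername/scripts/myscript.sh", 22, 30
)
print(plist_bytes.decode("utf-8"))
```

Write the returned bytes to ~/Library/LaunchAgents/com.user.myjob.plist and load it with the launchctl bootstrap command shown earlier.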
The Database (SQLite)

The repositories Table

This table acts as the registry for the GitHub repositories you choose to track. It records when a repository was first introduced to the monitor and mirrors its remote state by tracking the latest push timestamp.

id (INTEGER PRIMARY KEY AUTOINCREMENT): Unique internal identifier for each repository, used as the primary key.
repo_name (TEXT UNIQUE NOT NULL): The full GitHub identifier in the owner/repository format (e.g., IBM/watsonx-adk or DSUR/docling). The UNIQUE constraint guarantees that a repository cannot be duplicated in the registry.
first_checked_at (TEXT NOT NULL): An ISO 8601 UTC timestamp capturing the exact moment the repository was first indexed by your application.
last_checked_at (TEXT NOT NULL): Stores the latest pushed_at timestamp fetched from the GitHub API. This field is overwritten whenever a new delta/update is detected, serving as the benchmark for future comparisons.

The updates Table

This table functions as a historical append-only ledger. Every time the tool encounters a change (or indexes a repository for the first time), it appends a record here, creating a reliable audit trail of project activity.

id (INTEGER PRIMARY KEY AUTOINCREMENT): Unique identifier for each specific update record.
repo_id (INTEGER NOT NULL): Foreign key referencing repositories(id), establishing a 1:N relationship (one repository can have many logged updates).
repo_name (TEXT NOT NULL): Denormalized repository name to allow quick querying of logs without mandatory joins.
update_timestamp / pushed_at (TEXT NOT NULL): The pushed_at timestamp provided directly by the GitHub API, indicating when the remote change actually occurred.
check_timestamp (TEXT NOT NULL): An ISO 8601 UTC timestamp capturing when your local agent executed and caught the update.
is_first_run (BOOLEAN NOT NULL): A flag (0 or 1) tracking whether this log entry represents the initial discovery of the repository or a subsequent update.
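The post omits the updates table definition, so here is a sketch of what both tables might look like when rebuilt from the column descriptions above. Treat the exact DDL for updates and the helper record_first_sighting as assumptions for illustration, not the project's actual code. An in-memory database keeps the example self-contained.

```python
import sqlite3
from datetime import datetime, timezone

# Schema reconstructed from the column descriptions (updates DDL is assumed)
SCHEMA = """
CREATE TABLE IF NOT EXISTS repositories (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    repo_name TEXT UNIQUE NOT NULL,
    first_checked_at TEXT NOT NULL,
    last_checked_at TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS updates (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    repo_id INTEGER NOT NULL,
    repo_name TEXT NOT NULL,
    update_timestamp TEXT NOT NULL,
    check_timestamp TEXT NOT NULL,
    is_first_run BOOLEAN NOT NULL,
    FOREIGN KEY (repo_id) REFERENCES repositories(id)
);
"""


def record_first_sighting(conn, repo_name, pushed_at):
    """Register a repository and append its first ledger entry (hypothetical helper)."""
    now = datetime.now(timezone.utc).isoformat()
    cur = conn.execute(
        "INSERT INTO repositories (repo_name, first_checked_at, last_checked_at) "
        "VALUES (?, ?, ?)",
        (repo_name, now, pushed_at),
    )
    repo_id = cur.lastrowid
    conn.execute(
        "INSERT INTO updates (repo_id, repo_name, update_timestamp, "
        "check_timestamp, is_first_run) VALUES (?, ?, ?, ?, 1)",
        (repo_id, repo_name, pushed_at, now),
    )
    conn.commit()
    return repo_id


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
repo_id = record_first_sighting(conn, "python/cpython", "2024-05-01T12:00:00Z")
```

Registering the same repo_name a second time raises sqlite3.IntegrityError, which is the UNIQUE constraint doing the deduplication described above.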
Relationship Diagram

The database structure relies on standard relational integrity: each row in updates references exactly one row in repositories through repo_id, so one repository can have many logged updates.

Optimization Indexes

To prevent slowdowns as your tracking history grows over months of automated cron cycles, the database explicitly initializes two performance indexes:

idx_repo_name on repositories(repo_name): Keeps repository names in a sorted B-tree. When the application calls _is_repo_in_db() to check whether a project exists, SQLite performs an O(log n) index lookup instead of an expensive O(n) full-table scan. (The UNIQUE constraint on repo_name already makes SQLite create an implicit index on that column, so the explicit one mainly documents intent.)
idx_update_timestamp on updates(update_timestamp): Optimizes time-series queries by keeping updates ordered by timestamp, which speeds up reports or dashboards that isolate recent changes.

Data Storage Details

Serverless and Local: Because SQLite is an in-process library, the entire database is stored as a single, ordinary cross-platform file (github_monitor.db) directly within your project directory.
Dynamic Typing (Storage Classes): SQLite uses dynamic typing with type affinity. While the schema declares standard SQL types like TEXT and BOOLEAN, dates are stored as ISO 8601 text strings, and SQLite has no separate Boolean storage class, so booleans are stored as integers (0 for false, 1 for true).

The User Interface to Monitor the Results and Access the Repositories

Markdown
# web_viewer.py
Flask App
├── Routes
│   ├── index() -> Dashboard HTML
│   ├── get_stats() -> Statistics JSON
│   ├── get_repositories() -> Repositories JSON
│   ├── get_updates() -> Updates JSON
│   ├── get_timeline() -> Timeline JSON
│   └── get_repository_details(id) -> Repository JSON
│
├── Utilities
│   ├── get_db_connection() -> SQLite connection
│   └── format_timestamp() -> Formatted date string
│
└── Configuration
    ├── DB_PATH = 'github_monitor.db'
    ├── HOST = '127.0.0.1'
    └── PORT = 5001

Beyond the headless automation, the application features a clean, intuitive UI that serves as your central command center. This dashboard provides a clear visual overview of every repository currently being tracked by the agent.
Instead of parsing raw database rows, you can audit your entire tech stack at a glance and see exactly what’s under watch. Even better, it collapses the distance between discovery and action: with a single click inside the UI, you can jump directly to any chosen repository on GitHub the moment you want to investigate a new change.

Python
#!/usr/bin/env python3
"""
GitHub Monitor Web Viewer
A simple Flask-based web application to visualize SQLite database data.
"""

from flask import Flask, render_template, jsonify
import sqlite3
from datetime import datetime
import os

app = Flask(__name__)

# Configuration
DB_PATH = 'github_monitor.db'


def get_db_connection():
    """Create a database connection."""
    conn = sqlite3.connect(DB_PATH)
    conn.row_factory = sqlite3.Row  # rows can be read by column name
    return conn


def format_timestamp(ts_str):
    """Format ISO timestamp to readable format."""
    try:
        if 'T' in ts_str:
            dt = datetime.fromisoformat(ts_str.replace('Z', '+00:00'))
            return dt.strftime('%Y-%m-%d %H:%M:%S UTC')
        return ts_str
    except (ValueError, TypeError):
        # Unparseable or missing value: return it unchanged
        return ts_str


@app.route('/')
def index():
    """Main dashboard page."""
    return render_template('index.html')


@app.route('/api/stats')
def get_stats():
    """Get overall statistics."""
    conn = get_db_connection()
    cursor = conn.cursor()

    # Total repositories
    cursor.execute('SELECT COUNT(*) as count FROM repositories')
    total_repos = cursor.fetchone()['count']

    # Total updates
    cursor.execute('SELECT COUNT(*) as count FROM updates')
    total_updates = cursor.fetchone()['count']

    # Updates today
    cursor.execute('''
        SELECT COUNT(*) as count
        FROM updates
        WHERE date(check_timestamp) = date('now')
    ''')
    updates_today = cursor.fetchone()['count']

    # Most active repository
    cursor.execute('''
        SELECT repo_name, COUNT(*) as update_count
        FROM updates
        GROUP BY repo_name
        ORDER BY update_count DESC
        LIMIT 1
    ''')
    most_active = cursor.fetchone()

    conn.close()

    return jsonify({
        'total_repos': total_repos,
        'total_updates': total_updates,
        'updates_today': updates_today,
        'most_active': dict(most_active) if most_active else None
    })


@app.route('/api/repositories')
def get_repositories():
    """Get all repositories with their update counts."""
    conn = get_db_connection()
    cursor = conn.cursor()

    cursor.execute('''
        SELECT r.id, r.repo_name, r.first_checked_at, r.last_checked_at,
               COUNT(u.id) as update_count
        FROM repositories r
        LEFT JOIN updates u ON r.id = u.repo_id
        GROUP BY r.id
        ORDER BY r.repo_name
    ''')

    repos = []
    for row in cursor.fetchall():
        repos.append({
            'id': row['id'],
            'repo_name': row['repo_name'],
            'first_checked_at': format_timestamp(row['first_checked_at']),
            'last_checked_at': format_timestamp(row['last_checked_at']),
            'update_count': row['update_count']
        })

    conn.close()
    return jsonify(repos)


@app.route('/api/updates')
def get_updates():
    """Get recent updates."""
    limit = 50
    conn = get_db_connection()
    cursor = conn.cursor()

    cursor.execute('''
        SELECT id, repo_name, update_timestamp, check_timestamp, is_first_run
        FROM updates
        ORDER BY check_timestamp DESC
        LIMIT ?
    ''', (limit,))

    updates = []
    for row in cursor.fetchall():
        updates.append({
            'id': row['id'],
            'repo_name': row['repo_name'],
            'update_timestamp': format_timestamp(row['update_timestamp']),
            'check_timestamp': format_timestamp(row['check_timestamp']),
            'is_first_run': bool(row['is_first_run'])
        })

    conn.close()
    return jsonify(updates)


@app.route('/api/repository/<int:repo_id>')
def get_repository_details(repo_id):
    """Get detailed information about a specific repository."""
    conn = get_db_connection()
    cursor = conn.cursor()

    # Get repository info
    cursor.execute('SELECT * FROM repositories WHERE id = ?', (repo_id,))
    repo = cursor.fetchone()

    if not repo:
        conn.close()
        return jsonify({'error': 'Repository not found'}), 404

    # Get updates for this repository
    cursor.execute('''
        SELECT * FROM updates
        WHERE repo_id = ?
        ORDER BY check_timestamp DESC
    ''', (repo_id,))

    updates = []
    for row in cursor.fetchall():
        updates.append({
            'id': row['id'],
            'update_timestamp': format_timestamp(row['update_timestamp']),
            'check_timestamp': format_timestamp(row['check_timestamp']),
            'is_first_run': bool(row['is_first_run'])
        })

    conn.close()

    return jsonify({
        'repository': {
            'id': repo['id'],
            'repo_name': repo['repo_name'],
            'first_checked_at': format_timestamp(repo['first_checked_at']),
            'last_checked_at': format_timestamp(repo['last_checked_at'])
        },
        'updates': updates
    })


@app.route('/api/timeline')
def get_timeline():
    """Get update timeline data for visualization."""
    conn = get_db_connection()
    cursor = conn.cursor()

    cursor.execute('''
        SELECT date(check_timestamp) as date, COUNT(*) as count
        FROM updates
        GROUP BY date(check_timestamp)
        ORDER BY date DESC
        LIMIT 30
    ''')

    timeline = []
    for row in cursor.fetchall():
        timeline.append({
            'date': row['date'],
            'count': row['count']
        })

    conn.close()
    return jsonify(timeline)


if __name__ == '__main__':
    if not os.path.exists(DB_PATH):
        print(f"Error: Database file '{DB_PATH}' not found!")
        print("Please run github_monitor.py first to create the database.")
        exit(1)

    print("=" * 60)
    print("GitHub Monitor Web Viewer")
    print("=" * 60)
    print(f"Database: {DB_PATH}")
    print("Starting server...")
    print("Open your browser at: http://localhost:5001")
    print("Press Ctrl+C to stop")
    print("=" * 60)

    # Use port 5001 to avoid the macOS AirPlay Receiver conflict on port 5000
    app.run(debug=True, host='127.0.0.1', port=5001)
# Made with Bob

So, in the end, we get:

Centralized watchlist: View all monitored repositories instantly in a clean, human-readable dashboard rather than querying the SQLite tables directly.
One-click navigation: Every tracked repository in the UI functions as an active shortcut: clicking a project takes you directly to its GitHub page to review the latest commits or releases.
Configured via Plain Text: Simple and Source-Controlled

The repository watchlist is intentionally kept detached from the core code, stored in a flat, human-readable text file named repositories.txt. This design embraces a "configuration-as-code" philosophy: you don't need to write SQL queries or modify Python variables just to change what you track. You simply list the targets in a standard owner/repo format, one per line.

The application’s parser is built to be forgiving: it automatically skips empty lines and ignores any line prefixed with a #. This allows you to organize your watchlist with custom sections, leave developer notes, or temporarily comment out a project without losing track of it.

Markdown
# GitHub Repositories to Monitor
# Format: owner/repo (one per line)
# Lines starting with # are comments and will be ignored

# Example repositories for testing:
torvalds/linux
microsoft/vscode
python/cpython

# Add your repositories below:
docling-project/docling
ibm/ibm-watsonx-orchestrate-adk
ibm/mcp-context-forge
generative-computing/mellea
containers/podman
podman-desktop/podman-desktop

Conclusion: From Concept to Production in 30 Minutes

What started as a simple, repetitive daily habit (manually refreshing browser tabs to check for updates on critical frameworks like Docling and the watsonx Agent Development Kit) has been transformed into a fully automated, local developer ecosystem. By decoupling the watchlist into a plain-text configuration file and pairing a Python engine with an internal SQLite state ledger, the project removes the manual checking entirely. With an OS-native cron scheduler handling the heavy lifting in the background and a user interface providing one-click navigation to the source, the tool serves as a functional, autonomous agent that keeps my development workflow synchronized with the open-source world.
The most remarkable aspect of this project, however, wasn’t just the architecture. It was the velocity. By collaborating with IBM Bob as an AI-driven development partner, the entire lifecycle of this tool moved from ideation to a production-ready implementation in exactly 30 minutes. From initializing the database schemas and crafting resilient API delta logic to wrapping the application in a self-cleaning bash scheduler, Bob industrialized the code creation process seamlessly. It is a powerful testament to how modern, spec-driven prototyping can compress days of development overhead into a single focused, half-hour session, delivering immediate architectural value without the bloat.

That’s a wrap!

Links

Blog post code repository: https://github.com/aairom/github-check
IBM Bob: https://bob.ibm.com/

By Alain Airom

Deployment Lessons You Only Learn the Hard Way

Over the last two decades, I have shipped a lot of code to live environments. I have disrupted stress testing on Black Friday, locked users out of authentication at 2 AM, and watched a system serving 40 million users break because of a minor change to a configuration file. This is not about being a bad engineer. It is about being practical.

Every senior engineer I respect has a war story. What separates them from those living in chaos is simple: the great ones have seen it before and built their systems around recovery. Not dumb luck. Not heroic saves.

Reliable deployments need three things working together. You need sharp monitoring that detects slow-building problems in seconds. You need rollback strategies you can trigger without blinking. And you need a recovery playbook written before the need arises. I will now walk you through what each of these looks like.

1. Monitoring: See Everything Before Users Do

Nearly every team has monitoring. Yet most teams still overlook outages for 8 to 12 minutes after a deployment. That gap is not caused by a lack of tools. It is caused by false signals.

Over the course of two decades, I have narrowed it down to four metrics that matter for every deployment. Google calls these Golden Signals. I call them the only things worth waking up for.

Failure rate: Not a raw count of failures, but failed requests as a percentage of all requests. In other words, the error rate.
P99 latency: What the slowest one percent of requests experience. Average latency can hide a disaster; P99 cannot.
Traffic uniformity: A sudden drop in traffic is as alarming as an unpredictable burst. Either one might signal that something has gone wrong.
Saturation: CPU, memory, connection pool headroom. How close are you to the cliff?

Set all four of these up as alerts and hook them into your deployment pipeline.
If a sudden spike appears within two minutes of a push, you need to know right away. Below is the Prometheus alert that I use for error rate. Simple. Effective. It alerts me even before the users start complaining.

YAML
- alert: HighErrorRate
  expr: rate(http_errors[5m]) / rate(http_requests[5m]) > 0.05
  for: 2m
  annotations:
    summary: 'Error rate above 5% - check recent deploy'

A 2% threshold during office hours, rising to 5% overnight, is a reasonable starting point; adjust it to suit your traffic patterns. The exact number matters less than having the alert in place.

Teams make the mistake of alerting on every possible event. Alert fatigue is a genuine problem. Within a month, your team will stop paying attention to pages if there are too many false fires. Stick to the four signals listed above. Create alerts that carry real value. During the first ten minutes of a normal deployment, silence the routine warm-up noise. Then watch closely.

2. The Five Rollback Strategies That Actually Work

Rollback is not a single procedure. Teams tend to treat it as a switch they can simply flip. In practice, there are five different methods, and each works best in a specific situation. Picking the wrong one costs time you cannot afford. Learn all five before your next deployment.

Strategy 1: Git Revert

The blunt instrument. Fastest to reach for. Always available. Create a new commit that reverses the change, then push it. The pipeline redeploys the system from there.
Shell
git revert <commit-hash> --no-edit
git push origin main

Use git revert rather than git reset. Revert preserves a clear history of changes. Reset rewrites it. Never rewrite shared branch history under pressure. Time to execute: three to four minutes if your pipeline is fast.

Strategy 2: Blue-Green Switch

You maintain two identical production environments. One is live. One is idle. You deploy to the idle environment. Smoke test it. Then flip your load balancer. To roll back, flip it back. The rollback happens at the speed of a configuration reload.

Shell
# Roll back with one AWS CLI command
aws elbv2 modify-listener \
  --listener-arn $LISTENER_ARN \
  --default-actions Type=forward,TargetGroupArn=$BLUE_TG

Time to execute: thirty seconds. Tradeoff: double the infrastructure cost. Worth it at scale. Evaluate for your budget.

Strategy 3: Feature Flags

The most surgical tool you have. You do not roll back the deploy. You kill a flag. The broken code path stops executing instantly. Everything else keeps running. No pipeline. No infrastructure change.

JavaScript
if (flags.isEnabled('new-checkout-flow', userId)) {
  return newCheckout(cart); // kill this flag to disable
}
return legacyCheckout(cart); // always-safe fallback

Time to execute: ten seconds. I have used this to instantly disable a broken feature for twelve million users without touching a single deployment. Wrap every high-risk code path in a flag. Do it before the deploy.

Strategy 4: Canary Deployment

This one prevents disasters instead of cleaning them up. Ship to one to five percent of traffic. Watch the metrics for fifteen minutes. If they look bad, delete the canary. If they look good, roll out to everyone.
Shell

    # 1 canary pod alongside 19 stable pods = roughly 5% of traffic
    # (assumes both deployments sit behind the same Service)
    kubectl scale deployment api-stable --replicas=19
    kubectl scale deployment api-canary --replicas=1

Your worst case is now that five percent of users saw an issue. Not one hundred percent. Every team that adopts canaries wonders how they shipped without them.

Strategy 5: Config Rollback

Sometimes the problem is not code. It is a setting. Environment variables. Connection pool sizes. Timeout values. Rate limits. These change constantly. They break things in ways that look exactly like code bugs. Keep your config versioned. Keep your secrets in a vault that supports versioned rollback. Know which config change shipped alongside which deploy.

Time to execute: sixty seconds. Most underused rollback in the industry. Add it to your playbook now.

3. Failure Recovery: Write the Playbook Before You Need It

The worst time to figure out your recovery process is during an incident. Your adrenaline is up. Slack is blowing up. Your CEO has sent you a direct message. Your mind is not working properly. That is biology, not a personal failing. Teams that recover within five minutes are not necessarily smarter. They prepared ahead of time.

The Incident Response Loop

Every incident moves through the same five stages. Your job is to move through them quickly.

- Detect (under 2 minutes): Alert fires. On-call engineer acknowledges. Incident channel opens.
- Triage (under 7 minutes): Is this P0 or P1? How many users are affected? Is it the recent deploy?
- Mitigate (under 20 minutes): Stop the bleeding. Rollback, kill a flag, scale up. Users first.
- Resolve (under 60 minutes): Find root cause. Ship permanent fix or confirm rollback holds.
- Review (within 48 hours): Write the post-mortem. Assign action items. Close the loop.

Typically, teams complete the first three with ease. They skip the review step.
The review is what stops incidents from repeating. Write the report so that it assigns no blame and lists clear action items.

The Runbook You Should Write This Week

A runbook is the guide a sleep-deprived engineer follows during a 3 AM emergency. It gives specific instructions for specific failure modes. I maintain one for every service that I manage. Here is the minimum it needs:

- Symptoms: What does the alert show? What does the dashboard look like?
- First check: One command to confirm the diagnosis without making anything worse.
- Mitigation: The fastest path to stopping user impact. Even if it is not the permanent fix.
- Escalation: Who to call and when. After thirty minutes without progress, someone else gets paged.
- Done state: What does success look like, and when exactly do you close the incident?

That final point matters more than most people think. Without a defined done state, incidents drag on indefinitely, and engineers keep debugging with no clear signal that users are no longer affected.

Game Days: Practice Before the Real Thing

Once a quarter, break something on purpose. Pick a staging or other non-production environment. Run the rollback and time it.

My first attempt at this with a new team revealed that three of the four documented rollback steps no longer worked. The infrastructure had changed, and nobody had noticed. We found that out on a Tuesday afternoon, not during a Friday night incident. That single exercise saved us from exactly that.
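One way to make "run the rollback and time it" concrete is a small script that runs each documented step in order and records how long it takes. This is a minimal sketch, not a definitive tool: `time_rollback` is a hypothetical helper, and the step names and commands are placeholders that only invoke the Python interpreter, so substitute the commands from your own runbook.

```python
import subprocess
import sys
import time


def time_rollback(steps):
    """Run each (name, command) step in order and time it.

    Stops at the first failing step, because a broken step is exactly
    what a game day is meant to surface.
    """
    results = []
    for name, command in steps:
        start = time.monotonic()
        completed = subprocess.run(command, capture_output=True, text=True)
        elapsed = time.monotonic() - start
        ok = completed.returncode == 0
        results.append({"step": name, "ok": ok, "seconds": elapsed})
        if not ok:
            break
    return results


# Placeholder steps: replace with the real commands from your runbook.
STEPS = [
    ("revert commit", [sys.executable, "-c", "print('revert')"]),
    ("push to main", [sys.executable, "-c", "print('push')"]),
]

if __name__ == "__main__":
    report = time_rollback(STEPS)
    for row in report:
        status = "ok" if row["ok"] else "FAILED"
        print(f"{row['step']:<20} {status:<7} {row['seconds']:.2f}s")
    print(f"total: {sum(r['seconds'] for r in report):.2f}s")
```

Keeping the report from each game day gives you a baseline, so a rollback that has quietly become slower or stopped working shows up as a change in the numbers.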
Run the exercise regularly; it will pay off for you the way it did for us.

The Bottom Line

None of this is hard. Installing Prometheus takes an afternoon. A git revert takes thirty seconds. Writing a runbook takes two hours. Implementing a feature flag takes one sprint.

The hard part is doing this work while everything is still running fine. It has to be done before you need it. The teams that recover in five minutes made the investment on a calm Tuesday, when nothing was broken.

Start with monitoring. Choose one rollback method that fits your architecture and document it. Write a runbook for your most important service. That is enough to begin with. Those three tasks alone will put you ahead of most teams I have worked with.

Sooner or later, a release will break something. Design your system so that the failure gets handled without panic.

By Sandesh Basrur

Building a RAG-Powered Bug Triage Agent With AWS Bedrock and OpenSearch k-NN

Bug triage on a graphics engineering team is one of those tasks nobody really wants to own. A new crash report comes in, and somebody has to work out whether it looks like a known issue, what the stack trace points at, which subsystem the affected code lives in, and which sub-team should pick it up. The answers exist in the issue tracker, the source repo, and the architecture docs, but pulling them together by hand takes time. And the engineers best at it are the ones you least want spending hours on it.

On our team, the archive of resolved bugs had grown to over 1,100 issues. That is a real corpus. It contains the answer to a lot of incoming questions, but only if you can find the right three or four entries quickly. The agent described here does that lookup automatically, combines it with crash log parsing and source code search, and produces a root cause analysis with a confidence score. Triage that used to take hours now takes minutes.

This article is about the architecture choices: why AWS Bedrock with Claude, why OpenSearch with HNSW indexing, why DynamoDB for workflow state, and why ECS Fargate. None of these choices is unique. The reasoning behind them is what's portable.

What the Agent Actually Has to Do

Before the architecture, it's worth being concrete about the work. When a bug report arrives, the agent produces an analysis built on five signals:

- Historical pattern match against the knowledge base of resolved issues.
- Source code match against the repositories the trace points into.
- Crash stack analysis on the trace itself.
- Log evidence from whatever logs were attached or linkable.
- Fix ownership, derived from who has historically fixed bugs in the affected components.

Each signal contributes to a final confidence score. The combination matters because no single signal is reliable on its own. A stack trace can match a bug that was fixed three releases ago, a source-code hit can be unrelated, and ownership data can be stale.
A useful triage answer leans on multiple signals together. That is the work. The architecture exists to support it reliably, repeatedly, and without baking in assumptions that will hurt later. Why RAG, and Why These Pieces The obvious wrong move is to skip retrieval and pass the whole corpus to the model. Context windows aren't the bottleneck people think they are. Even when they're large, signal-to-noise gets bad fast, and cost and latency scale with input size. For any given bug, the relevant slice is small: a few prior tickets, a couple of source files, maybe one architecture doc. Retrieval-augmented generation (RAG) is the right shape because the retrieval layer's job is precisely to find that slice. OpenSearch With HNSW Indexing The knowledge base lives in OpenSearch with vector search over a k-NN HNSW index. HNSW (Hierarchical Navigable Small World) suits corpora in the low thousands to low millions of documents. Query time stays low, and recall stays high without the tuning effort IVF-based indexes demand at smaller scales. OpenSearch was chosen over a dedicated vector database for operational reasons. It runs in the same AWS environment as the rest of the stack, supports keyword and vector search in the same index when you need hybrid retrieval, and doesn't add a new vendor to the diagram. For a team-internal tool, the integration cost of a separate vector DB outweighs the marginal performance gain. Titan Embeddings Embeddings are generated with Amazon Titan. The main reason: the data (bug reports, stack traces, code snippets) never has to leave AWS. That removes a class of compliance questions that come up the moment you start sending source code or internal tickets to an external embedding API. Titan handles technical text well enough for this corpus, and it shares IAM, quotas, and billing with everything else. Claude on Bedrock as the Reasoning Model The reasoning step takes the retrieved context and the parsed crash log and produces the actual analysis. 
It runs on Claude through Bedrock. Two properties matter here. First, Claude handles long, messy, structured input well: stack traces aren't clean prose, and the surrounding context is a mix of code, logs, and ticket descriptions. Second, it expresses uncertainty rather than picking a confident-sounding wrong answer. For a system whose output a human engineer is going to read and either trust or push back on, that calibration matters more than fluency.

The Five-Signal Confidence Score

The most consequential part of the system isn't the model call. It's the scoring layer that wraps it. The agent doesn't just say "this looks like a duplicate of bug X." It produces a confidence score, and that score is what triagers use to decide whether to accept the suggestion or dig in themselves.

The score is a weighted combination of the five signals listed earlier. Each contributes a sub-score; the weights reflect how predictive each signal has been, in this team's experience, of a correct triage outcome.

The interesting design choice is that the weights are not static. Real bug reports don't always include all five signals. Some arrive without attached logs. Some point at code with no clear ownership history. With static weights, missing signals would drag the final score down even when the available signals were strongly aligned. The agent redistributes the weight of any unavailable signal across the available ones, normalized to sum to one. The conceptual shape:

Python

    # Conceptual sketch of dynamic weight adjustment
    # (w1..w5 are placeholders for the team's base weights)
    BASE_WEIGHTS = {
        "historical_match": w1,
        "source_code_match": w2,
        "crash_stack": w3,
        "log_evidence": w4,
        "fix_ownership": w5,
    }

    def adjusted_weights(available_signals):
        active = {k: v for k, v in BASE_WEIGHTS.items()
                  if k in available_signals}
        total = sum(active.values())
        return {k: v / total for k, v in active.items()}

This is a small piece of code that does a disproportionate amount of the work of making the agent's output trustworthy.
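To make the redistribution concrete, here is a runnable version of the same idea with numbers filled in. The weights and sub-scores are invented for illustration (the team's actual values are not given here), and `confidence_score` is a hypothetical helper name.

```python
# Hypothetical base weights; the real values are team-specific.
BASE_WEIGHTS = {
    "historical_match": 0.30,
    "source_code_match": 0.25,
    "crash_stack": 0.20,
    "log_evidence": 0.15,
    "fix_ownership": 0.10,
}


def adjusted_weights(available_signals):
    """Redistribute weight from missing signals across the available ones."""
    active = {k: v for k, v in BASE_WEIGHTS.items() if k in available_signals}
    total = sum(active.values())
    if total == 0:
        raise ValueError("at least one known signal is required")
    return {k: v / total for k, v in active.items()}


def confidence_score(sub_scores):
    """Weighted average of the sub-scores that are actually present.

    sub_scores maps signal name -> value in [0, 1]; missing signals are
    simply left out of the dict.
    """
    weights = adjusted_weights(sub_scores.keys())
    return sum(weights[name] * sub_scores[name] for name in weights)


# Every available signal agrees strongly, but no logs were attached.
without_logs = {
    "historical_match": 0.9,
    "source_code_match": 0.9,
    "crash_stack": 0.9,
    "fix_ownership": 0.9,
}
with_logs = dict(without_logs, log_evidence=0.9)

print(round(confidence_score(without_logs), 3))  # 0.9
print(round(confidence_score(with_logs), 3))     # 0.9
```

With static weights, the report without logs would score 0.85 × 0.9 = 0.765 under these made-up numbers, even though every signal that was present agreed just as strongly.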
A given confidence score should mean roughly the same thing whether the bug arrived with logs or without. DynamoDB for Workflow State A triage run is not a single API call. The agent parses the report, retrieves embeddings, runs vector search, fetches matched documents, pulls source code context, calls the reasoning model, computes the score, and writes results back. Each step can fail or be slow independently. Workflow state for each in-flight triage lives in DynamoDB. The schema is intentionally simple: a triage ID as the partition key, a status field, and the accumulated context. Two reasons it's external rather than in-process memory. First, recovery. If the model call fails or times out, the workflow should resume without redoing the embedding and retrieval work. Token costs add up otherwise. Second, observability. The Flask dashboard the team uses to monitor triage operations reads from this same DynamoDB table. That includes real-time status, filterable history, analytics, and the routing view for issues that don't belong to this team. There is no separate event log to maintain. Workflow state is the source of truth, and the dashboard is a view onto it. ECS Fargate for Orchestration The triage workflow runs on ECS Fargate. The choice is shaped by what the workflow looks like: a sequence of calls to external services (Bedrock, OpenSearch, the issue tracker), with the long pole being model latency. Not CPU-heavy, not bursty. Incoming bugs arrive at a steady rate. Fargate handles this shape cleanly. No cold start, no execution time limit, and the operational model is straightforward: container in, container out, IAM and networking inherited from the cluster. The Flask dashboard runs in the same Fargate cluster, sharing the same VPC and observability tooling. The general pattern: short, stateless, bursty work fits Lambda. Orchestrated workflows with slower external calls and a need for predictable behavior fit Fargate. 
For a team-internal agent that runs continuously, Fargate's properties matter more than its slightly higher baseline cost.

Keeping the Knowledge Base Current

None of this works if the corpus goes stale. The ingestion pipeline syncs three sources continuously: the issue tracker, where newly resolved bugs become new entries; the documentation repo; and the source code repositories, which provide both file content and ownership signal. The pipeline is fully automated. New content is chunked, embedded with Titan, and indexed in OpenSearch without manual intervention.

Ingestion is decoupled from query. They share the index but nothing else, so a slow ingestion run never affects live triage latency, and a problematic batch can be rolled back without touching the query path.

What's Worth Taking From This

The model layer (Bedrock, Claude, Titan) is interchangeable. Swap them for OpenAI plus their embeddings, or for a self-hosted setup, and the architecture still works. What is not interchangeable, or not easily, is the shape of the rest:

- Retrieval before reasoning. Don't ask the model to do retrieval against a large corpus. Get the relevant slice with a dedicated retrieval layer, then hand it over with a tight prompt.
- Multiple signals with dynamic weights. Single-signal confidence scores break under real-world data. Multiple signals with weight redistribution handle the cases where inputs are incomplete.
- Persist workflow state externally. Even for short workflows, having state in a queryable store pays off in failure recovery and gives the dashboard a single source of truth.
- Decouple ingestion from query. They have different reliability requirements and should be able to fail independently.
- Match compute to workload shape. Fargate for orchestrated, latency-tolerant workflows. The wrong choice here shows up later as cold starts, timeouts, or surprise bills.

The agent has been doing useful work since it shipped.
The thing that took the longest to get right wasn't any single component. It was the scoring layer and the decision to make state external. Those are the parts that determine whether a system like this is something the team relies on or something the team works around.
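The external-state decision can be sketched as a step runner that checkpoints after every step and skips completed steps on resume. This is an illustration under stated assumptions, not the team's implementation: `InMemoryStateStore` stands in for the DynamoDB table so the example runs without AWS, and `run_triage`, the step names, the ticket IDs, and the record layout (`status`, `context`) are all hypothetical.

```python
class InMemoryStateStore:
    """Stand-in for the DynamoDB table, keyed by triage ID."""

    def __init__(self):
        self._items = {}

    def load(self, triage_id):
        item = self._items.get(triage_id)
        if item is None:
            return {"status": "new", "context": {}}
        return {"status": item["status"], "context": dict(item["context"])}

    def save(self, triage_id, record):
        # Copy so later mutation by the caller cannot change saved state.
        self._items[triage_id] = {
            "status": record["status"],
            "context": dict(record["context"]),
        }


def run_triage(triage_id, steps, store):
    """Run steps in order, checkpointing after each one.

    steps is a list of (name, fn) pairs; fn takes the accumulated context
    and returns that step's result. Steps whose result is already stored
    are skipped, so a retry resumes where the last run failed.
    """
    record = store.load(triage_id)
    for name, fn in steps:
        if name in record["context"]:
            continue  # finished in an earlier attempt
        record["status"] = f"running:{name}"
        store.save(triage_id, record)
        record["context"][name] = fn(record["context"])
        store.save(triage_id, record)
    record["status"] = "done"
    store.save(triage_id, record)
    return record


calls = []
attempts = {"model": 0}


def retrieve(context):
    calls.append("retrieve")
    return ["BUG-101", "BUG-207"]  # made-up ticket IDs


def call_model(context):
    calls.append("model")
    attempts["model"] += 1
    if attempts["model"] == 1:
        raise TimeoutError("simulated model timeout")
    return f"likely duplicate of {context['retrieve'][0]}"


steps = [("retrieve", retrieve), ("model", call_model)]
store = InMemoryStateStore()

try:
    run_triage("triage-1", steps, store)
except TimeoutError:
    pass  # first attempt fails at the model call

result = run_triage("triage-1", steps, store)  # resume from the checkpoint
print(result["status"], calls)  # done ['retrieve', 'model', 'model']
```

The retrieval step runs once even though the workflow ran twice, which is the recovery property described above. Because the status field is written before each step starts, a dashboard reading the same store would see which step a stalled triage is stuck on.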

By Rajasekhar sunkara

Frame Buffer Hashing for Visual Regression on Embedded Devices

I run test automation for a graphics team that ships software to streaming devices. About a year ago, we changed how our visual regression suite stores and compares its references. The old approach kept around 18GB of PNG golden images in the test repo and ran a pixel-by-pixel diff on every comparison. The new approach stores around 19KB of MD5 hashes in a JSON file and compares hash strings. Storage dropped by roughly three orders of magnitude. Comparisons became effectively free. A category of flaky tests stopped being flaky. This article is about how that works, when it makes sense, and when it doesn't. It also covers the parts that surprised me, because the approach has real downsides and I want to be honest about them up front. How It Works The idea is simple once the constraints are right. On the embedded devices we test, we have access to the raw GPU frame buffer through the graphics stack. The test harness reads it as a bytes object, computes an MD5 hash of those bytes, and compares the hash against a stored reference. If the hashes match, the test passes. If they don't match, the test captures the actual frame and saves it as a failure artifact for a human to look at. The stored reference is a 32-character hex string per screen, kept in a JSON file checked into the test repo alongside the test code. 
The full implementation is short:

Python

    import hashlib
    import json
    from pathlib import Path

    REFERENCE_FILE = Path("references/visual_hashes.json")

    def frame_hash(frame_bytes: bytes) -> str:
        """MD5 of the raw GPU frame buffer."""
        return hashlib.md5(frame_bytes).hexdigest()

    def load_references() -> dict:
        if REFERENCE_FILE.exists():
            return json.loads(REFERENCE_FILE.read_text())
        return {}

    def check_frame(test_id: str, frame_bytes: bytes,
                    references: dict) -> tuple[bool, str]:
        """Returns (passed, actual_hash)."""
        actual = frame_hash(frame_bytes)
        expected = references.get(test_id)
        if expected is None:
            return False, actual  # no reference yet
        return actual == expected, actual

    def on_failure(test_id: str, frame_bytes: bytes, actual: str):
        """Only called when hashes diverge. Save the frame for review."""
        artifact_dir = Path(f"artifacts/{test_id}")
        artifact_dir.mkdir(parents=True, exist_ok=True)
        (artifact_dir / f"{actual}.raw").write_bytes(frame_bytes)

That's essentially the whole system. Because the references are text, intentional UI changes show up as normal source-control diffs in code review instead of opaque binary blob swaps. Because the comparison is string equality on a hex digest, it's effectively instant regardless of frame size.

Why MD5 Specifically

MD5 is cryptographically broken. You can construct collisions on demand, and using it for password storage or signature verification is malpractice. None of that matters here. Visual regression testing is not a cryptographic problem. The two inputs being compared are the rendered output of our own GPU yesterday and the rendered output of our own GPU today. There is no adversary trying to construct a frame buffer that hashes to a specific value.

What you actually need from a hash function in this context is fast computation, low accidental collision rate on real-world inputs, and stable output across runs and platforms. MD5 covers all three.
The accidental collision probability between two different rendered frames at typical buffer sizes is small enough that we have not encountered one. SHA-256 covers the same three properties at slightly higher CPU cost. If the cryptographic concern is going to come up in code review every quarter, just use SHA-256. The Conditions That Have to Hold This approach only works when three things are true about your environment. The first is access to the raw frame buffer before any encoding step. Browser-based testing, mobile UI testing through the standard automation frameworks, and most desktop application testing give you a captured screenshot, which has been through some encoding step before you see it. PNG encoders can vary across versions, and two systems can render the same pixels and produce different PNG files. If your only access point is a captured screenshot, you are comparing post-encoding output, and encoder noise will sink hashing. On embedded devices with a graphics stack you control, you usually do have raw frame buffer access, which is why this worked for us. The second condition is that the rendering pipeline has to be deterministic. Same input, same GPU state, same output bytes. If antialiasing produces different pixels for the same logical input from one run to the next, or if time-based animations get sampled at slightly different moments, or if the GPU driver rounds inconsistently, the hashes will diverge for reasons that aren't real bugs. In our case, the pipeline is deterministic, so this isn't a problem. In a lot of environments, it isn't, and you would need pixel-diff with a tolerance threshold or perceptual hashing to handle the noise. The third condition is that capture points have to be stable. The test harness has to call the capture function at the same logical point in the pipeline every run, after the same set of operations. This is usually the easiest of the three to engineer. 
Frame buffer access either exists or it doesn't, and determinism is sometimes a property you can't change. Capture point stability is just a discipline about where you instrument your tests. If any of these three conditions fail, frame buffer hashing is the wrong tool. Pixel-diff with a tolerance threshold is the right default for most setups, and perceptual hashing covers the middle ground where you have raw access but some non-determinism. The narrow case this article is about is the one where all three hold. What You Give Up The biggest tradeoff is failure diagnosis. With golden images, when a test fails, you have a stored reference and a new screenshot, and you can render a side-by-side diff or an overlay highlighting the changed pixels. With hash comparison, you have two strings that don't match. The failure handler captures the actual frame on the spot, but the reference image (which doesn't exist anymore in storage) has to be reconstructed by running the same test against a known-good build whenever you want to do a side-by-side comparison. That extra step is annoying when failures are common. In our case, they aren't, so the cost is manageable. If your suite has a high baseline failure rate, the math changes, and you may want to keep both the hashes and the reference images, using the hash for fast pass/fail detection and the image only for diagnosis. The other thing you give up is fuzzy matching, but that's the same point as the determinism condition. Fuzzy matching exists to compensate for non-determinism in the rendering pipeline. If your pipeline is deterministic, you don't need it. If it isn't, you do, and hashing won't work. What It Changed for Us Storage going from 18 GB to 19 KB is the change people notice first, but the second-order effects matter more in day-to-day work. Repository operations got faster because the test repo no longer carries gigabytes of binary history. Cloning a fresh checkout takes a fraction of the time it used to. 
PR reviews got cleaner because UI changes show up as readable JSON diffs instead of opaque PNG swaps. The flaky-test rate from encoder noise dropped to zero, which was the change that got the most attention from people on the team. Some of the old goldens had been re-saved at some point with slightly different encoder settings, and tests would fail mysteriously even though the rendered pixels were identical to the human eye. The only fix had been to regenerate the golden, which nobody really trusted. Removing the encoder from the comparison loop removed the entire class of failure. CI runs got faster, too, because hash comparison is essentially free compared to image diffing. None of these wins is novel; Skia, PDFium, and the apitrace project have used hash-based comparison of rendered output for years. What was new for us was committing to it as the primary mechanism for an entire UI test suite on embedded hardware, and accepting the implication that the stored reference is text rather than a binary asset. If you're working in an environment where the three conditions hold, the implementation is small enough that a prototype takes a day. If even one of them is missing, this isn't the right tool, and the alternatives are well understood. The interesting part is recognizing which environment you're actually in.
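A typical workflow around the hash file looks like this sketch: record a reference when a screen is first approved, then compare later captures against it. It is self-contained and uses synthetic bytes in place of a real frame buffer; `save_references`, `approve_frame`, and the temporary file location are hypothetical names introduced here for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def frame_hash(frame_bytes: bytes) -> str:
    """MD5 of the raw frame bytes (swap in hashlib.sha256 if preferred)."""
    return hashlib.md5(frame_bytes).hexdigest()


def load_references(path: Path) -> dict:
    return json.loads(path.read_text()) if path.exists() else {}


def save_references(path: Path, references: dict) -> None:
    # Sorted keys and one entry per line keep source-control diffs small.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(references, indent=2, sort_keys=True) + "\n")


def approve_frame(path: Path, test_id: str, frame_bytes: bytes) -> str:
    """Record the current frame as the new reference for test_id."""
    references = load_references(path)
    references[test_id] = frame_hash(frame_bytes)
    save_references(path, references)
    return references[test_id]


with tempfile.TemporaryDirectory() as tmp:
    ref_file = Path(tmp) / "references" / "visual_hashes.json"

    # Synthetic 4x4 RGBA "frame": 64 bytes of opaque red.
    good_frame = bytes([255, 0, 0, 255]) * 16
    approve_frame(ref_file, "home_screen", good_frame)

    references = load_references(ref_file)
    same = frame_hash(good_frame) == references["home_screen"]

    # Change a single byte to simulate a one-pixel rendering difference.
    changed_frame = bytes([254]) + good_frame[1:]
    different = frame_hash(changed_frame) != references["home_screen"]

    print(same, different)  # True True
```

Writing the JSON with sorted keys is a small design choice that supports the review benefit described above: approving one screen changes exactly one line in the diff.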

By Rajasekhar sunkara

Amazon Quick: AWS's Agentic Workspace, Explained for Engineers

AWS has been building agentic infrastructure for some time now (Bedrock, AgentCore, Strands), mostly aimed at engineers who want to build their own agent systems from scratch. Amazon Quick is a different layer of the same bet: a ready-to-use agentic workspace that targets teams directly, without requiring custom orchestration code.

This article walks through what Quick is, how its components fit together technically, how the MCP integration model works with real code, and where it sits relative to the rest of AWS's agent stack.

What Amazon Quick Is

Amazon Quick is an AI assistant for work that connects to your existing tools (Slack, Microsoft Teams, Outlook, CRMs, databases, and local files) and gives a unified layer for querying, automating, and acting across them. It launched in preview at AWS's "What's Next with AWS" event on April 28, 2026.

The product is aimed at teams, not just individual users. One person can build a custom agent scoped to a specific dataset or workflow, and the whole team benefits from it. Responses from Quick agents are grounded in your actual business data, not the underlying model's training distribution.

Under the hood, Quick is built on Amazon Bedrock AgentCore and uses the Model Context Protocol (MCP) as its standard for connecting to external tools. It runs on AWS IAM and VPC, which means it inherits the same security and compliance posture as the rest of your AWS workloads.

Components

Quick bundles five distinct capabilities. It helps to understand each one separately before thinking about how they compose.

| Component | What it does |
| Spaces | Collaborative workspaces where teams pool files, dashboards, and data sources. Agents in a Space are grounded in that Space's data. |
| Agents | Custom, domain-scoped agents built on your team's specific data. One person builds, everyone uses. |
| Research | Multi-source synthesis across internal data, the public web, and third-party datasets. Produces structured reports. |
| Visualize (Quick Sight) | Integrated BI layer. Conversational access to dashboards, charts, and forecasting, with no separate BI tool required. |
| Automate (Quick Flows) | Workflow automation from simple daily tasks to complex multi-step processes with cross-app action execution. |

Each component is available through the web app, mobile, and a native desktop app (currently in preview for macOS and Windows) that can read local files and calendar context without requiring browser access.

Where Quick Sits in the AWS Agent Stack

AWS is building in two directions at once. AgentCore is the infrastructure layer for engineers who want to compose their own agent systems (runtime, memory, gateway, observability) with any model and any framework. Quick is the product layer on top: opinionated, team-facing, and deployable without writing orchestration code.

The practical implication: if you're an engineer building internal tools or automation pipelines, you'll likely interact with both layers. AgentCore for the infrastructure wiring; Quick as a surface where non-technical teammates interact with the agents you build.

The Integration Architecture

The core question for any engineer evaluating Quick is: how does it actually connect to external systems, and what does the request path look like?

Quick uses MCP (Model Context Protocol) as its primary integration standard. This is significant because MCP is an open protocol: Quick agents are not locked into AWS-specific connectors, and any MCP-compatible server can be registered as a tool source.

High-Level Request Flow

The sequence below shows the full lifecycle of a single agent-triggered tool call, from the moment Quick receives a prompt through to the response returning from a downstream API.

Quick acts as the MCP client. Your MCP server exposes tools via listTools and callTool. Quick discovers them at registration time and makes them available to any agent or automation in the workspace.
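On the wire, those two operations are JSON-RPC 2.0 requests with the MCP method names `tools/list` and `tools/call`. The sketch below only builds and prints the message shapes so the request path is easier to picture. The helper names and the sample tool name and arguments are illustrative, and a real client would use an MCP SDK, which also handles initialization and transport, rather than hand-building these messages.

```python
import json
from itertools import count

# Simple incrementing request IDs, as JSON-RPC requires a unique id per request.
_ids = count(1)


def list_tools_request() -> dict:
    """JSON-RPC request a client sends to discover a server's tools."""
    return {"jsonrpc": "2.0", "id": next(_ids), "method": "tools/list"}


def call_tool_request(name: str, arguments: dict) -> dict:
    """JSON-RPC request that invokes one tool with its arguments."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


discover = list_tools_request()
invoke = call_tool_request("get_ticket", {"issue_key": "ENG-1234"})

print(json.dumps(discover))
print(json.dumps(invoke, indent=2))
```

The `arguments` object is what the server validates against the tool's declared input schema, which is why precise schemas and descriptions matter for tool selection.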
Authentication flows through OAuth 2.0, with support for Dynamic Client Registration (DCR) so Quick can register itself automatically without manual credential setup.

Building an MCP Server for Quick

Here is a minimal Python MCP server using the mcp SDK that exposes two tools Quick can invoke: get_ticket and list_open_tickets. This pattern works whether you host the server yourself or run it on AgentCore Runtime.

Install Dependencies

Shell

    pip install "mcp[server]" httpx uvicorn

Server Implementation

Python

    # server.py
    from mcp.server import Server
    from mcp.server.sse import SseServerTransport
    from mcp.types import Tool, TextContent
    import httpx
    from starlette.applications import Starlette
    from starlette.responses import Response
    from starlette.routing import Mount, Route

    app = Server("jira-quick-integration")

    JIRA_BASE_URL = "https://yourorg.atlassian.net"
    JIRA_TOKEN = "Bearer <your-token>"  # in production, load from AWS Secrets Manager

    @app.list_tools()
    async def list_tools() -> list[Tool]:
        return [
            Tool(
                name="get_ticket",
                description="Retrieve details for a single Jira ticket by issue key.",
                inputSchema={
                    "type": "object",
                    "properties": {
                        "issue_key": {
                            "type": "string",
                            "description": "The Jira issue key, e.g. ENG-1234"
                        }
                    },
                    "required": ["issue_key"]
                }
            ),
            Tool(
                name="list_open_tickets",
                description="List open Jira tickets assigned to a given user.",
                inputSchema={
                    "type": "object",
                    "properties": {
                        "assignee": {
                            "type": "string",
                            "description": "The Jira username or email of the assignee"
                        }
                    },
                    "required": ["assignee"]
                }
            )
        ]

    @app.call_tool()
    async def call_tool(name: str, arguments: dict) -> list[TextContent]:
        headers = {"Authorization": JIRA_TOKEN, "Content-Type": "application/json"}
        async with httpx.AsyncClient() as client:
            if name == "get_ticket":
                key = arguments["issue_key"]
                resp = await client.get(
                    f"{JIRA_BASE_URL}/rest/api/3/issue/{key}",
                    headers=headers
                )
                resp.raise_for_status()
                data = resp.json()
                summary = data["fields"]["summary"]
                status = data["fields"]["status"]["name"]
                return [TextContent(type="text", text=f"{key}: {summary} [{status}]")]
            elif name == "list_open_tickets":
                assignee = arguments["assignee"]
                # Quote the value so emails and names with spaces are valid JQL
                jql = f'assignee = "{assignee}" AND status != Done ORDER BY updated DESC'
                resp = await client.get(
                    f"{JIRA_BASE_URL}/rest/api/3/search",
                    headers=headers,
                    params={"jql": jql, "maxResults": 20}
                )
                resp.raise_for_status()
                issues = resp.json().get("issues", [])
                results = [
                    f"{i['key']}: {i['fields']['summary']}"
                    for i in issues
                ]
                return [TextContent(type="text", text="\n".join(results) or "No open tickets found.")]
        raise ValueError(f"Unknown tool: {name}")

    # Wire up SSE transport for Quick compatibility
    sse = SseServerTransport("/messages/")

    async def handle_sse(request):
        async with sse.connect_sse(
            request.scope, request.receive, request._send
        ) as streams:
            await app.run(streams[0], streams[1], app.create_initialization_options())
        return Response()

    starlette_app = Starlette(
        routes=[
            Route("/sse", endpoint=handle_sse),
            # Client-to-server messages are POSTed to this path
            Mount("/messages/", app=sse.handle_post_message),
        ]
    )

    if __name__ == "__main__":
        import uvicorn
        uvicorn.run(starlette_app, host="0.0.0.0", port=8080)

A few design constraints to be aware of when building for Quick:

- Each MCP tool call has a 300-second hard timeout. Operations that exceed this fail with HTTP 424. Keep individual tool calls narrow and fast.
- The tool list is treated as static after registration. If you add or remove tools on the server, the Quick admin must re-establish the connection to pick up changes.
- Quick supports both Server-Sent Events (SSE) and streamable HTTP as transports. Streamable HTTP is preferred for new implementations.

Registering the MCP Server in Quick

Once your server is running and publicly reachable over HTTPS, registration in Quick takes the following path:

    Quick Console → Integrations → Add Integration → MCP

    Fields:
      Server URL:        https://your-mcp-server.example.com/sse
      Auth type:         OAuth 2.0 (or Service, or None)
      Client ID:         <from your identity provider>
      Authorization URL: https://auth.example.com/oauth/authorize
      Token URL:         https://auth.example.com/oauth/token

If your identity provider supports OAuth Dynamic Client Registration, Quick will auto-register and you skip the manual client ID step entirely. Quick sends an initial unauthenticated request to the MCP server; if it receives a 401 with a WWW-Authenticate header containing a resource_metadata URL, it fetches the metadata document and proceeds with DCR automatically.

Once registered, Quick calls listTools at startup and exposes every discovered tool to agents and automations in the workspace.

The AgentCore Gateway Option

For teams that don't want to write and operate an MCP server from scratch, Amazon Bedrock AgentCore Gateway provides a managed alternative. You point Gateway at a Lambda function or an OpenAPI spec, and it handles the MCP wrapping, auth, logging, and semantic tool discovery automatically. If you use it, Quick never calls your internal APIs directly; everything flows through Gateway's auth and routing layer, as shown in the sequence diagram above.

The semantic search capability is worth noting specifically. When an agent has access to dozens or hundreds of tools, passing the full tool list on every turn wastes context and causes the model to pick the wrong tool.
Gateway's built-in x_amz_bedrock_agentcore_search tool lets Quick find the right tool by semantic similarity rather than scanning the entire registry each turn.

Practical Considerations

A few things worth keeping in mind before integrating:

Tool scope matters. When agents are given too many tools simultaneously, selection accuracy degrades: the model reasons over too many options per turn and picks incorrectly more often. Keeping each agent or MCP server to a focused set of 3–5 tools produces better results than exposing everything through one endpoint. This is a known pattern in multi-agent architectures and applies equally to Quick agents.

The 300-second timeout is real. Design each tool call to complete a single, bounded operation. Avoid chaining multiple downstream API calls inside a single tool invocation. If you need a multi-step workflow, model it as separate tools and let the agent orchestrate the sequence.

Local context on the desktop app. The desktop app reads local files and calendar events directly, without upload. For engineers who work primarily in terminals and local editors, this is a meaningful integration point: meeting context, local documentation, and recent file changes are all available to the assistant without any configuration.

MCP interoperability. Because Quick uses MCP as the standard, the same MCP server you build for Quick can also be consumed by Claude Code, Amazon Q Developer, and other MCP-compatible clients. The integration contract is portable.

References

Amazon Quick: Product overview and features
Integrate external tools with Amazon Quick Agents using MCP (AWS ML Blog, Feb 2026)
MCP integration: Amazon Quick User Guide
Amazon Bedrock AgentCore: Overview and documentation
Introducing Amazon Bedrock AgentCore Gateway (AWS ML Blog)
Top announcements of the What's Next with AWS, 2026 (AWS News Blog, Apr 2026)

By Jubin Abhishek Soni

CORE

Agentic AI Has an Observability Blind Spot Nobody Is Talking About

Here is what a production cascade looks like when nobody did anything wrong.

An alert fires on a microservice showing elevated latency. The signal is accurate. The automated remediation agent picks it up immediately and does exactly what it was built to do: restart the affected service and reroute traffic. The action is within scope, the credentials are valid, and three seconds later, the platform reports a successful remediation. Then, four dependent services go dark.

The postmortem will call it a cascade. The dashboard will show a clean execution on the first incident and a second incident opening 90 seconds later. Nobody will find an error log on the remediation itself because there was none. The agent was not wrong. The action was technically correct. What nobody had built was the ability to ask: given what the system is carrying right now, is this the moment to add more disruption to it?

That is not a monitoring gap. Monitoring told everyone exactly what was broken. It is an observability architecture gap: the difference between knowing what is failing and knowing whether the system can safely absorb what you are about to do to fix it.

Figure 1: The alert was correct. The instrumentation gap was not in detection; it was in the question asked before acting.

The Failure Pattern Is More Consistent Than Teams Expect

I ran into this structurally while doing chaos engineering on enterprise SD-WAN infrastructure at Cisco. We were running experiments against production-grade environments across large financial services and telecom customers, and standard chaos tooling kept finding the wrong failures. It was injecting faults into systems whose state had already shifted past the parameters we had set at the start of the experiment.
The faults that caused real damage were the ones that chained with conditions already present in the environment: elevated resource utilization two services over, a background process that had been running for 45 minutes and consuming memory that a restarted service needed, a connection pool sitting at 89 percent because of an unrelated batch job. None of those conditions was hidden. Every one of them was instrumented. The problem was that nobody was reading them together as a composite signal before deciding how hard to push the system. We were answering the wrong question.

We built a methodology to fix it. Instead of setting static experiment parameters, the engine reads live telemetry before each iteration, derives from that telemetry the system's current capacity to absorb perturbation, and calibrates the intervention intensity accordingly. A feedback loop between the actual impact and the intended impact across successive iterations finds the behavioral boundary without disabling the environment. That methodology became USPTO Patent No. US12242370B2.

Patent: https://patents.google.com/patent/US12242370B2/en

What we built for SD-WAN infrastructure is the same thing agentic AI deployments need now. The underlying problem is identical: an automated actor is making decisions about whether and how to intervene in a live system, using a signal that accurately describes what is broken but says nothing about what the system can safely absorb in the moment the decision is made.

Why AWS FIS and Gremlin Will Not Find This for You

Infrastructure fault injection is good at what it does. AWS FIS, Gremlin, and Chaos Toolkit test whether your Lambda survives throttling, whether the event pipeline recovers from a queue outage, and whether the hosting environment holds up under resource pressure. These are legitimate questions, and the tooling answers them well. They just do not test the failure mode that is generating the most expensive incidents as agentic AI deployments scale.
An agent's worst production failure is not a cold start timeout or a concurrency breach. It is a clean, successful invocation that executes the wrong sequence, because the combination of inputs, tool call results, and current system state put the agent at the edge of its operational envelope, and nobody built a test that ever got it there. Air Canada's chatbot did not crash. It executed correctly in a scenario the designers never tested. No infrastructure fault injection exercise would have found that boundary because the boundary was not in the infrastructure.

The same structure shows up in autonomous remediation. The agent reads a real signal, takes a valid action within its authorized scope, and produces an outcome nobody intended because the action was correct in isolation but wrong given the composite state around it. Standard tooling reports a clean execution. The cascade shows up in the next incident ticket.

Finding the behavioral boundary requires a test methodology that reads live system state before calibrating experiment intensity, not one that applies static parameters to a system whose state has already shifted. Static parameters applied to dynamic systems find the failure modes you designed the test to find. They miss the ones that actually hurt.

Three Instrumentation Gaps to Close Before Your Agents Hold Production Credentials

These did not come from a research paper. They came from postmortems: at Cisco across financial services and telecom customers, and at Splunk across thousands of enterprise observability deployments. The same three gaps show up every time.

1. Concurrent workload state across the dependency graph, not just the service under incident.

A service restart that is safe in isolation is frequently dangerous when adjacent services are already running above their normal resource ceilings. The absorb capacity question is a system-level question, not a component-level one.
Most runbooks do not include a pre-action resource check across the dependency graph of the service being touched. Automated agents have no reason to be different.

What to build: a pre-action query that checks whether any first-degree dependency of the service being remediated is above 80 percent of its 24-hour baseline utilization. One data point. It exists in most observability stacks already. It is almost never surfaced in an incident context.

2. Pending operations competing for the same recovery resources.

A recovering service needs I/O headroom during the 60 to 90 seconds after restart while it rebuilds its in-memory state. A background index rebuild consuming 30 percent of available I/O is invisible to the incident response flow because it is not itself failing. It does not show up in any alert. It shows up in the postmortem as a contributing factor.

What to build: a pre-action inventory query against active background and scheduled operations on the same infrastructure tier as the remediation target. Not continuous monitoring, just one read before acting.

3. Intervention intensity matched to current system state, not last month's playbook.

The remediation that worked last Tuesday was calibrated to last Tuesday's system state. Applying it at the same intensity to a system currently carrying three extra loads is not a reliable practice; it is reusing a number that made sense in a context that no longer exists.

Every automated remediation action should answer one question before executing: Is the system in the same absorb capacity range as when this intervention was validated? If it is not, stage the action, reduce intensity, or escalate. This is not complicated engineering. It is a check that almost nobody has built.

The automation is not the problem. The automation acting without a pre-action absorb capacity check is the problem. Building that check is a day's work. Not building it is how you get cascades that look like they came from nowhere.
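The first two checks above can be sketched as a single read before acting. This is a minimal illustration, not the author's implementation: the names `DependencyReading`, `BackgroundOp`, and `pre_action_check` are hypothetical, the baseline is assumed to be a 24-hour reference utilization supplied by your observability stack, and the rule that two findings escalate while one finding stages the action is an assumption made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PROCEED = "proceed"
    STAGE = "stage"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class DependencyReading:
    """One first-degree dependency of the service being remediated."""
    name: str
    current_utilization: float  # latest reading, e.g. 0.65 for 65 percent
    baseline_24h: float         # assumed: 24-hour reference utilization


@dataclass(frozen=True)
class BackgroundOp:
    """An active background or scheduled operation."""
    name: str
    tier: str  # infrastructure tier the operation runs on


def is_stressed(dep: DependencyReading, ratio: float) -> bool:
    # No usable baseline means we cannot judge capacity, so fail closed.
    if dep.baseline_24h <= 0:
        return True
    return dep.current_utilization / dep.baseline_24h > ratio


def pre_action_check(deps, ops, target_tier, ratio=0.80):
    """Return (Decision, reasons) from one read of dependencies and ops."""
    hot = [d.name for d in deps if is_stressed(d, ratio)]
    busy = [op.name for op in ops if op.tier == target_tier]

    reasons = [f"dependency above baseline ratio: {n}" for n in hot]
    reasons += [f"background operation on target tier: {n}" for n in busy]

    # Assumed policy: both kinds of pressure escalate, one kind stages.
    if hot and busy:
        return Decision.ESCALATE, reasons
    if hot or busy:
        return Decision.STAGE, reasons
    return Decision.PROCEED, reasons


if __name__ == "__main__":
    deps = [
        DependencyReading("auth", 0.40, 0.70),
        DependencyReading("db", 0.65, 0.70),
    ]
    ops = [BackgroundOp("index-rebuild", "storage")]
    decision, reasons = pre_action_check(deps, ops, target_tier="storage")
    print(decision.value, reasons)
```

The check returns its reasons alongside the decision so that a staged or escalated action carries the evidence into the incident context instead of leaving it for the postmortem.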
"We were validating system health, not output integrity. That experience changed how we define resilience; it is no longer just about systems staying up but about systems staying correct under stress." — John Russo, VP Healthcare Technology Solutions, OSP Labs Which Automated Actions Need This Check and How Urgently Not every intervention carries the same absorb capacity risk. Here is a working classification based on what I have watched produce incidents. The cluster restart and downstream workflow rows are where most of the expensive postmortems come from. Intervention Absorb Risk Minimum Pre-Action Check Automate or Escalate Read-only diagnostics (health checks, metric queries, log pulls) Very Low None Fully automatable, no check needed Traffic rerouting (LB weight shifts, circuit breaker trips) Low to Medium Downstream service vs. 24hr baseline Automate with dependency check; escalate if downstream >75% baseline Single service restart (pod recycle, instance restart) Medium I/O headroom + active background ops on same tier Automate if headroom clear; escalate if background ops active Cluster-level restart (rolling or full, multiple instances) High Full dependency graph resource state + pending ops inventory Stage the restart; never run under pre-existing cross-service stress Config or schema change (feature flags, parameter updates) High to Very High All checks + rollback path validated Human review required outside the nominal absorb capacity range Agent-initiated downstream workflow (external API calls, cross-service triggers) Very High (often irreversible) Intent-execution separation + full pre-action assessment Human authorization unless the action is fully reversible Table 1: The cluster restart and downstream workflow tiers are where most production cascades originate. The check is cheap. The postmortem is not. How to Build the Absorb Capacity Layer Adding absorb capacity as a first-class observable does not mean replacing what you have. 
Your existing metrics, traces, and logs are doing their job. The gap is not in those signals; it is in the layer that reads them together and produces a single pre-action number before any automated intervention fires.

The architecture has three parts.

First, a live absorb capacity index: a rolling calculation across the dependency graph of each critical service, reading resource utilization deltas against the 24-hour baseline, shared connection pool saturation, active background operation inventory, and concurrent workload state. Output is a single number per service cluster: current absorb capacity as a percentage of the validated intervention tolerance.

Second, an intervention intensity governor that reads that number before any automated remediation executes. If the index is within range, the action proceeds. If not, the governor selects a reduced-intensity variant, stages the action, or sends it to human review. It does not touch the remediation logic. It gates execution.

Third, a behavioral boundary testing loop adapted from the intent-based chaos engineering methodology in Patent US12242370B2. Periodic pre-production tests read live telemetry, derive calibrated adversarial pressure from the current absorb capacity model, and use an actual-versus-intended impact feedback loop to keep the model current. Without this loop, the pre-action check is comparing today's system state against a capacity model that was valid when you built it six months ago.

Figure 2: The absorb capacity layer sits between existing observability and the autonomous agent. The behavioral testing loop (Patent US12242370B2) keeps the capacity model current as the system evolves over time.

The Check That Almost Nobody Has Built

Most teams I have worked with have good observability. The signals are there. The alerting is tuned. The dashboards show what is failing in real time.
What they have not built is the layer that reads all of it together and answers a different question: not what is broken, but whether the system is in a state that can take what you are about to do to it.

Autonomous remediation agents and agentic AI systems make that question urgent in a way it was not when the decision-maker was a human engineer with pattern recognition built over years. The human hesitated. They glanced at adjacent services. They asked the on-call SRE if anything else was running before they pushed the big red button. The agent does not hesitate. It reads the signal, acts within scope, and files the result as success.

RL-calibrated infrastructure failures are recoverable. A cluster goes down, the runbook fires, the service comes back. Behavioral failures in systems with real external side effects (agents that trigger downstream workflows, confirm transactions, modify records across services) are not always recoverable in the same way. The damage lands in external systems before any alert fires.

Adding absorb capacity as a first-class observable is not a large infrastructure project. The signals you need are already in your stack. The composite read, the pre-action check, the governor that gates execution: none of this requires new technology. It requires deciding to ask the right question before the agent acts, and building the thin layer that makes that question answerable in real time.

The observability you have is telling the truth. It is just not telling the whole truth yet.
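To make the "thin layer" concrete, here is a small sketch of a composite index feeding a governor that gates execution without touching the remediation logic. Everything in it is an assumption for illustration: the article does not give a formula for the index, so `absorb_capacity_index` simply compares current headroom (one minus the most saturated signal) with the headroom that was present when the intervention was validated, and the `proceed_at` and `reduce_at` thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional


@dataclass(frozen=True)
class Outcome:
    decision: str  # "executed", "reduced", or "review"
    index: float   # absorb capacity index at decision time


def absorb_capacity_index(
    utilization: Mapping[str, float], validated_headroom: float
) -> float:
    """Current headroom as a percentage of the validated headroom.

    Assumed formula: headroom is 1 minus the most saturated signal
    (CPU, connection pool, I/O, ...), each expressed as 0.0 to 1.0.
    """
    if not utilization or validated_headroom <= 0:
        return 0.0  # nothing to compare against: report no capacity
    current_headroom = 1.0 - max(utilization.values())
    return max(0.0, 100.0 * current_headroom / validated_headroom)


def govern(
    index: float,
    full_action: Callable[[], None],
    reduced_action: Optional[Callable[[], None]] = None,
    proceed_at: float = 100.0,  # hypothetical threshold
    reduce_at: float = 60.0,    # hypothetical threshold
) -> Outcome:
    """Gate execution on the index; the actions themselves are untouched."""
    if index >= proceed_at:
        full_action()
        return Outcome("executed", index)
    if index >= reduce_at and reduced_action is not None:
        reduced_action()
        return Outcome("reduced", index)
    # Below range, or no reduced variant available: hand off to a human.
    return Outcome("review", index)


if __name__ == "__main__":
    signals = {"cpu": 0.50, "connection_pool": 0.89}
    index = absorb_capacity_index(signals, validated_headroom=0.30)
    outcome = govern(
        index,
        full_action=lambda: print("rolling restart of all instances"),
        reduced_action=lambda: print("restart a single instance"),
    )
    print(outcome)
```

Taking the most saturated signal rather than an average is a deliberately conservative choice in this sketch: one exhausted connection pool is enough to make a restart unsafe even when every other signal looks healthy.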

By Sayali Patil