PR Agent
The PR Agent validates fixes through automated testing, explores alternatives using multiple AI models, and synthesizes everything into actionable recommendations.
"How would AI solve this problem with ZERO human influence?"
That is the fundamental question behind try-fix and this entire workflow. Each try-fix attempt is the AI solving an issue from scratch — no distractions, no anchoring to an existing PR's approach. We found that when AI is shown a human fix and asked to "come up with alternatives," it gets lazy and just validates the PR ("looks fine to me"). Generating fixes in a bubble prevents this and produces genuinely independent approaches.
A PR review and an issue fix are conceptually the same workflow. The only difference is that with a PR, a human attempt already exists — without one, it doesn't. In both cases the AI's job is identical: understand the bug, generate independent fixes, compare all candidates (including the human attempt if one exists), and select the best one.
The workflow deliberately separates two AI capabilities:
- Smart generation (Try-Fix): AI uses reference material, code-review findings, and codebase knowledge to generate the best fix it can. The goal is smarter context for better generation, not expert review of the output. Each model works in isolation so we can measure AI fix quality independently.
- Expert hardening (Report): After the best fix is selected, the expert reviewer evaluates it for hardening needs: Are the tests thorough? Is the code robust? Are edge cases covered? This is a separate, later step that identifies what the winning fix might still be missing. It is not a gate on try-fix itself.
If expert hardening reveals improvements that the winning fix missed, those insights feed back into improving the try-fix process: Why didn't the AI generate this better approach? What context or knowledge was missing? This creates a virtuous cycle where try-fix gets smarter over time, progressively reducing the need for human-authored fixes.
The long-term goal is that AI-generated fixes become the norm. By having AI solve problems independently and measuring its quality in isolation, we can systematically identify gaps, improve the generation process, and build toward a future where human fixes are the exception rather than the rule.
%%{init: {'theme': 'dark', 'themeVariables': { 'primaryColor': '#1e1e2e', 'primaryTextColor': '#cdd6f4', 'primaryBorderColor': '#45475a', 'lineColor': '#6c7086', 'secondaryColor': '#313244', 'tertiaryColor': '#181825'}}}%%
flowchart LR
subgraph gate ["🧪 GATE"]
direction TB
G1[Detect tests in PR]
G2[Verify tests fail without fix]
G3[Verify tests pass with fix]
G1 --> G2 --> G3
end
subgraph review ["🤖 PR REVIEW"]
direction TB
R1[Pre-Flight: gather context]
R2[Try-Fix: 4 models sequentially]
R3[Report: write recommendation]
R1 --> R2 --> R3
end
subgraph post ["📊 POST"]
direction TB
P1[Post AI summary comment]
P2[Apply agent labels]
P1 --> P2
end
gate --> review --> post
style gate fill:#1e1e2e,stroke:#89b4fa,stroke-width:2px,color:#cdd6f4
style review fill:#1e1e2e,stroke:#cba6f7,stroke-width:2px,color:#cdd6f4
style post fill:#1e1e2e,stroke:#a6e3a1,stroke-width:2px,color:#cdd6f4
Every fix is tested. The agent doesn't theorize—it implements each approach, runs tests, and reports what works.
For the best results, start in plan mode to create and review a detailed plan before execution:
- Enter plan mode: Press Shift+Tab or use /plan
- Request a review plan: /plan review PR #12345 - create a detailed plan for the review
- Review the plan: Copilot will create a structured plan. Review the steps and make adjustments.
- Exit plan mode: Press Shift+Tab to switch back to execution mode
- Execute the plan: proceed with the plan
copilot
# Ask it to review a PR
please review PR #12345

| Phrase | Description |
|---|---|
| "Review PR #XXXXX" | Review an existing PR with independent analysis |
| "Work on PR #XXXXX" | Investigate and implement a fix |
| "Fix issue #XXXXX" | Works whether or not a PR exists |
The pipeline is orchestrated by Review-PR.ps1:
.\Review-PR.ps1 -PRNumber 33687
.\Review-PR.ps1 -PRNumber 33687 -Platform ios

Creates a review branch from main and squash-merges the PR onto it. If there are merge conflicts, posts a comment on the PR and exits.
Runs verify-tests-fail.ps1 directly (no Copilot agent — pure script):
- Detects tests in the PR diff via Detect-TestsInDiff.ps1
- Verifies tests fail without the fix (baseline)
- Verifies tests pass with the fix applied
Results:
- PASSED: tests catch the bug ✅
- SKIPPED: no tests detected in PR (recommends @copilot write tests for this PR)
- FAILED: tests didn't behave as expected ❌
The gate result is posted as a PR comment and passed as context to Step 2.
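The three gate outcomes above can be read as a small decision rule. The following is a minimal Python sketch of that rule only; `GateResult` and `classify_gate` are hypothetical names introduced for illustration, and the real `verify-tests-fail.ps1` script may apply additional checks.

```python
from enum import Enum


class GateResult(Enum):
    PASSED = "PASSED"
    SKIPPED = "SKIPPED"
    FAILED = "FAILED"


def classify_gate(tests_detected: bool,
                  fails_without_fix: bool,
                  passes_with_fix: bool) -> GateResult:
    """Map the three gate checks onto one result (hypothetical helper)."""
    if not tests_detected:
        # Nothing in the PR diff to verify, so the gate cannot run.
        return GateResult.SKIPPED
    if fails_without_fix and passes_with_fix:
        # The tests fail on the baseline and pass with the fix: they catch the bug.
        return GateResult.PASSED
    # Any other combination means the tests did not behave as expected.
    return GateResult.FAILED
```

For example, a PR whose tests already pass without the fix would be classified as `GateResult.FAILED`, because those tests do not demonstrate the bug.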
Invokes Copilot CLI with the prompt "Use a skill to review PR #XXXXX", which triggers the pr-review skill. This runs three phases:
Pre-Flight — Reads the linked issue, PR description, and comments. Classifies changed files. No code analysis — just context gathering. Output: pre-flight/content.md
Try-Fix (mandatory) — Four models explore independent fix ideas sequentially, each working in a bubble with zero influence from the PR's approach:
| Order | Model |
|---|---|
| 1 | Claude Opus 4.6 |
| 2 | Claude Opus 4.7 |
| 3 | GPT-5.3-Codex |
| 4 | GPT-5.5 |
Each model generates an independent fix — the question is always "how would you solve this from scratch?" — implements it, and runs tests. Models receive code-review hints and reference material for smarter context, and only review the PR's fix to ensure their approach is genuinely different — not to anchor on it. Between attempts, the baseline is restored via EstablishBrokenBaseline.ps1 -Restore.
After all 4 attempts, cross-pollination rounds let each model see all attempt summaries and propose any new ideas. The rounds repeat until every model answers "NO NEW IDEAS" (max 3 rounds).
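The control flow of the attempts and the cross-pollination rounds can be sketched as follows. This is an illustrative Python outline, not the skill's actual implementation: `run_try_fix` and its three callback parameters (`attempt`, `restore_baseline`, `propose`) are hypothetical stand-ins for invoking a model, running `EstablishBrokenBaseline.ps1 -Restore`, and asking a model for new ideas. How new ideas are later implemented and tested is not modeled here.

```python
from typing import Callable, List, Tuple

MODELS = ["Claude Opus 4.6", "Claude Opus 4.7", "GPT-5.3-Codex", "GPT-5.5"]
MAX_ROUNDS = 3
NO_NEW_IDEAS = "NO NEW IDEAS"


def run_try_fix(
    attempt: Callable[[str], str],
    restore_baseline: Callable[[], None],
    propose: Callable[[str, List[str]], str],
) -> Tuple[List[str], int]:
    """Run one attempt per model, then bounded cross-pollination rounds."""
    summaries: List[str] = []
    for index, model in enumerate(MODELS):
        if index > 0:
            # Restore the broken baseline between attempts so each model
            # starts from the same state.
            restore_baseline()
        summaries.append(attempt(model))

    rounds_run = 0
    while rounds_run < MAX_ROUNDS:
        rounds_run += 1
        # Every model sees a copy of all summaries gathered so far.
        replies = [propose(model, list(summaries)) for model in MODELS]
        new_ideas = [r for r in replies if r.strip().upper() != NO_NEW_IDEAS]
        if not new_ideas:
            break  # Every model answered "NO NEW IDEAS".
        summaries.extend(new_ideas)
    return summaries, rounds_run
```

The round cap guarantees termination even if a model keeps proposing ideas, which is why the loop counts rounds instead of waiting for consensus alone.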
The best passing fix is selected by comparing simplicity, robustness, and codebase consistency. Output: try-fix/content.md
Report (Expert Hardening) — After the best fix is selected, the expert reviewer evaluates it for hardening needs: are tests thorough, is the code robust, are edge cases covered? If improvements are identified, they're documented as feedback to improve future try-fix generation. Writes the final recommendation (APPROVE or REQUEST CHANGES). Output: report/content.md
Runs post-ai-summary-comment.ps1 to post the review as a PR comment combining gate result, try-fix comparison, and recommendation.
Runs Update-AgentLabels.ps1 to parse the phase output files and apply labels:
| Label | Meaning |
|---|---|
| s/agent-reviewed | PR was reviewed (always applied) |
| s/agent-approved | Agent recommends approval |
| s/agent-changes-requested | Agent recommends changes |
| s/agent-review-incomplete | Agent couldn't complete all phases |
| s/agent-gate-passed | Tests catch the bug |
| s/agent-gate-failed | Could not verify tests catch the bug |
| s/agent-fix-win | Agent found a better fix |
| s/agent-fix-pr-picked | PR's fix was best |
| s/agent-fix-implemented | (Manual) Author adopted agent's suggestion |
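To make the label table concrete, here is a small Python sketch of how phase results might map onto labels. It is an assumption-laden illustration: `select_labels` and its parameters are hypothetical, the treatment of a SKIPPED gate (no gate label) is a guess, and the real `Update-AgentLabels.ps1` derives its inputs by parsing the phase output files.

```python
from typing import List, Optional


def select_labels(gate: str,
                  recommendation: Optional[str],
                  winner: Optional[str],
                  phases_complete: bool) -> List[str]:
    """Return the labels implied by the phase results (hypothetical helper)."""
    labels = ["s/agent-reviewed"]  # Always applied.
    if not phases_complete:
        labels.append("s/agent-review-incomplete")

    if gate == "PASSED":
        labels.append("s/agent-gate-passed")
    elif gate == "FAILED":
        labels.append("s/agent-gate-failed")
    # Assumption: a SKIPPED gate adds no gate label.

    if recommendation == "APPROVE":
        labels.append("s/agent-approved")
    elif recommendation == "REQUEST CHANGES":
        labels.append("s/agent-changes-requested")

    if winner == "agent":
        labels.append("s/agent-fix-win")
    elif winner == "pr":
        labels.append("s/agent-fix-pr-picked")

    # s/agent-fix-implemented is applied manually, so it is never returned here.
    return labels
```

The `winner` values `"agent"` and `"pr"` are placeholders chosen for this sketch; the actual output files may encode the selected fix differently.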
All phase output is written to CustomAgentLogsTmp/PRState/{PRNumber}/PRAgent/:
gate/content.md ← Gate result
pre-flight/content.md ← PR context and file classification
try-fix/content.md ← Fix comparison table
attempt-{N}/ ← Per-model attempt details
report/content.md ← Final recommendation
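If you want to read these files from your own tooling, the layout above can be expressed as a path builder. This is a minimal Python sketch; `phase_output_path` is a hypothetical helper, and it covers only the four `content.md` files listed, not the per-model attempt folders.

```python
from pathlib import PurePosixPath

PHASES = ("gate", "pre-flight", "try-fix", "report")


def phase_output_path(pr_number: int, phase: str) -> PurePosixPath:
    """Build the content.md path for one phase of one PR (hypothetical helper)."""
    if phase not in PHASES:
        raise ValueError(f"unknown phase: {phase!r}")
    root = PurePosixPath("CustomAgentLogsTmp/PRState")
    return root / str(pr_number) / "PRAgent" / phase / "content.md"
```

For PR 33687, `phase_output_path(33687, "gate")` yields `CustomAgentLogsTmp/PRState/33687/PRAgent/gate/content.md`.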
| Task | Use Instead |
|---|---|
| Just run tests manually | Sandbox Agent |
| Only write tests | Write Tests Agent |
| Extract lessons from a completed PR | Learn From PR Agent |
- Agent Skills - All available skills
- Write Tests Agent - Test creation
- Learn From PR Agent - Post-PR learning
- Sandbox Agent - Manual testing