3 Jul 2026

The risk is not generated code. It is confidence without evidence.

AI can help a software team produce a plausible first draft faster. Approval still depends on evidence. A pull request that reads well, passes continuous integration and arrives with tests can still fail to show that the risky behaviour is safe.

Near the end of the day, a small pull request lands in the review queue. It changes retry behaviour around an external payment provider. The changed lines are easy to read. CI has passed. The summary says the change improves resilience when the provider is temporarily unavailable. A few unit tests simulate a failed call, then a successful one. The retry code runs. The code follows a familiar pattern. Nothing looks careless.

There is plenty for a reviewer to accept. The code is tidy, the summary is coherent, and the tests give the team something concrete to point at. In a busy queue, that pull request would probably keep moving.

Then the timeout case exposes the problem. The payment provider receives the request and completes the charge. The local service times out before confirmation arrives. From the service’s point of view, there is no successful response, so the retry sends another payment request. Unless the payment flow handles that case through the provider contract, idempotency keys, duplicate detection or reconciliation, the second request can become a second charge.
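The timeout case can be reproduced in a few lines. This is a minimal sketch, not the code from any real pull request: `FakeProvider`, `ProviderTimeout` and `charge_with_naive_retry` are hypothetical names, and the fake provider assumes the charge is recorded before the confirmation is lost.

```python
class ProviderTimeout(Exception):
    """Raised when no confirmation arrives in time (hypothetical)."""


class FakeProvider:
    """In-memory stand-in for a payment provider; not a real API."""

    def __init__(self, lose_first_response: bool = True) -> None:
        self.charges = []  # list of (order_id, amount_cents)
        self._lose_next_response = lose_first_response

    def charge(self, order_id: str, amount_cents: int) -> str:
        # The charge completes on the provider side first...
        self.charges.append((order_id, amount_cents))
        # ...then the confirmation is lost on its way back.
        if self._lose_next_response:
            self._lose_next_response = False
            raise ProviderTimeout("no confirmation received")
        return "ok"


def charge_with_naive_retry(provider, order_id: str, amount_cents: int,
                            attempts: int = 2) -> str:
    """Retry on timeout with no idempotency protection."""
    for attempt in range(attempts):
        try:
            return provider.charge(order_id, amount_cents)
        except ProviderTimeout:
            if attempt == attempts - 1:
                raise


provider = FakeProvider()
result = charge_with_naive_retry(provider, "order-1", 500)
print(result, len(provider.charges))  # prints: ok 2
```

The caller sees a single successful result, while the fake provider has recorded two charges for the same order.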

Retrying can be safe, but the pull request has to show why. This one shows that the retry code runs. It does not show whether the payment flow stays safe when a timeout, partial failure or duplicate request occurs. The team has tests, and they prove too little: the tests, summary and approvals support less than the team thinks.

That gap has a name: verification debt. It is the distance between the confidence a team appears to have and the evidence it actually holds.

AI did not invent this failure mode. Software teams have always trusted tidy changes too quickly. AI makes the miss easier because it can produce code, tests, summaries and documentation that all carry the same assumption. The review can look independently checked even when each artefact is proving the same narrow thing.

The review looked complete

AI-assisted software development can make thin review material look stronger than it is. The pull request may arrive with tidy code, plausible tests and a confident summary. Documentation may appear alongside the change. The reviewer sees several items that seem to agree.

A model asked to “add retries for failed payment calls” may generate the retry loop, tests that mock a failed call followed by a successful one, and a summary saying resilience has improved. They all line up. None of them necessarily asks whether the first call may already have produced a side effect.
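A generated test of that kind might look like the sketch below. It is illustrative only: `call_with_retry` is a hypothetical helper, and the mock comes from Python's standard `unittest.mock` module.

```python
from unittest.mock import Mock


def call_with_retry(send, attempts: int = 2):
    """Hypothetical helper: call `send` again if it raises TimeoutError."""
    for attempt in range(attempts):
        try:
            return send()
        except TimeoutError:
            if attempt == attempts - 1:
                raise


def test_retries_after_failure() -> None:
    # The mock fails once, then succeeds. It has no notion of whether the
    # "failed" call already moved money on the provider side.
    send = Mock(side_effect=[TimeoutError(), "ok"])
    assert call_with_retry(send) == "ok"
    assert send.call_count == 2


test_retries_after_failure()
```

The test passes, and it is correct as far as it goes. It confirms that the loop retries. It says nothing about side effects of the first call, because the mock has none.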

The risk is that several generated artefacts can repeat the same blind spot. Tests, summaries and documentation may appear to corroborate the implementation while relying on the same untested assumption.

A green CI dashboard says the code compiled, tests ran and the usual workflow completed. It says much less about whether the change respects the requirement, contract, system invariant, security control or operational rule that carries the real risk.

The 2024 DORA report gives useful background: AI adoption can increase individual productivity while carrying trade-offs in delivery stability and throughput. The narrower point here is about review evidence. Faster drafting does not make the approval decision better supported.

The green check still has value. It reports on the checks the team chose to run.

Use failure impact, not diff size

The size of the code change is a poor guide when the change touches money, identity, data deletion, infrastructure permissions, regulated data, regulated business processes or customer-facing automation.

A one-line access-management policy change may deserve more scrutiny than a large documentation update. A small retry loop around payments may carry more risk than a broad refactor in an internal tool. A migration script can look harmless until it touches data that cannot be reconstructed.

Start the review by asking what happens if the change is wrong. Who is affected? What breaks? What evidence would make the safety claim credible beyond the code being edited? Who would notice failure, how quickly could the team recover, and who has authority to accept the remaining risk?

Many teams can handle this with a simple risk classification that reviewers can enforce. Low, medium and high are usually enough. That classification can live in the pull request template rather than in a meeting.

For sensitive changes, five questions are enough:

What could go wrong if this change is wrong?
Which requirement, contract, rule or policy is being checked?
What shows that it still holds?
What signal would detect failure after release?
Who accepts the remaining risk?
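One way to make those five questions enforceable is to treat the answers as a small record that a template or bot can check for gaps. The sketch below is an assumption about how a team might do this; the `RiskEvidence` class and its field names are invented for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class RiskEvidence:
    """Answers to the five questions; field names are illustrative."""
    failure_impact: str = ""    # What could go wrong if this change is wrong?
    rule_checked: str = ""      # Which requirement, contract, rule or policy?
    evidence: str = ""          # What shows that it still holds?
    detection_signal: str = ""  # What would detect failure after release?
    risk_owner: str = ""        # Who accepts the remaining risk?


def unanswered(record: RiskEvidence) -> list:
    """Return the names of the questions left blank."""
    return [f.name for f in fields(record)
            if not getattr(record, f.name).strip()]


record = RiskEvidence(
    failure_impact="Customer charged twice",
    rule_checked="One charge per payment attempt",
    evidence="Contract test for duplicate requests",
)
print(unanswered(record))  # prints: ['detection_signal', 'risk_owner']
```

A check like this only confirms that the questions were answered. Whether the answers are credible is still a reviewer's judgement.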

Low-risk work should stay light. A copy edit, harmless refactor, small UI adjustment or throwaway internal script should not be treated as a production-risk event. Teams lose credibility when every change is dragged through the same heavy process.

A change should leave ordinary review when it changes customer-visible behaviour, touches payments, permissions, deletion or regulated data, depends on a third-party contract, creates irreversible effects, or would require support, finance, security, site reliability engineering or legal to help clean up a failure.

Medium-risk work needs meaningful tests, assumptions written down in the pull request, an understood rollback plan and a reviewer who knows the relevant part of the system.

High-risk work needs evidence strong enough for the consequence of being wrong. That may mean a contract test, review by someone responsible for the affected system, a canary deployment, an alert, or someone clearly accepting the remaining risk.

Some changes should start at medium or high risk by default: payments, identity, permissions, data deletion, regulated data, infrastructure access, irreversible operations and external integrations where nobody is sure what happens on failure. The author can propose the level, but code owners or named domain owners need authority to raise it.
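The rule that the author proposes a level while defaults and owners can only raise it can be sketched as follows. The path patterns and the names `Risk`, `DEFAULT_FLOORS` and `effective_risk` are hypothetical; a real team would maintain its own list of sensitive areas.

```python
from enum import IntEnum
from fnmatch import fnmatch
from typing import Optional


class Risk(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical path patterns mapped to the minimum level they start at.
DEFAULT_FLOORS = {
    "services/payments/*": Risk.HIGH,
    "services/identity/*": Risk.HIGH,
    "infra/iam/*": Risk.HIGH,
    "migrations/*": Risk.MEDIUM,
}


def default_floor(changed_paths) -> Risk:
    """Highest default level triggered by any changed path."""
    floor = Risk.LOW
    for path in changed_paths:
        for pattern, level in DEFAULT_FLOORS.items():
            if fnmatch(path, pattern):
                floor = max(floor, level)
    return floor


def effective_risk(changed_paths, proposed: Risk,
                   owner_level: Optional[Risk] = None) -> Risk:
    """The author proposes; defaults and owners can only raise the level."""
    levels = [proposed, default_floor(changed_paths)]
    if owner_level is not None:
        levels.append(owner_level)
    return max(levels)
```

Under these assumptions, `effective_risk(["services/payments/retry.py"], Risk.LOW)` returns `Risk.HIGH`, because the payments path sets the floor regardless of what the author proposed.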

The review stays light where failure is cheap and gets stricter where the organisation has less room to be wrong.

Name what must remain true

For sensitive changes, the review usually has to answer three questions.

First, is the change internally consistent? The code compiles, linting passes, unit tests run, generated tests make sense, and the pull-request summary matches the implementation. This evidence is necessary. It mostly shows that the change is coherent.

Second, does the change satisfy a requirement outside the code? That may be a business rule, API contract, system invariant, security requirement, privacy obligation or service-level objective. Contract tests, schema checks, static analysis and architecture rules can help here. They make the change answer to something the model did not invent.

Third, what happens when the change reaches production? This may require review by someone responsible for the affected part of the system, a staged rollout, monitoring, rollback steps, a named owner and a clear decision on who accepts the remaining risk. This level of evidence is needed when failure would reach customers, money, safety, privacy, security or live operations.

Most low-risk changes do not need all of this. Sensitive changes often do.

In the retry example, internal consistency answers only part of the question. The unit tests show that the retry code runs. They do not show whether the provider treats the second request as a duplicate, whether the service always sends an idempotency key, or whether an alert would catch duplicate charges before a customer complains.

The pull request should name the standard the change is being checked against. “Does the retry logic retry failed calls?” stays inside the implementation. The stronger review question asks whether the payment flow can retry safely under timeout, partial failure and duplicate-request scenarios without creating duplicate transactions.

That sends the reviewer to the provider contract, the idempotency rule, the reconciliation process, the logs, metrics, alerts and release plan. It also makes clear what would count as evidence.

In a mature payment system, this risk is usually controlled through provider contracts, idempotency keys, reconciliation logic and monitoring. The safety of the payment flow depends on assumptions outside the code being changed.

Once the check is named, the team can choose the right evidence. It might need a contract test, an end-to-end or integration test, review by someone responsible for the affected system, a staged rollout or an alert. The check has to test what must remain true, not merely confirm that the implementation behaves as written.
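For the payment example, one such piece of evidence is a post-release signal that looks for duplicate charges. The sketch below is a simplification under a stated assumption: that an order should be charged a given amount at most once. The `Charge` record and `find_duplicate_charges` are hypothetical names, and a real reconciliation job would read from the provider's records rather than an in-memory list.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Charge(NamedTuple):
    charge_id: str
    order_id: str
    amount_cents: int


def find_duplicate_charges(charges: Iterable[Charge]) -> dict:
    """Return (order_id, amount_cents) pairs charged more than once."""
    counts = Counter((c.order_id, c.amount_cents) for c in charges)
    return {key: n for key, n in counts.items() if n > 1}


ledger = [
    Charge("c1", "order-1", 500),
    Charge("c2", "order-1", 500),  # the retry that became a second charge
    Charge("c3", "order-2", 900),
]
print(find_duplicate_charges(ledger))  # prints: {('order-1', 500): 2}
```

A non-empty result is the kind of signal an alert could be built on, so that the team learns about a duplicate charge before the customer reports it.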

Security teams have already learnt a version of this lesson. A threat model gives reviewers something concrete to test against. NIST’s Secure Software Development Framework puts secure development practices into the software development life cycle rather than leaving them as clean-up work at the end. For high-risk AI-assisted changes, the safety claim should appear before approval rather than after an incident.

A passing test can prove too little

A team can have clean code and still lack evidence that the behaviour that matters has been preserved. Tests can pass while leaving the relevant requirement untouched. A tidy pull request can be approved before anyone knows whether the change is safe under the failure mode that matters.

In real reviews, the signs are usually ordinary. Tests pass, but nobody can say which requirement they prove. Generated tests mirror the generated code. Approval depends more on readability than on the failure cases that matter. The failure often reaches reliability engineering, support or customers before the team fully understands it.

A stronger test for the payment case would simulate the timeout itself. The provider completes the charge. The local service times out before confirmation. The retry sends a second request. The expected result should be explicit: the second request is treated as the same payment attempt, duplicate charging is prevented, and the system records enough information for reconciliation.
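A sketch of such a test follows. It assumes the provider deduplicates requests on an idempotency key, which has to be confirmed against the real provider contract. `IdempotentFakeProvider`, `ProviderTimeout` and `charge_with_retry` are hypothetical names, and the fake only models the behaviour the test depends on.

```python
import uuid


class ProviderTimeout(Exception):
    """No confirmation arrived in time (hypothetical)."""


class IdempotentFakeProvider:
    """In-memory stand-in; assumes deduplication on an idempotency key."""

    def __init__(self) -> None:
        self.charges = {}       # idempotency key -> amount in cents
        self.requests_seen = 0
        self._drop_first_response = True

    def charge(self, key: str, amount_cents: int) -> str:
        self.requests_seen += 1
        # A repeated key is treated as the same payment attempt.
        self.charges.setdefault(key, amount_cents)
        if self._drop_first_response:
            self._drop_first_response = False
            raise ProviderTimeout("charge completed, confirmation lost")
        return "ok"


def charge_with_retry(provider, amount_cents: int, attempts: int = 2) -> str:
    # One key per payment attempt, reused on every retry.
    key = str(uuid.uuid4())
    for attempt in range(attempts):
        try:
            return provider.charge(key, amount_cents)
        except ProviderTimeout:
            if attempt == attempts - 1:
                raise


def test_timeout_after_charge_does_not_double_charge() -> None:
    provider = IdempotentFakeProvider()
    assert charge_with_retry(provider, 500) == "ok"
    assert provider.requests_seen == 2  # the retry really happened
    assert len(provider.charges) == 1   # but only one charge exists


test_timeout_after_charge_does_not_double_charge()
```

This test exercises the failure mode rather than the happy path of the retry loop. It still runs against a fake, so it shows that the service reuses the key, not that the real provider honours it, and it does not cover what is recorded for reconciliation. Those claims need a contract test or provider documentation.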

A test can be correct and still leave the real risk untouched.

Access control can fail in the same way. A generated test may confirm that an authorisation function returns a value without proving that the access model is sound. A data migration test may pass against a small sample without proving completeness or reversibility. A cache test may show that a response is returned while stale data still violates a product rule or privacy obligation.

Security research on AI-assisted code generation points in the same direction. Pearce and colleagues found insecure Copilot outputs across common software security weaknesses in “Asleep at the Keyboard?”. A later targeted replication study found lower rates of vulnerable suggestions, but the risk had not disappeared. The rates vary by tool, language, prompt and review environment, so the practical claim should stay narrow: plausible generated code needs checks strong enough for the consequence of being wrong.

Human review then has to reconstruct the missing context from material that looks finished. The reviewer is reading code while also reasoning about contracts, side effects, data ownership, failure modes, logs, alerts and recovery. Line-by-line review catches local defects. The safety of the payment flow also depends on provider behaviour, idempotency keys, reconciliation and monitoring.

Make the risk decision visible

An AI-generated first draft leaves judgement with the team. In some reviews, it makes judgement harder because the first draft often arrives cleaner than a human rough draft would.

Someone still has to name the rule that must not break, test the failure case, add the signal that would show failure, and make ownership visible.

The author owns the change, but other people may understand risks the author cannot see alone. Reviewers, code owners, domain owners, security reviewers, privacy specialists, reliability engineers and product owners each see different parts of the risk. In a good review, the pull request and release plan show who reviewed which part of the risk. A green check should not hide that.

Generated tests should be treated as useful drafts. The review should ask what the test proves. Does it check the requirement, or does it mostly confirm the implementation? Does it exercise the failure mode that matters? Does it test something the generated code did not assume?

Incident reviews should examine missing evidence as well as code defects. Which missing requirement, test, review, alert or rollback step would have exposed the failure before release? Many incidents happen despite available checks because those checks proved less than the team believed.

Confidence needs evidence

Developers will keep using AI tools to draft code, tests, summaries and documentation. That can help teams ship routine changes faster. The risk is that outputs become polished before teams have produced enough evidence to trust them.

Routine edits and high-risk changes should not use the same review process. Some changes can move quickly. Others need stronger evidence, a named check and a clear owner for the remaining risk.

In the retry example, the passing local checks provide only weak evidence that the payment flow is safe. Stronger evidence has to cover the provider contract, idempotency design, monitoring and recovery plan.

Confidence in a software change has always depended on evidence. AI changes how quickly a plausible change can be produced. The obligation in the pull request remains the same: show what must remain true.

The reviewer looking at that late-afternoon pull request should not have to infer payment safety from tidy code and a green check. Before approval, the pull request should show what is being protected, what evidence supports it, what would reveal failure and who accepts the remaining risk. AI can help produce the draft. Engineering still owns the confidence.

When the Green Check Proves the Wrong Thing