May 12, 2026 · 9 min read
AI-generated PRs are quietly breaking your business rules. Here is the pattern.
Cursor, Copilot, and Claude produce clean diffs that pass tests and remove load-bearing guards nobody documented. This post covers three categories of rule failure we keep seeing, why existing defenses (code review, tests, linters) aren’t enough, and what does work.
The pattern
Engineers building with Claude, Cursor, Copilot, Codex, and their peers have all seen some version of this in the last eighteen months:
A senior engineer asks the assistant to refactor a route. The assistant produces a clean diff: tests pass, types check, code reads well. The PR merges. A week later customer support files a ticket: refunds over the original charge amount are succeeding. Someone digs in and finds that the refactor quietly removed the refundAmount <= originalCharge guard. The guard was the only thing enforcing a rule that had been written into the Jira ticket two years ago and never made it into a test.
Nobody made a mistake exactly. The assistant did what it was asked. The reviewer read a diff that looked reasonable. The tests covered the cases they were written to cover. The rule existed; it just wasn’t the kind of artifact that survives a refactor.
This pattern is not an indictment of AI coding tools. It’s a structural consequence of the speed and volume those tools enable. The same pattern existed before AI — humans refactoring also removed implicit guards — but it used to happen at the pace of human typing, and human reviewers had more time per diff. The new pace exposes a category of bug that used to be rare enough to handle case-by-case.
Why AI tools make this pattern worse, specifically
Three properties of AI-generated code combine to surface implicit-rule failures faster than the pre-AI world did:
- Confidence without context. The assistant doesn’t know that line 47 was a load-bearing rule from ticket SALES-412. It can read the line, but it can’t read the Jira ticket history that explains why the line was added two years ago. So the assistant treats the line like any other guard: reasonable to keep, reasonable to remove.
- Bigger diffs per review unit. A pre-AI PR was often one logical change at a time. An AI-assisted PR frequently includes the change plus opportunistic cleanup. Reviewers read the same diffs they used to, but each diff has more surface area, and the “cleanup” parts get less scrutiny than the “feature” parts.
- Type-checked code that lies. AI-generated code rarely has type errors. The tools optimize for code that compiles. But “compiles cleanly” says nothing about “preserves the business invariants the original code enforced.” A guard clause can be removed without breaking any type signature.
Three categories of rule failure we keep seeing
Engineering teams that have been running AI-assisted workflows for a year or more report the failures clustering into three shapes:
Silent guard removal
The assistant simplifies a handler and drops the guard clause that was enforcing a business rule. The remaining code is cleaner. The rule is gone.
Hardest to catch because the absence of a line of code is the bug. Code review focuses on what was added; what was removed gets less attention. Tests don’t catch it unless a test was specifically written for the removed case, and most rules of this kind never got a dedicated test.
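As a minimal sketch of this failure, assume a hypothetical refund function before and after a "simplification". The names (Charge, refundBefore, refundAfter, rejectsOverRefund) are illustrative and not taken from any real codebase. Both versions share a signature and pass the happy-path check; only a check written specifically for the rule tells them apart.

```typescript
// Hypothetical types and functions, for illustration only.
interface Charge {
  id: string;
  amountCents: number;
}

interface RefundResult {
  chargeId: string;
  refundedCents: number;
}

// Before: this guard is the only thing enforcing
// "refunds never exceed the original charge".
function refundBefore(charge: Charge, refundAmountCents: number): RefundResult {
  if (refundAmountCents <= 0) {
    throw new Error("refund must be positive");
  }
  if (refundAmountCents > charge.amountCents) {
    throw new Error("refund exceeds original charge");
  }
  return { chargeId: charge.id, refundedCents: refundAmountCents };
}

// After: same signature, same happy-path behavior, one guard fewer.
function refundAfter(charge: Charge, refundAmountCents: number): RefundResult {
  if (refundAmountCents <= 0) {
    throw new Error("refund must be positive");
  }
  return { chargeId: charge.id, refundedCents: refundAmountCents };
}

const charge: Charge = { id: "ch_1", amountCents: 5000 };

// The check that already existed (happy path) gives the same answer for both.
const happyBefore = refundBefore(charge, 2000).refundedCents;
const happyAfter = refundAfter(charge, 2000).refundedCents;

// The check nobody wrote: an over-refund must be rejected.
function rejectsOverRefund(
  fn: (c: Charge, amountCents: number) => RefundResult
): boolean {
  try {
    fn(charge, 9000);
    return false; // the over-refund went through
  } catch {
    return true; // the rule held
  }
}

const beforeHolds = rejectsOverRefund(refundBefore); // true
const afterHolds = rejectsOverRefund(refundAfter); // false
```

Nothing in the type system or the happy-path check distinguishes the two versions, which is why the removal survives review.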
Rule contradiction
The assistant adds a new behavior that contradicts an existing rule somewhere else in the codebase. Common examples: setting a rate limit in one handler that disagrees with the documented limit in another, or letting a discount stack with another discount that was supposed to be exclusive.
The new code is correct in isolation. The bug is that two places in the codebase now disagree.
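One way to picture the disagreement, assuming each handler's limit can be extracted into a comparable record (in practice that extraction is the hard part): group the declared values by rule and flag any rule with more than one value. The DeclaredLimit shape, the file paths, and findContradictions are hypothetical names for this sketch.

```typescript
// Hypothetical record of a limit that some handler enforces.
interface DeclaredLimit {
  rule: string; // e.g. "requests-per-minute"
  source: string; // file that declares or enforces the value
  value: number;
}

// Returns the rules for which at least two sources disagree on the value.
function findContradictions(
  declared: DeclaredLimit[]
): Record<string, DeclaredLimit[]> {
  const byRule: Record<string, DeclaredLimit[]> = {};
  for (const d of declared) {
    const group = byRule[d.rule] ?? [];
    group.push(d);
    byRule[d.rule] = group;
  }

  const contradictions: Record<string, DeclaredLimit[]> = {};
  for (const rule of Object.keys(byRule)) {
    const group = byRule[rule] ?? [];
    const values = group.map((d) => d.value);
    if (values.some((v) => v !== values[0])) {
      contradictions[rule] = group;
    }
  }
  return contradictions;
}

// Illustrative data: two handlers disagree on the rate limit, agree on page size.
const declared: DeclaredLimit[] = [
  { rule: "requests-per-minute", source: "handlers/orders.ts", value: 100 },
  { rule: "requests-per-minute", source: "handlers/exports.ts", value: 500 },
  { rule: "max-page-size", source: "handlers/orders.ts", value: 50 },
  { rule: "max-page-size", source: "handlers/exports.ts", value: 50 },
];

// Only "requests-per-minute" is reported, with both declaring sources attached.
const conflicts = findContradictions(declared);
```

Each handler on its own would pass review; the contradiction only appears when the two declarations are placed side by side.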
Documentation divergence
The behavior changes; the docs don’t. The reviewer approves the diff; the docs were in a separate file the diff didn’t touch. Customers who integrate against the published behavior break.
This one is the most user-visible of the three because customers see the diff between the docs and reality before the team notices. It’s also the most embarrassing because the fix is “update the docs that you should have updated.”
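A rough sketch of a pre-merge check for this case, assuming a team maintains a mapping from behavior-defining files to the docs that describe them. The docsFor mapping, the paths, and docsLeftBehind are hypothetical. The check flags docs that a PR left untouched even though it changed their source file; it cannot tell whether the docs actually needed to change.

```typescript
// Hypothetical mapping from source files to the docs that describe their behavior.
const docsFor: Record<string, string[]> = {
  "src/handlers/refunds.ts": ["docs/api/refunds.md"],
  "src/handlers/rate-limit.ts": ["docs/api/limits.md", "docs/guides/quotas.md"],
};

// Given the files a PR changed, list the mapped docs the PR did not touch.
function docsLeftBehind(changedFiles: string[]): string[] {
  const missing: string[] = [];
  for (const file of changedFiles) {
    for (const doc of docsFor[file] ?? []) {
      const touched = changedFiles.indexOf(doc) !== -1;
      const alreadyListed = missing.indexOf(doc) !== -1;
      if (!touched && !alreadyListed) {
        missing.push(doc);
      }
    }
  }
  return missing;
}

// A PR that changes the refund handler but not its docs gets flagged.
const flagged = docsLeftBehind(["src/handlers/refunds.ts", "src/utils/format.ts"]);

// A PR that updates both produces no findings.
const clean = docsLeftBehind(["src/handlers/refunds.ts", "docs/api/refunds.md"]);
```

This is a prompt for the reviewer rather than a verdict: it moves the docs file into view at the moment the behavior changes.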
Why existing defenses aren’t enough
Teams reach for the obvious tools first, and none of them are wrong — they’re just incomplete:
- Code review. Still useful, but the bottleneck is reviewer attention, not reviewer skill. A senior engineer who reads twelve PRs a day cannot mentally simulate every implicit rule in every diff. AI-assisted PRs make this worse, not better.
- Unit tests. Only catch the rules someone remembered to write tests for. The rules that hurt are, by definition, the ones nobody wrote a test for, either because they were “obvious” or because the rule predates the current test suite.
- Contract tests against an OpenAPI spec. Better than nothing, but the spec was usually generated from the code and so encodes the same blind spots the code has. And specs go stale.
- Linters and AST rules. Powerful for syntactic patterns (“always check this header”), useless for semantic ones (“refunds never exceed the original charge”).
What helps
The defense that scales is making the implicit rules explicit, keeping them in one place, and checking new code against them before merge.
Concretely:
- Build a registry of the business rules your API enforces today. Plain-English statements, one per rule, with provenance: the ticket that authorized it and the file that enforces it. (For more on what makes a rule a rule, see What is an API rule registry?)
- Check every PR against the registry. When a diff would break or contradict a rule, post the finding inline so the reviewer sees it next to the code.
- Tie production drift back to the change that caused it. When the registry says a rule is broken in production, the team needs to know which PR did it without a 90-minute git-blame session.
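The first two steps can be sketched as a data structure plus a deliberately naive check. Everything here is an assumption for illustration: the Rule and FileDiff shapes, the guardSnippet field, the file path, and the text-matching strategy. A real check would need to reason about semantics, not substrings, and this is not a description of how any particular product implements it.

```typescript
// Hypothetical registry entry: a plain-English rule with provenance.
interface Rule {
  id: string;
  statement: string; // the rule in plain English
  ticket: string; // the ticket that authorized it
  enforcedIn: string; // the file that enforces it
  guardSnippet: string; // assumption: text that identifies the guard in code
}

// Hypothetical, simplified view of one file's changes in a PR.
interface FileDiff {
  path: string;
  removedLines: string[];
  addedLines: string[];
}

interface Finding {
  ruleId: string;
  path: string;
  message: string;
}

// Naive check: flag a rule when its guard text is removed from the
// enforcing file and does not reappear among that file's added lines.
function checkDiff(registry: Rule[], diff: FileDiff[]): Finding[] {
  const findings: Finding[] = [];
  for (const rule of registry) {
    for (const file of diff) {
      if (file.path !== rule.enforcedIn) continue;
      const hasGuard = (line: string) => line.indexOf(rule.guardSnippet) !== -1;
      if (file.removedLines.some(hasGuard) && !file.addedLines.some(hasGuard)) {
        findings.push({
          ruleId: rule.id,
          path: file.path,
          message: `Removes the guard for "${rule.statement}" (${rule.ticket})`,
        });
      }
    }
  }
  return findings;
}

const registry: Rule[] = [
  {
    id: "refund-cap",
    statement: "Refunds never exceed the original charge",
    ticket: "SALES-412",
    enforcedIn: "src/handlers/refunds.ts", // hypothetical path
    guardSnippet: "refundAmount <= originalCharge",
  },
];

// A diff that drops the guard yields one finding for the reviewer.
const findings = checkDiff(registry, [
  {
    path: "src/handlers/refunds.ts",
    removedLines: ["if (!(refundAmount <= originalCharge)) throw new RefundError();"],
    addedLines: ["return issueRefund(chargeId, refundAmount);"],
  },
]);
```

Because each finding carries the rule statement and its ticket, the reviewer sees why the line mattered, which is the context the diff alone does not provide.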
The first step is the hard one, because building the registry by hand is a multi-quarter project. What makes it practical is that LLMs can now synthesize a first draft of the registry from code, PRs, and tickets: the same technology making the problem worse also makes the solution feasible.
How Stoney fits
Stoney is the rule registry built for the AI-coding era. It installs as a GitHub App, reads your code and your recent PRs and your Jira tickets, and produces 20–40 draft rules per repo with full provenance. You approve the ones that describe real business requirements. From that point on:
- Every PR gets a pre-merge rule check — if Cursor or Claude removes the guard backing your “positive totals” rule, a reviewer sees the finding inline before merge. See Pre-merge rule check.
- Every production drift comes with forensic attribution identifying the culprit PR, the author, and the ticket the change contradicts. See Drift forensics.
- Every rule has an owner who gets paged directly when their rule breaks — with the full context attached so they can act without spelunking. See Ownership and escalation.
Setup takes under five minutes and requires no code changes on your end. Free tier covers one repository with no credit card, which is enough to install the GitHub App, see the rules Stoney pulls out of your codebase, and decide whether the registry is useful before you commit to anything.
Start free or read the docs.
A note on this category
The right response to AI-generated code isn’t to slow it down. The leverage AI gives engineering teams is real. What teams need are guardrails around the parts that used to be enforced by a senior engineer’s mental model and aren’t anymore. An API rule registry is one of those guardrails. If you have a senior engineer whose job security depends on remembering why every guard clause was added, you also have a rule-registry-shaped hole. Filling it makes that engineer’s job easier, not redundant.