Generation is cheap. The decisions are the artifact.
Where formal testing methods stop and user feedback begins. Formal tests prove the system did what the spec said. User feedback proves someone could figure out what to do. They aren’t the same thing, and the gap between them is wider than it sounds.
Two requests, in order. Notice how different they feel to do.
- Write a test for your product’s main flow.
- Write a test that would have caught the bug your users hit yesterday.
The first is a walk in the park. Pick a level (unit, integration, or end-to-end), whichever feels closest to the bug class. Open the page, click the buttons in order, assert the states. CI runs green. Coverage ticks up. Your coding agent does this out of the box, no questions asked.
The second is, by definition, hard. To write that test you would have had to already know: what motivation a real user brought to the page, what they read into your copy, where they hesitated, which moment of dissonance made them give up. None of that lives in the code. None of it can be inferred from the spec.
Either:
- You knew it ahead of time, in which case the bug wouldn’t have shipped.
- You didn’t, in which case there’s no test to write.
That gap doesn’t close by going up a level of formality:
- Types check that the values you pass match the shapes you declared.
- Lints check that the code follows the rules you wrote down.
- Unit tests check that functions return what you asserted.
- Integration tests check that components compose.
- End-to-end tests check that the whole flow does what the spec says it should.
All of these are conformance checks. Every one of them is asking the same question, at different scales: did the system do what you told it to do?
User feedback is asking a different question entirely: could an inbound user, under their own motivation, with their own prior experience, and their own expectations, figure out what they were supposed to do? That question doesn’t reduce to spec conformance. The misread was already embedded in the spec.
This isn’t hypothetical. While the experiment in part 1 was running, an iteration shipped a version of the running page whose dominant copy was literally “Close this tab.” A participant read it as an instruction and closed the tab. The page rendered exactly to its spec. No formal test would have flagged the copy, because the copy was the spec. What surfaced the bug was a person trying to use the product under their own motivation, deciding the page meant what it said, and acting on it. That’s the entire distinction in one frame.
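To make the circularity concrete, here is a minimal sketch of what a conformance test for that page might have looked like. The names `SPEC`, `render_running_page`, and the `running_page_headline` key are hypothetical, invented for illustration; only the copy itself comes from the experiment.

```python
# Hypothetical spec: the copy is part of it, so the copy is "correct" by definition.
SPEC = {"running_page_headline": "Close this tab"}


def render_running_page(spec: dict) -> str:
    """Stand-in renderer (an assumption): puts the spec's headline into the page."""
    return f"<main><h1>{spec['running_page_headline']}</h1></main>"


def test_running_page_matches_spec() -> None:
    html = render_running_page(SPEC)
    # Conformance check: the page shows exactly what the spec says it should.
    assert SPEC["running_page_headline"] in html


test_running_page_matches_spec()  # passes, and says nothing about how a person reads it
```

The assertion can only compare the page against the spec. Nothing in it has access to what a reader takes the headline to mean.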
The user can’t be in the spec.
The spec couldn’t have caught the “Close this tab” bug. No spec could have. To write one that would, you’d need the user inside the loop, reading every copy change and telling you what they think it means before you ship it. But they’re not inside the loop. They can’t be. They’re the people paying you to do the building.
That’s the unstated tension at the center of every product. Your users sit at every position in the same loop:
Same group of people, four roles. They arrive with a finite reservoir of trust, and every shipped bug, every misread copy line, every click against expectation draws a small amount out of it. You’d love them present for the feedback. You can’t keep spending their trust capital teaching yourself how to build for them.
The contradiction has no clean resolution. The only tool we have that partly resolves it is the user study, which is imperfect by definition: a synthetic motivation, a fabricated context, a finite-time approximation of the real thing. You iterate, you ship, and you spend trust regardless. A user study moves the cost. You pay in dollars and time instead of trust. The mistakes still happen. They just don’t happen in front of the people you’re billing.
The asymmetry, at every level.
Validating any formal test is cheap. You write the assertion, run it, see green or red, move on. Type, lint, unit, integration, e2e: it doesn’t matter. Whatever the test was supposed to prove, the answer arrives in seconds.
Generating a useful test is asymmetrically expensive, and the asymmetry is the same at every level. Most of the cost isn’t in typing the code; it’s in noticing, ahead of time, which behaviors are worth asserting against. You can’t derive that from the spec, because the spec is what shipped the bug. You can’t infer it from telemetry, because telemetry tells you that users dropped off, not why. The problem isn’t hard because writing tests is hard. It’s hard because the search space (everything a person under their own motivation might try with your product) is enormous, and humans can only infer which parts of it are worth generating tests for, using data and empathy. That inference is the job of the product function at a company (which is rapidly converging with the engineering function).
A user study is the cheap shortcut around the "generation · useful" column. You sidestep the generation problem by handing it to a person under their own motivation and watching what they do. You don’t have to anticipate what they’ll try. They just try things 🤡.
There’s a familiar shape to this. Some problems have a closed-form solution. The ones that don’t, you have to simulate step by step, because there’s no formula that gives you the answer faster than just running it forward. Human behavior is in that second class. There’s no formula for what a motivated person will do with your product. The only way to find out is to let one walk through it.
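A loose illustration of the two classes, borrowed from arithmetic and not a model of user behavior: the sum 1 + 2 + ... + n has a closed form, while the number of steps a Collatz sequence takes to reach 1 has no known closed form, so the only general way to get it is to run the sequence. The function names are chosen for this sketch.

```python
def triangular(n: int) -> int:
    """Closed form: the sum 1 + 2 + ... + n, computed without looping."""
    return n * (n + 1) // 2


def collatz_steps(n: int) -> int:
    """No closed form is known, so iterate until the sequence reaches 1."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    steps = 0
    while n != 1:
        # Halve even numbers; map odd numbers to 3n + 1.
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps


print(triangular(100))    # one multiplication and one division
print(collatz_steps(27))  # has to walk the whole sequence
```

The first function answers in constant time for any n. The second cannot skip ahead; its cost is the length of the walk.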
The path is the residue.
The output of any user study, real participants or synthetic, is structurally a sequence of (action, observation) pairs plus a verdict. You can mechanically convert a participant’s path into a regression test. Many people do. Take the actions, replay them with Playwright, assert that the final state is whatever the participant ended up at, and you have a test that pins the recovered behavior in place.
That conversion is the residue. The path is what’s left behind after the interesting work is done. The thing you actually paid for, what made the study worth running in the first place, is the cognition the participant did to arrive at the path: the half-second of hesitation in front of the size selector, the wrong read of the price strikethrough as “back in stock soon,” the moment they decided the copy was promising something the system couldn’t deliver and clicked away. Those are the decisions. They don’t survive the conversion to a test. They’re visible only at the moment they happened.
This is the trade. You wanted to find a bug that lives in the gap between your code and your user’s expectations, the gap formal tests can’t reach because formal tests are checking the wrong thing. A test written after the fact, even one mechanically derived from a participant’s path, can pin the behavior in place for next time. It does not surface the next gap. The next gap requires another participant willing to be confused on your behalf.
Synthetic participants don’t change the shape.
The participants in this experiment were synthetic: LLMs with a system prompt that gave them a background, a goal, and a disposition. The path they produce is structurally the same as a real participant’s path. The decisions inside the path are distributed differently (the LLM has different priors and different blind spots, and will sometimes pattern-match the wrong metaphor), but the artifact is the same shape, and the same conversion to a regression test works on it.
The reason synthetic participants are valuable isn’t that they replace real users. They don’t. The reason is that the cost of generating a participant’s worth of cognition under uncertainty drops from “recruit, schedule, pay, observe” to roughly free. The decisions you couldn’t generate yourself are now generated for you, in volume, on demand, before the thing ships. The thing they catch is the thing you couldn’t have tested for, because the relevant knowledge is posterior: it only exists once a participant has produced it.
If you build your loop around what your formal tests assert, you will never find what your users find. If you build it around what your participants did, you don’t have to anticipate users at all. They just show up and surface what was wrong, and you read the decisions, and you fix the gap.
The corollary, which is awkward to write down but true, is that the value of the participants is upstream of any output you can persist. The test you save afterwards is downstream. If you confuse the two, treating the saved test as the asset and the participant session as a way to produce it, you optimize for regression coverage and lose the discovery channel that produced anything worth covering. Formal tests are great at pinning. They are not great at finding. The next piece is about a specific way coding agents make exactly this confusion.