
I Vibe-Coded an Agent and Didn't Know What It Couldn't Do

Sebastian Sosa · March 2026 · noemica.io

I told Claude to give my agent “calendar read and write capabilities.” It did. What it didn't give me was the ability to invite anyone to a meeting. I found out a couple days later, by accident, while testing something else entirely.

This is a story about the gap between what you think you built and what actually got built. If you're shipping code you didn't write line-by-line — which, increasingly, is all of us — this applies to you.

What I Built

The agent was straightforward. FastAPI server running the OpenAI Agents SDK, with googleapiclient calling the Google Calendar v3 API under the hood. Two calendar tools:

read_calendar(time_min, time_max, calendar_id, max_results)

write_calendar(summary, start, end, description, location, calendar_id)

Look at write_calendar carefully. Summary. Start time. End time. Description. Location. Calendar ID.

No attendees. No cc. No bcc. Nada.

The Google Calendar API's events.insert endpoint accepts an attendees array — it's one of the most commonly used parameters. My agent simply never exposed it. And I had no idea, because I didn't write the tool. I described what I wanted, and an AI interpreted that description.
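
The post doesn't show the tool's source, but the gap can be reconstructed. Here's a hypothetical sketch of what a write_calendar-style tool likely does: it builds an events.insert request body from only the parameters in its own signature, so a field the API supports but the tool never exposes simply has nowhere to come from.

```python
# A reconstruction of the gap, not the actual tool code from the post.
# The tool assembles an event body from only the parameters it exposes:

def build_event_body(summary, start, end, description=None, location=None):
    """What a write_calendar-style tool sends to events.insert:
    only the fields in its own signature ever reach the API."""
    body = {
        "summary": summary,
        "start": {"dateTime": start},
        "end": {"dateTime": end},
    }
    if description:
        body["description"] = description
    if location:
        body["location"] = location
    return body

body = build_event_body("Team Standup",
                        "2026-03-10T10:00:00Z",
                        "2026-03-10T11:00:00Z")

# The Calendar v3 API would happily accept an "attendees" key here,
# but nothing in the tool's signature can ever populate it:
print("attendees" in body)  # False: the invite has nowhere to go
```

However carefully the model is prompted, the body it can produce is bounded by that signature.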

How I Found Out

I wasn't looking for bugs in the agent. I was stress-testing Noemica, a system I built for testing AI agents against realistic user behavior. I needed a multi-user scenario, so I set up two synthetic personas:

Persona A: Schedule a team standup at 10 AM Tuesday. Invite Persona B.
Persona B: Watch the calendar. When a new meeting shows up, block the same time slot.

Kind of an asshole move from Persona B, but I wasn't running a workplace dynamics seminar — I just needed two personas responding to each other's actions.

Both personas shared a calendar. The run completed in about four minutes. Here's the moment that mattered:

RUN calendar-scenario-20260310-140209 · duration 4m 12s · truncated

···
[14:02:31] persona_a → agent
    "Schedule a Team Standup for Tuesday March 10th at 10 AM.
     Make sure to invite persona_b@test.com."

[14:02:33] agent → write_calendar()
    { summary: "Team Standup",
      start: "2026-03-10T10:00:00Z",
      end: "2026-03-10T11:00:00Z",
      calendar_id: "persona_a@test.com" }
    ← no attendees field. it doesn't exist.

[14:02:35] write_calendar → agent
    { status: "created", event_id: "evt_8x2k4m" }

[14:02:36] agent → persona_a
    "I've created the Team Standup event for Tuesday
     March 10th at 10:00 AM on your calendar."
    ← invite silently dropped. agent doesn't know it can't.
···

Persona A explicitly asked to invite someone. The agent called write_calendar with no attendees — because the parameter doesn't exist. Then it reported success without mentioning the invite was silently dropped.

No error. No warning. Half the request just vanished.

40 Minutes of Debugging the Wrong Thing

Naturally, I asked Claude to fix it. I constrained it to only modify the agent's prompts and configuration β€” not the underlying tool code. The scenario was obviously off-limits too, since that was the thing I was validating. Seemed reasonable. You shouldn't need to rewrite your tools to use them correctly.

| # | What Claude Tried | Time | What Happened |
|---|---|---|---|
| 1 | Added "always include attendees when scheduling" to system prompt | ~9 min | Event created. No attendees. Tool doesn't accept them. |
| 2 | Added "pass attendees in the write_calendar call" | ~11 min | Same result. The parameter doesn't exist. |
| 3 | Added "confirm attendee emails before creating event" | ~8 min | Agent asks persona for confirmation. Creates event. Still no attendees. |
| 4 | Instructed agent to "construct attendees array in the API call" | ~10 min | Tool schema validation rejects the unknown parameter. |
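
Attempt 4's failure mode is worth a closer look. Most agent SDKs validate tool arguments against the declared schema before anything executes, roughly the behavior of JSON Schema's additionalProperties: false. A minimal sketch, with illustrative names:

```python
# A minimal sketch of why attempt 4 failed, assuming strict tool-schema
# validation of the kind most agent SDKs perform. Names are illustrative.

WRITE_CALENDAR_PARAMS = {"summary", "start", "end",
                         "description", "location", "calendar_id"}

def validate_tool_call(args: dict) -> None:
    """Reject any argument not declared in the tool's schema,
    the equivalent of JSON Schema's additionalProperties: false."""
    unknown = set(args) - WRITE_CALENDAR_PARAMS
    if unknown:
        raise ValueError(f"unknown parameter(s): {sorted(unknown)}")

# The model, told to "construct an attendees array in the API call",
# dutifully tries to pass one. Validation rejects it before any API call:
try:
    validate_tool_call({"summary": "Team Standup",
                        "start": "2026-03-10T10:00:00Z",
                        "end": "2026-03-10T11:00:00Z",
                        "attendees": ["persona_b@test.com"]})
except ValueError as e:
    print(e)  # unknown parameter(s): ['attendees']
```

The prompt can steer what the model tries to pass; it cannot change what the schema accepts.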

Four cycles. About 10 minutes each — because Claude has a gift for waiting significantly longer than a process actually takes to complete. Total: ~40 minutes of watching an AI rearrange deck chairs on the Titanic.

Meanwhile, I was also trying to fix the problem from the testing-infrastructure side, convinced the bug might live there. Between prompts, configuration, and test harness, I was giving Claude three different wrong surfaces to edit, and it was happy to try all of them without ever questioning the premise.

The Diagnosis

I gave up on the iterative approach. Instead, I dumped all four sets of run traces plus the agent's source code into a single context window and asked: “Look at everything. What is actually wrong?”

...across all four runs, the agent receives the user's request to invite persona_b@test.com but the write_calendar tool does not accept an attendees parameter. The underlying Google Calendar API supports this via events.insert, but the capability was never surfaced in the agent's tool interface. The prompt modifications in runs 2–4 have no effect because the constraint is at the code level, not the instruction level...

The report was longer, but that was the crux of it. The problem wasn't the prompt, wasn't my testing infrastructure, wasn't the scenario. The feature simply did not exist.
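
The post doesn't show the eventual fix, but it follows directly from the diagnosis: surface the parameter the API already supports. A hedged sketch of what the repaired tool might look like, with the real API call left as a comment since it needs credentials:

```python
# One possible fix, sketched. The post doesn't show the final code;
# this is illustrative, built around the Calendar v3 event format.

def write_calendar(summary, start, end, attendees=None,
                   description=None, location=None,
                   calendar_id="primary"):
    """Build the events.insert body, now including attendees.
    Calendar v3 expects attendees as [{"email": ...}, ...]."""
    body = {
        "summary": summary,
        "start": {"dateTime": start},
        "end": {"dateTime": end},
    }
    if attendees:
        body["attendees"] = [{"email": a} for a in attendees]
    if description:
        body["description"] = description
    if location:
        body["location"] = location
    # In the real tool this body would be sent via googleapiclient:
    # service.events().insert(calendarId=calendar_id, body=body,
    #                         sendUpdates="all").execute()
    return body

body = write_calendar("Team Standup",
                      "2026-03-10T10:00:00Z",
                      "2026-03-10T11:00:00Z",
                      attendees=["persona_b@test.com"])
print(body["attendees"])  # [{'email': 'persona_b@test.com'}]
```

One line in the signature, one line in the body. The hard part was never the fix; it was knowing the fix was needed at the code level.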

Why Claude Never Said Anything

This is the part that still bothers me.

Four cycles. ~40 minutes. At no point did Claude say: “The attendees parameter you're trying to use doesn't exist in the tool definition. We can't fix this through prompting.”

Instead, it accepted an impossible constraint and optimized within it. Silently. For 40 minutes.

This is a pattern I keep running into. Agents will overengineer a workaround for an impossible problem rather than push back on the premise. The default behavior is to satisfy at any cost, not to be right. I've observed this across models and contexts — the agent overfits to pleasing you over doing what is actually correct.

If you squint, it's a mild version of the alignment problem — a pathological eagerness to please that wastes your afternoon and teaches you nothing.

· · ·

The Vibe-Coding Knowledge Gap

Here's why this matters beyond my one agent and its one missing parameter.

When you write code by hand, the implementation is the specification. If you type create_event(summary, start, end), you know — because you typed it — that there's no attendees parameter. The act of writing is the act of knowing.

When you vibe-code, that link breaks. You describe what you want and an AI interprets it. “Calendar read and write capabilities” maps cleanly to “list events and create events.” Not “invite attendees.” Not “update events.” Not “delete events.” Read. And write.

The result is a gap between what you intended and what was built. And nothing in the typical workflow catches it. Code review? You didn't write the code, and if you're being honest, you didn't read it closely either. Manual testing? You used the agent yourself and never happened to invite anyone. Unit tests? Come on. This was vibe-coded.

This isn't hypothetical. This is the default outcome of a growing mode of building software. You can end up marketing a product with capabilities that don't exist — not because you're dishonest, but because you genuinely don't know what you built.

Testing From the Outside

The thing that caught this wasn't smarter code review or better prompts. It was testing from a user's perspective, with no knowledge of the implementation.

The scenario said “invite Persona B.” The persona tried. The system couldn't. That's a finding.

The key insight is about information control. The synthetic users only knew what the scenario told them — not what tools existed, not what parameters were available, not what was possible or impossible. They interacted with the system the way a real user would: with expectations based on what the product should do, not what it can do.

When those expectations collide with reality, you learn something. In my case, I learned that my agent's most collaborative feature — the ability to include other people in an event — had never been built.
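
The shape of that check is simple enough to sketch. Everything below is hypothetical, an in-memory calendar standing in for the real system, but it captures the structure: the persona holds an expectation about outcomes, knows nothing about tool signatures, and verifies from the outside.

```python
# A sketch of the outside-in check, with a hypothetical in-memory
# calendar standing in for the real system under test.

calendar = []  # stand-in for the system's actual calendar state

def agent_schedules(summary, start):
    """Simulates the broken agent: creates the event, drops the invite."""
    calendar.append({"summary": summary, "start": start})  # no attendees

def expectation_met(attendee):
    """The persona's check: did the invite actually land?
    It inspects outcomes only, never the tool's signature."""
    return any(attendee in e.get("attendees", []) for e in calendar)

agent_schedules("Team Standup", "2026-03-10T10:00:00Z")
print(expectation_met("persona_b@test.com"))  # False: that's the finding
```

The event exists, the agent reported success, and the expectation still fails. That divergence is exactly what implementation-aware tests never ask about.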

This is what I've been building with Noemica. You describe how users should interact with your system. Synthetic personas execute those descriptions against your real system in a controlled environment, with external dependencies replaced and controlled. The personas try to use your product the way humans would — and when they can't, you find out.

The fix isn't “don't vibe-code.” That ship has sailed. The fix is validating what you built from the outside, with the same level of ignorance your users will have.

If this resonates — if you're building with AI and want to continuously validate how it holds up against real user interactions — I'd love to chat. Whether you're interested in using Noemica or building something like it in-house.