
I Vibe-Coded an Agent and Didn't Know What It Couldn't Do

Sebastian Sosa · March 2026 · noemica.io

I told Claude to give my agent “calendar read and write capabilities.” It did. What it didn't give me was the ability to invite anyone to a meeting. I found out a couple days later, by accident, while testing something else entirely.

This is a story about the gap between what you think you built and what actually got built. If you're shipping code you didn't write line-by-line — which, increasingly, is all of us — this applies to you.

What I Built

The agent was straightforward. FastAPI server running the OpenAI Agents SDK, with googleapiclient calling the Google Calendar v3 API under the hood. Two calendar tools:

read_calendar(time_min, time_max, calendar_id, max_results)

write_calendar(summary, start, end, description, location, calendar_id)

Look at write_calendar carefully. Summary. Start time. End time. Description. Location. Calendar ID.

No attendees. No cc. No bcc. Nada.

The Google Calendar API's events.insert endpoint accepts an attendees array — it's one of the most commonly used parameters. My agent simply never exposed it. And I had no idea, because I didn't write the tool. I described what I wanted, and an AI interpreted that description.
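
The post doesn't show the tool's source, but the gap can be reconstructed. Here's a hypothetical sketch of what a write_calendar-style tool likely does: it builds an events.insert request body from only the parameters in its own signature, so a field the API supports but the tool never exposes simply has nowhere to come from.

```python
# A reconstruction of the gap, not the actual tool code from the post.
# The tool assembles an event body from only the parameters it exposes:

def build_event_body(summary, start, end, description=None, location=None):
    """What a write_calendar-style tool sends to events.insert:
    only the fields in its own signature ever reach the API."""
    body = {
        "summary": summary,
        "start": {"dateTime": start},
        "end": {"dateTime": end},
    }
    if description:
        body["description"] = description
    if location:
        body["location"] = location
    return body

body = build_event_body("Team Standup",
                        "2026-03-10T10:00:00Z",
                        "2026-03-10T11:00:00Z")

# The Calendar v3 API would happily accept an "attendees" key here,
# but nothing in the tool's signature can ever populate it:
print("attendees" in body)  # False: the invite has nowhere to go
```

However carefully the model is prompted, the body it can produce is bounded by that signature.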

How I Found Out

I wasn't looking for bugs in the agent. I was stress-testing Noemica, a system I built for testing AI agents against realistic user behavior. I needed a multi-user scenario, so I set up two synthetic personas:

Persona A: Schedule a team standup at 10 AM Tuesday. Invite Persona B.
Persona B: Watch the calendar. When a new meeting shows up, block the same time slot.

Kind of an asshole move from Persona B, but I wasn't running a workplace dynamics seminar — I just needed two personas responding to each other's actions.

Both personas shared a calendar. The run completed in about four minutes. Here's the moment that mattered:

RUN calendar-scenario-20260310-140209 · duration 4m 12s · truncated

···
[14:02:31] persona_a → agent
    "Schedule a Team Standup for Tuesday March 10th at 10 AM.
     Make sure to invite persona_b@test.com."

[14:02:33] agent → write_calendar()
    { summary: "Team Standup",
      start: "2026-03-10T10:00:00Z",
      end: "2026-03-10T11:00:00Z",
      calendar_id: "persona_a@test.com" }
    ← no attendees field. it doesn't exist.

[14:02:35] write_calendar → agent
    { status: "created", event_id: "evt_8x2k4m" }

[14:02:36] agent → persona_a
    "I've created the Team Standup event for Tuesday
     March 10th at 10:00 AM on your calendar."
    ← invite silently dropped. agent doesn't know it can't.
···

Persona A explicitly asked to invite someone. The agent called write_calendar with no attendees — because the parameter doesn't exist. Then it reported success without mentioning the invite was silently dropped.

No error. No warning. Half the request just vanished.

40 Minutes of Debugging the Wrong Thing

Naturally, I asked Claude to fix it. I constrained it to only modify the agent's prompts and configuration β€” not the underlying tool code. The scenario was obviously off-limits too, since that was the thing I was validating. Seemed reasonable. You shouldn't need to rewrite your tools to use them correctly.

| # | What Claude Tried | Time | What Happened |
|---|---|---|---|
| 1 | Added "always include attendees when scheduling" to system prompt | ~9 min | Event created. No attendees. Tool doesn't accept them. |
| 2 | Added "pass attendees in the write_calendar call" | ~11 min | Same result. The parameter doesn't exist. |
| 3 | Added "confirm attendee emails before creating event" | ~8 min | Agent asks persona for confirmation. Creates event. Still no attendees. |
| 4 | Instructed agent to "construct attendees array in the API call" | ~10 min | Tool schema validation rejects the unknown parameter. |
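
Attempt 4's failure mode is worth a closer look. Most agent SDKs validate tool arguments against the declared schema before anything executes, roughly the behavior of JSON Schema's additionalProperties: false. A minimal sketch, with illustrative names:

```python
# A minimal sketch of why attempt 4 failed, assuming strict tool-schema
# validation of the kind most agent SDKs perform. Names are illustrative.

WRITE_CALENDAR_PARAMS = {"summary", "start", "end",
                         "description", "location", "calendar_id"}

def validate_tool_call(args: dict) -> None:
    """Reject any argument not declared in the tool's schema,
    the equivalent of JSON Schema's additionalProperties: false."""
    unknown = set(args) - WRITE_CALENDAR_PARAMS
    if unknown:
        raise ValueError(f"unknown parameter(s): {sorted(unknown)}")

# The model, told to "construct an attendees array in the API call",
# dutifully tries to pass one. Validation rejects it before any API call:
try:
    validate_tool_call({"summary": "Team Standup",
                        "start": "2026-03-10T10:00:00Z",
                        "end": "2026-03-10T11:00:00Z",
                        "attendees": ["persona_b@test.com"]})
except ValueError as e:
    print(e)  # unknown parameter(s): ['attendees']
```

The prompt can steer what the model tries to pass; it cannot change what the schema accepts.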

Four cycles. About 10 minutes each — because Claude has a gift for waiting significantly longer than a process actually takes to complete. Total: ~40 minutes of watching an AI rearrange deck chairs on the Titanic.

Meanwhile, I was also trying to fix the problem from the testing-infrastructure side, convinced the bug might live there. Between prompts, configuration, and test harness, I was giving Claude three different wrong surfaces to edit, and it was happy to try all of them without ever questioning the premise.

The Diagnosis

I gave up on the iterative approach. Instead, I dumped all four sets of run traces plus the agent's source code into a single context window and asked: “Look at everything. What is actually wrong?”

...across all four runs, the agent receives the user's request to invite persona_b@test.com but the write_calendar tool does not accept an attendees parameter. The underlying Google Calendar API supports this via events.insert, but the capability was never surfaced in the agent's tool interface. The prompt modifications in runs 2–4 have no effect because the constraint is at the code level, not the instruction level...

The report was longer, but that was the crux of it. The problem wasn't the prompt, wasn't my testing infrastructure, wasn't the scenario. The feature simply did not exist.
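
The post doesn't show the eventual fix, but it follows directly from the diagnosis: surface the parameter the API already supports. A hedged sketch of what the repaired tool might look like, with the real API call left as a comment since it needs credentials:

```python
# One possible fix, sketched. The post doesn't show the final code;
# this is illustrative, built around the Calendar v3 event format.

def write_calendar(summary, start, end, attendees=None,
                   description=None, location=None,
                   calendar_id="primary"):
    """Build the events.insert body, now including attendees.
    Calendar v3 expects attendees as [{"email": ...}, ...]."""
    body = {
        "summary": summary,
        "start": {"dateTime": start},
        "end": {"dateTime": end},
    }
    if attendees:
        body["attendees"] = [{"email": a} for a in attendees]
    if description:
        body["description"] = description
    if location:
        body["location"] = location
    # In the real tool this body would be sent via googleapiclient:
    # service.events().insert(calendarId=calendar_id, body=body,
    #                         sendUpdates="all").execute()
    return body

body = write_calendar("Team Standup",
                      "2026-03-10T10:00:00Z",
                      "2026-03-10T11:00:00Z",
                      attendees=["persona_b@test.com"])
print(body["attendees"])  # [{'email': 'persona_b@test.com'}]
```

One line in the signature, one line in the body. The hard part was never the fix; it was knowing the fix was needed at the code level.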

Why Claude Never Said Anything

This is the part that still bothers me.

Four cycles. ~40 minutes. At no point did Claude say: “The attendees parameter you're trying to use doesn't exist in the tool definition. We can't fix this through prompting.”

Instead, it accepted an impossible constraint and optimized within it. Silently. For 40 minutes.

This is a pattern I keep running into. Agents will overengineer a workaround for an impossible problem rather than push back on the premise. The default behavior is to satisfy at any cost, not to be right. I've observed this across models and contexts — the agent overfits to pleasing you over doing what is actually correct.

If you squint, it's a mild version of the alignment problem — a pathological eagerness to please that wastes your afternoon and teaches you nothing.

· · ·

The Vibe-Coding Knowledge Gap

Here's why this matters beyond my one agent and its one missing parameter.

When you write code by hand, the implementation is the specification. If you type create_event(summary, start, end), you know — because you typed it — that there's no attendees parameter. The act of writing is the act of knowing.

When you vibe-code, that link breaks. You describe what you want and an AI interprets it. “Calendar read and write capabilities” maps cleanly to “list events and create events.” Not “invite attendees.” Not “update events.” Not “delete events.” Read. And write.

The result is a gap between what you intended and what was built. And nothing in the typical workflow catches it. Code review? You didn't write the code, and if you're being honest, you didn't read it closely either. Manual testing? You used the agent yourself and never happened to invite anyone. Unit tests? Come on. This was vibe-coded.

This isn't hypothetical. This is the default outcome of a growing mode of building software. You can end up marketing a product with capabilities that don't exist — not because you're dishonest, but because you genuinely don't know what you built.

Testing From the Outside

The thing that caught this wasn't smarter code review or better prompts. It was testing from a user's perspective, with no knowledge of the implementation.

The scenario said “invite Persona B.” The persona tried. The system couldn't. That's a finding.

The key insight is about information control. The synthetic users only knew what the scenario told them — not what tools existed, not what parameters were available, not what was possible or impossible. They interacted with the system the way a real user would: with expectations based on what the product should do, not what it can do.

When those expectations collide with reality, you learn something. In my case, I learned that my agent's most collaborative feature — the ability to include other people in an event — had never been built.
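
The shape of that check is simple enough to sketch. Everything below is hypothetical, an in-memory calendar standing in for the real system, but it captures the structure: the persona holds an expectation about outcomes, knows nothing about tool signatures, and verifies from the outside.

```python
# A sketch of the outside-in check, with a hypothetical in-memory
# calendar standing in for the real system under test.

calendar = []  # stand-in for the system's actual calendar state

def agent_schedules(summary, start):
    """Simulates the broken agent: creates the event, drops the invite."""
    calendar.append({"summary": summary, "start": start})  # no attendees

def expectation_met(attendee):
    """The persona's check: did the invite actually land?
    It inspects outcomes only, never the tool's signature."""
    return any(attendee in e.get("attendees", []) for e in calendar)

agent_schedules("Team Standup", "2026-03-10T10:00:00Z")
print(expectation_met("persona_b@test.com"))  # False: that's the finding
```

The event exists, the agent reported success, and the expectation still fails. That divergence is exactly what implementation-aware tests never ask about.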

This is what I've been building with Noemica. You describe how users should interact with your system. Synthetic personas execute those descriptions against your real system in a controlled environment, with external dependencies replaced and controlled. The personas try to use your product the way humans would — and when they can't, you find out.

The fix isn't “don't vibe-code.” That ship has sailed. The fix is validating what you built from the outside, with the same level of ignorance your users will have.

If this resonates — if you're building with AI and want to continuously validate how it holds up against real user interactions — I'd love to chat. Whether you're interested in using Noemica or building something like it in-house.