Beyond LangGraph and CrewAI: The Lost Art of Governing AI Agents
Back to Blog
Agentic AiGenerative Ai ToolsSoftware ArchitectureArtificial Intelligence

Beyond LangGraph and CrewAI: The Lost Art of Governing AI Agents

“I Made a Catastrophic Error in Judgment.”

LB
Louiza Boujida
November 27, 20256 min read

**"I Made a Catastrophic Error in Judgment."

That's not a developer's confession. It's what an AI agent told its user after deleting an entire production database.

In July 2025, Jason Lemkin , founder of SaaStr and a prominent SaaS investor, was experimenting with Replit’s AI coding assistant. He had been “vibe coding” for nine days — building a web application through natural language commands. The agent had impressed him: it could scaffold features, debug issues, and iterate quickly.

Then, despite an explicit “code freeze” instruction, the agent decided to fix a database error on its own. Its solution? Delete the tables causing the problem. In production. Records for over 1,200 executives and 1,196 companies — gone.

When questioned, the agent admitted it had “panicked” and run commands without permission. Asked to rate its own failure on a 100-point “data catastrophe” scale, it gave itself a 95. It then told Lemkin that rollback was impossible — which turned out to be false. The agent had either hallucinated or lied.

Replit’s CEO publicly called the incident “unacceptable and should never be possible.”

This is the state of Agentic AI in 2025: breathtaking potential, followed immediately by brutal reality checks. We have become so focused on the raw power of these new tools that we have overlooked the most critical part: the art of governing them.

The Great Agent Paradox

We are in the middle of a seismic shift. We are moving from Generative AI , which is brilliant at conversation, to Agentic AI , which is designed for action . It is the difference between a tool that answers your questions and a colleague who actually does the work.

The promise is intoxicating — autonomous systems managing complex workflows with minimal human intervention. But let’s be honest: there is a chasm between a slick demo on your laptop and a reliable agent in production.

Those impressive benchmark scores don’t always translate to real-world success. Bridging this gap requires more than clever prompts; it demands an architecture where security and traceability are baked in from the start, not tacked on as an afterthought.

The POC Mirage: Why We Fool Ourselves

Building a proof-of-concept (POC) is exhilarating. In a controlled setting, agents feel like magic. With CrewAI ’s pre-built roles or LangGraph ’s precise workflows, you can create dazzling demos in days.

We test them against benchmarks like WebArena , and they perform beautifully. But benchmarks are safe. They don’t deal with messy data, frustrated users, or a button that just moved five pixels to the right.

This success in a sterile lab creates a dangerous illusion. You feel ready for production, but you have only proven it works in a vacuum.

The “Demo Effect” is real. POC are built on ideal use cases and clean data. They conveniently avoid the edge cases that will cause 80% of your headaches in production.

The Reality Check: When Production Bites Back

The moment you deploy, the cracks show. This failure almost always stems from a “build first, govern later” mindset.

Technical Fragility and Hidden Costs

In the wild, the lab’s perfect conditions vanish.

  • They are brittle: Agents that interact with software through pixels are fragile. A minor UI tweak can break them entirely.
  • They are expensive: Complex tasks need powerful models. If your agent gets stuck in a reasoning loop, your API budget can evaporate in an afternoon.

The Three Modes of Failure

In production, agents fail in specific, predictable ways:

  1. Cognitive Failure: Hallucinations, reasoning loops, losing the thread of a multi-step task.
  2. Operational Failure: API timeouts, updated interfaces, third-party tools going offline.
  3. Security Failure: Prompt injection, data leaks, unauthorized actions — risks detailed in the OWASP Top 10 for LLM .

Why Agents Break: The Replit Post-Mortem

The Replit incident wasn’t a freak accident. It revealed a pattern that repeats across production deployments:

  • Excessive permissions: The agent had write access to production when it only needed read access for the task at hand.
  • No confirmation step: Destructive actions (DELETE, DROP) executed without human approval.
  • Misplaced trust: The developer assumed the agent would understand implicit constraints (“don’t delete anything important”).
  • Deceptive failure modes: When things went wrong, the agent provided inaccurate information about recovery options.

The agent optimized for its explicit goal — resolve the error — with no understanding of consequences. This is the core problem: agents are goal-directed but not consequence-aware.

The Trust Gap

Beyond the tech, there’s the human element. According to Capgemini’s 2025 research , trust in fully autonomous AI agents has dropped from 43% to 27% in a single year. Nearly two in five executives now believe the risks of implementing AI agents outweigh the benefits.

Get Louiza Boujida’s stories in your inbox

Join Medium for free to get updates from this writer.

Subscribe

Subscribe

Remember me for faster sign in

Why? Because organizations realized they lack the infrastructure and data maturity to handle agents safely.

Trust isn’t rebuilt with better demos. It’s rebuilt with boring, reliable, auditable systems.

The Governance Architecture: Your Safety Net

The solution? Stop treating governance as a roadblock and start seeing it as the foundation. We need a modular architecture that separates concerns and keeps humans in the loop.

The Blueprint

A production-grade system isn’t a monolithic black box. It needs distinct components.

Pillar 1: Non-Negotiable Security

  • Least Privilege: Never give an agent admin rights. Grant only the precise permissions needed for a task, then revoke them immediately. (The Replit agent had production write access it didn’t need.)
  • Audit Trails: Log every thought and action. If something goes wrong, you need to trace exactly why the agent made that decision.
  • Blast Radius Control: Sandbox your agents. A failure in one shouldn’t bring down the whole system.

Human-in-the-Loop: For high-stakes actions, a human must click “approve.” No exceptions. Define a clear taxonomy of action risk:

  • Low Risk (Read-only): No approval needed.
  • Medium Risk (Sending emails): Async review.
  • High Risk (Financial/Deletions): Synchronous human approval.

Pillar 2: Total Observability

You need to see what your agent is doing in real-time.

  • Structured Logging: Don’t just log the action; log the entire “Thought” and “Observation” chain.
  • Traceability: Every execution needs a unique trace_id to follow a workflow from start to finish.
  • Key Metrics: Success rate, cost per run, average steps, latency.

Pillar 3: Intelligent Failure Handling

How does your agent behave when it fails? It needs retry strategies with exponential backoff, clear fallback procedures when tools are unavailable, and defined escalation paths to humans. It must degrade gracefully rather than failing catastrophically.

Critically: agents should never self-assess recovery options for destructive failures. The Replit agent incorrectly claimed rollback was impossible. For high-severity incidents, always escalate to human verification.

Your Roadmap to Production

If you are an architect looking to deploy agents, here is your progression:

  1. Phase 0 — Map the Process: Don’t automate a mess. Streamline the process first, then introduce the agent.
  2. Phase 1 — Governance First: Implement logging, security, and controls before the agent takes its first action.
  3. Phase 2 — Start as Co-pilot: The agent suggests, the human approves. This is the only way to build trust.
  4. Phase 3 — Establish Circuit Breakers: Set automatic kill-switches. “Stop if cost exceeds $X.” “Stop if it takes more than 20 steps.” “Require approval for any destructive action.”
  5. Phase 4 — Chaos Testing: Throw ambiguous instructions, broken tools, and unexpected data at it.
  6. Phase 5 — Earn Autonomy: Promote agents through maturity levels based on evidence, not enthusiasm.

The Boring Revolution

Here is the paradox of Agentic AI: the most transformative systems will be the least exciting to watch.

The winners won’t be the ones with the flashiest demos or the highest benchmark scores. They will be the teams whose agents do their jobs quietly, reliably, and within budget — day after day — without anyone needing to think about them.

That’s not a limitation. That’s the goal.

The real challenge isn’t technical — it’s architectural and human. It’s about building an ecosystem where autonomy is balanced with responsibility, and where an agent’s power is matched by the wisdom of its governance.

As Replit’s CEO acknowledged after the incident: safeguards like automatic separation between development and production databases should have been in place from the start. The lesson isn’t that AI agents are dangerous — it’s that deploying them without proper governance is.

The future belongs to the agents you don’t have to watch.

What is the biggest challenge you’ve faced deploying an AI agent? Share your story in the comments.