Companies are discovering that building AI agents is easy compared to building the systems that make those agents trustworthy in production.
Enterprise AI teams spent the last two years racing to build agents. Now they are running into a different problem: very few of those agents can actually be trusted at scale.

The gap is starting to show up in public numbers. Prosus reportedly built 50,000 agents internally, but only around 5,000 are running daily. That 10-to-1 ratio has become a revealing metric for the current state of enterprise AI production. The issue is not whether companies can create agents. It is whether they can reliably determine which agents are safe to deploy, which outputs are trustworthy, and what happens when systems fail.
That distinction matters because the promised efficiency of autonomous systems assumes the systems are making correct decisions in the first place.
The Experimentation-to-Production Gap
For many engineering teams, the early wave of AI agent deployment moved quickly. Internal copilots, workflow automators, and multi-agent systems appeared across departments. Demos worked. Pilot programs looked promising.
Production environments told a different story.
Antonio Bustamante, CEO of bem, has spent years working on AI infrastructure for regulated industries, including insurance, finance, and healthcare. From his perspective, the industry’s biggest bottleneck is accountability.
He points to a widely discussed incident involving Upstream, in which an AI agent joined a Slack channel, and the human team reportedly went silent for 24 hours because nobody knew how to interact with it. Bustamante argues that the silence exposed something deeper: companies have not designed operational models for working alongside agents.
The same pattern appears inside large-scale enterprise deployments. Teams can quickly generate thousands of agents, but utilization drops once those systems encounter messy production data, unclear ownership, or uncertain outputs.
That is why many companies now have a large number of agents built and deployed but relatively few doing real work in production.
Why Multi-Agent Systems Keep Stalling
Part of the problem comes from how enterprise environments actually work.
In controlled demos, data is clean, and workflows are predictable. Real organizations rarely operate that way. Most enterprise systems contain fragmented records, inconsistent formats, missing context, and years of accumulated operational workarounds.
Bustamante compares the situation to the assembly line. Henry Ford’s manufacturing model succeeded because inputs were standardized before production was scaled. Multi-agent systems face the opposite condition: they are expected to operate on the non-standardized data found in most enterprise environments.
Some companies have already publicly acknowledged the operational burden. In several deployments, organizations found themselves assigning people to check agent outputs continuously. In one example circulating through the industry, a multi-agent system reportedly required 20 people to validate results behind the scenes.
That changes the economics entirely. The promised gains from deploying autonomous agents disappear if humans still need to verify every decision manually.
Confidence Scoring and the Missing Accountability Layer
Bustamante argues that confidence scoring has become one of the most overlooked components of AI governance and production AI infrastructure. Without systems that can measure uncertainty, operators have no reliable way to determine which agents are production-ready and which require intervention.
In practice, confidence scoring means more than assigning a percentage to an answer. It requires systems that can explain uncertainty, trace decisions back to source data, and create human-in-the-loop checkpoints before errors compound across workflows.
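The idea of a human-in-the-loop checkpoint can be illustrated with a minimal Python sketch. It is not bem’s implementation or any vendor’s API: the `AgentOutput` type, the `route` function, the field names, and the 0.85 threshold are all hypothetical, and the sketch assumes the confidence score is already calibrated to a 0-to-1 range.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical cutoff: outputs scoring below this are sent to a human reviewer.
REVIEW_THRESHOLD = 0.85


@dataclass
class AgentOutput:
    """One decision produced by an agent (illustrative structure only)."""
    agent_id: str
    value: str
    confidence: float  # assumed to be a calibrated score in [0, 1]
    source_refs: List[str] = field(default_factory=list)  # pointers back to source data
    uncertainty_note: str = ""  # plain-language explanation of what the agent was unsure about


def route(output: AgentOutput, threshold: float = REVIEW_THRESHOLD) -> str:
    """Return 'auto_approve' or 'human_review' for a single agent output."""
    if not 0.0 <= output.confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    # An output that cannot be traced to source data is never auto-approved,
    # no matter how high its score is.
    if not output.source_refs:
        return "human_review"
    # Low-confidence outputs stop at a checkpoint before reaching downstream steps.
    if output.confidence < threshold:
        return "human_review"
    return "auto_approve"
```

In this sketch, a score alone is not enough to pass: an output with no source references is held for review even at high confidence, which mirrors the point that traceability matters as much as the percentage.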
That layer of AI accountability becomes especially important in industries where mistakes carry financial or legal consequences. A failed insurance claim review, healthcare extraction error, or financial processing mistake can become a liability event.
Bustamante describes bem’s broader thesis as “The agent orchestration platform for things that can’t fail.” The phrase reflects a growing realization across the industry: AI agent reliability depends less on how many agents you deploy and more on whether you can trace, audit, and correct decisions when something goes wrong.
What Production-Ready Infrastructure Looks Like
The next phase of enterprise AI may have less to do with building more agents and more to do with building systems around them.
Companies focused on long-term AI agent utilization are increasingly seeking infrastructure that is flexible during execution, rigid in outcomes, and traceable under failure conditions. That includes confidence scoring, audit trails, intervention points, data standardization, and governance systems designed for production, not demos.
The companies that close the gap between experimenting with multi-agent systems and deploying them in the real world may not be the ones with the most agents. They may be the ones that finally build the accountability infrastructure enterprises skipped the first time around.








