Why 40% of AI projects fail (and it's not the model's fault)


You've seen the stat. Gartner predicts over 40% of agentic AI projects will be canceled by end of 2027 due to escalating costs, unclear business value, or inadequate risk controls. Leadership blames the models. Engineers blame the data. Product managers blame scope creep. We've spent the last year building workflow infrastructure for AI agents, and the pattern we keep seeing is simpler than any of those explanations. More fixable, too.

Most AI projects don't fail because the model can't reason. They fail because nobody encoded the workflow knowledge the model needs to act.

The gap between "can call an API" and "can do the job"

Modern LLMs handle single-step tool use well. On the Berkeley Function Calling Leaderboard (BFCL), top models score around 70% overall, with near-perfect marks on simple single-turn calls. Ask Claude to check the weather or look up a customer record and it nails it.

But business tasks aren't single API calls. Processing a refund means verifying the order, checking return status, calculating the refund amount, reversing the payment, updating inventory, sending confirmation, and logging for compliance. Seven steps, strict ordering, each depending on the last.
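
To make the ordering concrete, here's a minimal sketch of that flow as explicit code. Every function and field name is hypothetical, standing in for whatever your order, payment, and inventory services actually expose:

```python
# Hypothetical refund pipeline: seven ordered steps, each consuming
# the output of the one before it. All names are illustrative only.
def process_refund(api, order_id: str) -> str:
    order = api.get_order(order_id)                         # 1. verify the order
    assert order["return_status"] == "received"             # 2. check return status
    amount = order["total"] - order.get("restock_fee", 0)   # 3. calculate refund amount
    reversal = api.reverse_payment(order["payment_id"], amount)  # 4. reverse the payment
    api.adjust_inventory(order["line_items"])               # 5. update inventory
    api.send_confirmation(order["customer_email"], amount)  # 6. send confirmation
    api.log_for_compliance(order_id, reversal["id"])        # 7. log for compliance
    return reversal["id"]
```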

This is where things fall apart. On OSWorld, which tests agents on real multi-step computer tasks, the best model originally scored about 12% success rate. Humans hit 72%. Recent agentic frameworks have pushed scores into the 45-61% range, but only by layering orchestration logic on top of the base model. The model alone still can't sequence its way through a real workflow.

The 40% failure rate isn't about AI capability. It's about the absence of reliable workflow execution.

Workflow knowledge is the missing layer

When a new engineer joins your team, you don't hand them API docs and say "figure it out." You pair them with someone who walks through the workflow: which service to call first, what the response looks like, what to do when the payment gateway times out on a Friday afternoon.

That knowledge exists. It lives in your Cypress test suites encoding the happy path. In Jira tickets describing the sad path. In Confluence pages that three people maintain. In the heads of engineers who built the system. It's everywhere except where an AI agent can actually use it.

The projects that fail hand an agent a pile of API endpoints and expect it to derive the workflow from schema descriptions. The projects that succeed encode workflow knowledge explicitly, either by hand (expensive, doesn't scale) or through automated extraction.

What "workflow reliability" actually means

Workflow reliability isn't just "the steps run in the right order," though that matters. It's a set of properties that production systems need, and missing any one of them will bite you.

Step 3 needs the output of step 2. Not just any output: a specific field from the response, transformed into the format step 3 expects. If the agent has to guess this mapping, it fabricates parameters at a meaningful rate. Research on agent hallucinations shows that tool-calling errors increase with the number of available tools, and errors compound across steps: even when each individual step is 97% correct, a 10-step workflow completes end to end only about 74% of the time. That's dependency resolution, and it's table stakes.
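
The arithmetic behind that compounding is one line:

```python
# Per-step reliability multiplies across a chain of dependent steps.
per_step = 0.97
print(per_step ** 10)  # ≈ 0.737: a 10-step workflow fails about 1 time in 4
```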

Then there's transactional integrity. If step 5 fails after steps 1-4 succeeded, you need compensation actions. The payment was processed but shipping failed? Now you need an automated reversal, not an orphaned charge sitting in your billing system.
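
A minimal sketch of what that looks like, assuming each step is registered with its own undo action. This is the classic saga pattern, not any particular framework's API:

```python
# Run steps in order; if one fails, undo the completed ones in reverse.
def run_with_compensation(steps):
    completed = []  # (name, compensate) for every step that succeeded
    try:
        for name, action, compensate in steps:
            action()
            completed.append((name, compensate))
    except Exception:
        for name, compensate in reversed(completed):
            compensate()  # e.g. refund the charge if shipping failed
        raise

# Hypothetical usage: pair every forward action with its reversal.
# run_with_compensation([
#     ("charge", charge_card, refund_card),
#     ("ship", create_shipment, cancel_shipment),
# ])
```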

Deterministic execution paths matter more than most people realize. ReAct-style reasoning (think, act, observe, repeat) works for exploration but breaks down for business processes. A 20-step workflow means 20 separate model invocations and 20 network round trips. Each one is a chance for the agent to lose the thread. Deterministic execution maps eliminate this sequential fragility.
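
Here's one way to picture a deterministic execution map, sketched as plain data plus a dumb loop. The plan format, the "$step.field" reference syntax, and the tool names are all assumptions for illustration:

```python
# The plan is fixed data. A plain loop executes it; no model call
# happens between steps, so there is no thread to lose.
PLAN = [
    {"id": "step1", "tool": "get_order",          "args": {"order_id": "$input.order_id"}},
    {"id": "step2", "tool": "reverse_payment",    "args": {"payment_id": "$step1.payment_id"}},
    {"id": "step3", "tool": "send_confirmation",  "args": {"to": "$step1.customer_email"}},
]

def resolve(value, results):
    # "$step1.payment_id" -> results["step1"]["payment_id"]
    if isinstance(value, str) and value.startswith("$"):
        source, field = value[1:].split(".", 1)
        return results[source][field]
    return value

def execute(plan, tools, inputs):
    results = {"input": inputs}
    for step in plan:
        args = {k: resolve(v, results) for k, v in step["args"].items()}
        results[step["id"]] = tools[step["tool"]](**args)
    return results
```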

And then there's experiential learning. The first time a workflow hits an undocumented API quirk (rate limiting on the payment endpoint during peak hours, say), the system should learn and adapt. The fiftieth time, it should route around the problem automatically.
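
In code, even a crude version of that learning loop is just bookkeeping. Everything here (the error type, the threshold, the fallback) is a hypothetical sketch:

```python
from collections import defaultdict

class RateLimitError(Exception):
    pass

failures = defaultdict(int)
THRESHOLD = 5

def call_adaptively(endpoint, primary, fallback):
    # Once an endpoint has burned us enough times, route around it up front.
    if failures[endpoint] >= THRESHOLD:
        return fallback()
    try:
        return primary()
    except RateLimitError:
        failures[endpoint] += 1  # remember the quirk for next time
        return fallback()
```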

Why frameworks alone don't solve this

LangChain, CrewAI, and AutoGen give you useful plumbing for building agent systems: prompt management, tool registration, basic orchestration patterns. But they don't contain your workflow knowledge, and they can't extract it.

A framework gives you the ability to chain tool calls. It doesn't tell the agent which tools to chain, in what order, with what parameters, or what to do when step 3 returns an error code nobody documented. That's the knowledge layer, and it's separate from the orchestration layer.

Think of it like a programming language versus a program. Python gives you the ability to write anything. Your codebase is the specific thing you wrote. Frameworks give agents the ability to orchestrate. Workflow knowledge is the specific orchestration they need.
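
Concretely, the knowledge layer is the part you can't download. Here's a hedged sketch of what one step of it might record; the error codes and handling policies are invented, but this is exactly the kind of detail no framework ships with:

```python
# One step of encoded workflow knowledge: not how to call tools,
# but which tool, with what inputs, and what to do when it breaks.
REFUND_PAYMENT_STEP = {
    "tool": "reverse_payment",
    "args": {"payment_id": "$step1.payment_id", "amount": "$step3.amount"},
    "on_error": {
        "GATEWAY_TIMEOUT": {"retry": 3, "backoff_seconds": 30},
        "ALREADY_REVERSED": {"skip": True},  # undocumented, but it happens
        "*": {"compensate": "void_refund", "alert": "payments-oncall"},
    },
}
```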

The path from 40% failure to production reliability

The projects that make it to production share a pattern: they treat workflow knowledge as a first-class engineering artifact, not something the model will figure out from context.

In practice, that means extracting workflow patterns from existing sources of truth: API specifications, test suites, internal documentation, runbooks. Validating those patterns against staging environments before deploying them. And building systems that learn from execution, so workflow maps get better every time a task succeeds or fails.
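
A sketch of the validation half, reusing the deterministic executor from earlier. The fixture format (recorded inputs plus an expected final output) is an assumption:

```python
# Replay extracted workflow maps against staging fixtures before
# promoting them; only plans above a pass-rate threshold go live.
def validate_workflow(plan, tools, fixtures, threshold=0.95):
    passed = 0
    for fixture in fixtures:  # e.g. inputs recorded from existing test runs
        try:
            results = execute(plan, tools, fixture["inputs"])
            if results[plan[-1]["id"]] == fixture["expected"]:
                passed += 1
        except Exception:
            pass  # any crash counts as a failed replay
    return passed / len(fixtures) >= threshold
```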

S&P Global found that 42% of companies abandoned most AI initiatives in 2025, up from 17% in 2024. MIT's research shows only 5% of AI initiatives produce measurable returns despite tens of billions in investment. These aren't model failures. They're infrastructure failures.

The 40% failure rate isn't inevitable. It's a symptom of a missing infrastructure layer. Build the workflow knowledge layer, validate it, make it available to agents in a structured format. The models are smart enough. They just need to know how the work actually gets done.


If you're interested in early access, reach out at hintas.com.

Photo by Logan Voss on Unsplash
