From toolbox to instructions: why endpoint-level MCP isn't enough

The MCP ecosystem is booming. Every week, new MCP servers pop up wrapping another SaaS API: Stripe, Salesforce, GitHub, Jira, Notion. Tools like Speakeasy and Stainless can auto-generate an MCP server from any OpenAPI spec in minutes. The toolbox is filling up fast.

But a toolbox isn't instructions. Giving an agent access to 200 Stripe endpoints via MCP is like handing a new hire the codebase keys and saying "deploy the feature." The agent has access. It doesn't have understanding. We keep running into the same version of this problem, and it explains why so many AI projects stall out despite having perfectly functional models.

The one-tool-per-endpoint pattern

The current standard for MCP server generation is simple: parse the OpenAPI spec, create one MCP tool per endpoint, map parameters in, map responses out. Done.

And it's useful! It standardizes API access for agents, kills custom integration code, and lets any MCP-compatible client talk to any API through a universal protocol. For single-step tasks ("look up customer #12345," "get the current balance," "list open tickets") it works great.
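To make the pattern concrete, here is a minimal sketch of the generation step described above, not the implementation used by Speakeasy, Stainless, or any other generator. The function name `tools_from_openapi` and the sample `SPEC` are hypothetical, and the sketch only maps `parameters`, ignoring request bodies, auth, and response mapping.

```python
from typing import Any

HTTP_METHODS = {"get", "post", "put", "patch", "delete"}


def tools_from_openapi(spec: dict[str, Any]) -> list[dict[str, Any]]:
    """Return one MCP-style tool definition per (path, method) pair."""
    tools = []
    for path, operations in spec.get("paths", {}).items():
        for method, op in operations.items():
            if method.lower() not in HTTP_METHODS:
                continue  # skip path-level keys such as "parameters"
            params = op.get("parameters", [])
            tools.append({
                "name": op.get("operationId", f"{method}_{path}"),
                "description": op.get("summary", ""),
                "inputSchema": {
                    "type": "object",
                    "properties": {p["name"]: p.get("schema", {}) for p in params},
                    "required": [p["name"] for p in params if p.get("required")],
                },
            })
    return tools


# Hypothetical two-endpoint spec, trimmed to the fields the sketch reads.
SPEC = {
    "paths": {
        "/api/v2/orders/{id}": {
            "get": {
                "operationId": "get_order",
                "summary": "Fetch one order",
                "parameters": [
                    {"name": "id", "in": "path", "required": True,
                     "schema": {"type": "string"}}
                ],
            }
        },
        "/api/v2/refunds": {
            "post": {"operationId": "create_refund", "summary": "Create a refund"}
        },
    }
}

tools = tools_from_openapi(SPEC)
print([tool["name"] for tool in tools])
```

Nothing in the output records that `get_order` should run before `create_refund`. Each tool is an island.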

Here's where it falls apart. SaaS APIs exist to support business processes, and business processes are multi-step. Processing a refund isn't one endpoint. It's seven endpoints called in a specific order with parameter dependencies between them. Onboarding a customer isn't one API call. It's provisioning accounts, configuring permissions, initializing billing, running compliance checks, sending welcome emails. Each step depends on the output of the last.

When an agent connects to an endpoint-level MCP server, it sees 200 independent tools. It has no information about which tools belong to which workflow, what order they run in, what parameters flow between them, or what to do when step four fails. The agent has to figure all of that out through trial and error. If you've tried to vibe-code a multi-step agent workflow, you know exactly how this goes.

What the benchmarks actually show

This isn't hypothetical. MCPMark tested 127 realistic MCP tasks across Notion, GitHub, Filesystem, PostgreSQL, and Playwright. The best model, GPT-5-medium, hit 52.6%. Claude Sonnet 4 and o3 fell below 30%. On average, models needed 16.2 turns and 17.4 tool calls per task. These aren't toy examples. They test real CRUD operations that mirror what agents actually do in production.

OSWorld-MCP found that giving agents MCP tools improved models like Gemini 2.5 Pro by up to 14 percentage points. But even the strongest model only invoked available tools 36.3% of the time. The tools were right there. The agents just didn't use them because they didn't know when or how they fit together.

Adding more tools doesn't fix this. It makes it worse. RAG-MCP research showed tool selection accuracy dropping from 43% to under 14% as the number of available tools grows. Prompt bloat overwhelms the model's ability to pick the right tool. Loading metadata for hundreds of endpoints burns tokens before the agent even reads the user's request.

The gap is workflow knowledge

What's missing between a toolbox and instructions is workflow knowledge. The stuff that tells you: of 200 Stripe endpoints, which 7 process a refund? In what order? What data flows between them? Step 3 needs the charge_id from step 1, not the customer_id, and an agent that gets this wrong tends to fabricate parameters instead.

There are preconditions too. The order must be within the return window AND the payment method must support reversals AND the user must have the right permissions. All at once. And then there's failure handling: if the payment reversal succeeds but the inventory update fails, you have to undo the payment reversal. These compensation actions don't emerge from endpoint descriptions. They require understanding the business process end to end.
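Here is one way that knowledge could be written down, as a rough sketch under stated assumptions. The `Step` type, `run_workflow`, the step functions, the IDs, and the 30-day return window are all hypothetical stand-ins, and real steps would call APIs instead of appending to a list. It shows the three pieces described above: preconditions checked together, data passed between steps through a shared context, and compensation in reverse order when a later step fails.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Context = dict  # shared state that carries outputs from one step to the next


@dataclass
class Step:
    name: str
    action: Callable[[Context], dict]  # returns values to merge into the context
    compensate: Optional[Callable[[Context], None]] = None  # undoes the action


def run_workflow(steps, preconditions, ctx):
    """Check every precondition, run steps in order, compensate on failure."""
    unmet = [name for name, check in preconditions.items() if not check(ctx)]
    if unmet:
        raise ValueError(f"preconditions not met: {unmet}")
    completed = []
    try:
        for step in steps:
            ctx.update(step.action(ctx))
            completed.append(step)
    except Exception:
        # Undo completed steps in reverse order, then surface the error.
        for step in reversed(completed):
            if step.compensate is not None:
                step.compensate(ctx)
        raise
    return ctx


log = []  # records side effects so the demo is observable


def find_charge(ctx):
    return {"charge_id": "ch_demo"}


def reverse_payment(ctx):
    log.append(f"reversed {ctx['charge_id']}")  # needs charge_id from step 1
    return {"refund_id": "re_demo"}


def undo_reversal(ctx):
    log.append(f"undid {ctx['refund_id']}")


def update_inventory(ctx):
    raise RuntimeError("inventory service unavailable")  # simulated failure


steps = [
    Step("find_charge", find_charge),
    Step("reverse_payment", reverse_payment, compensate=undo_reversal),
    Step("update_inventory", update_inventory),
]
preconditions = {
    "within_return_window": lambda ctx: ctx["days_since_purchase"] <= 30,
    "has_permission": lambda ctx: "refunds" in ctx["scopes"],
}

try:
    run_workflow(steps, preconditions, {"days_since_purchase": 5, "scopes": ["refunds"]})
except RuntimeError as err:
    print(f"workflow failed: {err}; log = {log}")
```

None of the ordering, the `charge_id` hand-off, or the `undo_reversal` pairing could be derived from the three endpoint descriptions alone. Someone who knows the process had to write it down.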

This knowledge already exists in your organization. Your Cypress and Playwright test suites encode the happy path. Your runbooks describe what to do when things break. Your Jira workflows capture the process. The senior engineers who built the system carry the rest in their heads.

It's not in your OpenAPI spec. The spec describes individual endpoints. It says nothing about how those endpoints combine into workflows. As we explored when writing about agent memory, this structural knowledge (entities, relationships, dependency ordering) is exactly what a knowledge graph captures and a flat tool list can't.

What "instructions" look like

Moving from toolbox to instructions means changing the agent's interface from "here are 200 tools" to "here are the workflows you can run."

Instead of an MCP server that exposes POST /api/v2/refunds, GET /api/v2/orders/{id}, PUT /api/v2/inventory/{sku}, and 197 other endpoints, the agent connects to a server with two tools:

search: the agent describes what it wants to accomplish in plain language. The system queries a knowledge graph of validated workflows and returns the best match: what it does, what inputs it needs, what preconditions apply, and what the expected outcome is.

execute: the agent provides the workflow identifier and input parameters. The system handles multi-step orchestration internally (calling the right APIs, in the right order, managing errors and compensations) and returns the result.
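A minimal sketch of that two-tool surface might look like the following. Everything here is an assumption made for illustration: the `WorkflowServer` and `Workflow` names are hypothetical, keyword overlap stands in for the knowledge graph query, and each workflow's `run` callable stands in for the internal multi-step orchestration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Workflow:
    workflow_id: str
    description: str
    inputs: list
    preconditions: list
    run: Callable[[dict], dict]  # placeholder for the orchestrated API calls


class WorkflowServer:
    def __init__(self, workflows):
        self._workflows = {w.workflow_id: w for w in workflows}

    def search(self, query: str) -> Optional[dict]:
        """Return the best-matching workflow summary, or None if nothing matches."""
        words = set(query.lower().split())

        def score(workflow):
            return len(words & set(workflow.description.lower().split()))

        best = max(self._workflows.values(), key=score, default=None)
        if best is None or score(best) == 0:
            return None
        return {
            "workflow_id": best.workflow_id,
            "description": best.description,
            "inputs": best.inputs,
            "preconditions": best.preconditions,
        }

    def execute(self, workflow_id: str, params: dict) -> dict:
        """Validate inputs, then hand off to the workflow's orchestration."""
        workflow = self._workflows[workflow_id]
        missing = [name for name in workflow.inputs if name not in params]
        if missing:
            raise ValueError(f"missing inputs: {missing}")
        return workflow.run(params)


server = WorkflowServer([
    Workflow(
        workflow_id="process_refund",
        description="Refund a paid order and restock its inventory",
        inputs=["order_id"],
        preconditions=["order within return window", "caller may issue refunds"],
        run=lambda params: {"status": "refunded", "order_id": params["order_id"]},
    ),
    Workflow(
        workflow_id="onboard_customer",
        description="Onboard a new customer with account billing and permissions",
        inputs=["email"],
        preconditions=[],
        run=lambda params: {"status": "onboarded", "email": params["email"]},
    ),
])

match = server.search("I need to refund an order")
result = server.execute(match["workflow_id"], {"order_id": "ord_demo"})
print(match["workflow_id"], result)
```

The agent's job shrinks to picking a workflow and supplying its inputs. Adding a third or a three-hundredth workflow changes the registry, not the tool list the agent sees.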

What used to be a fragile 7-step improvisation becomes a single tool call. The agent focuses on understanding what the user wants. Execution follows a validated path that doesn't depend on the model correctly sequencing API calls on the fly.

This matters for the same reason full autonomy is a trap. You want the agent making decisions where it's strong (understanding intent, handling ambiguity) and handing off execution to infrastructure where reliability actually matters.

The context window argument

There's a practical angle beyond reliability. An endpoint-level MCP server for a platform with 200 endpoints loads 200 tool schemas into the agent's context. Each schema has the tool name, description, parameter definitions, return types. At scale this can eat tens of thousands of tokens before the agent has read a single user message, leaving far less room for the actual task.

A workflow-level MCP server loads two tool schemas: search and execute. About 1,000 tokens of context overhead regardless of how many workflows or underlying endpoints the system supports. Workflow detail gets retrieved on demand through search, not loaded upfront.
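The scaling difference can be illustrated with a back-of-the-envelope script. This is a sketch only: the 4-characters-per-token ratio is a crude assumption rather than a real tokenizer, and `make_schema` produces a made-up schema, so the absolute numbers mean little. What matters is that one side grows with the endpoint count and the other does not.

```python
import json


def estimate_tokens(schemas):
    """Very rough estimate: assumes about 4 characters of JSON per token."""
    return sum(len(json.dumps(schema)) for schema in schemas) // 4


def make_schema(name):
    """Build a hypothetical tool schema with a few typical parameters."""
    return {
        "name": name,
        "description": f"Hypothetical description of what {name} does.",
        "inputSchema": {
            "type": "object",
            "properties": {
                "id": {"type": "string", "description": "Resource identifier"},
                "limit": {"type": "integer", "description": "Page size"},
                "expand": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["id"],
        },
    }


endpoint_level = [make_schema(f"endpoint_{i}") for i in range(200)]
workflow_level = [make_schema("search"), make_schema("execute")]

print("endpoint-level estimate:", estimate_tokens(endpoint_level))
print("workflow-level estimate:", estimate_tokens(workflow_level))
```

Doubling the endpoint count doubles the first figure. The second stays put no matter how many workflows sit behind search.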

MCP's own deferred loading mechanism works on the same principle: only load tool definitions when needed, not at init. But deferred loading is a protocol-level optimization. The toolbox-to-instructions shift is an architectural change that kills the problem at its root. Stainless ran into this firsthand: they had to build client-specific schema adaptations because Cursor caps tools at 40 and Claude Code can't handle arrays in certain positions. Those are symptoms of an architectural mismatch, not client bugs.

What needs to happen

The MCP ecosystem needs to move from endpoint wrapping to workflow intelligence. That requires capabilities current MCP server generators don't have.

First, workflow extraction. Automatically pulling multi-step patterns from existing sources: OpenAPI specs for the API surface, end-to-end test suites for happy paths, internal docs for business rules, operational runbooks for error handling. This is the same vertical knowledge that makes industry-specific AI outperform generic tools. It's domain-specific, hard-earned, and you can't prompt-engineer it into existence.

Second, workflow validation. Running extracted workflows against staging environments. A workflow that looks correct on paper but fails in practice is worse than having nothing, because it creates false confidence. Production-grade operations require saga-pattern transactions, observability, and tested compensation chains before anything touches real data.

Third, workflow evolution. APIs change. New endpoints appear, parameters get added, auth scopes shift. The workflow knowledge layer has to keep pace with the API surface without manual updates every time something changes. Every execution, successful or failed, should teach the system something new.
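One small piece of keeping pace can be sketched as a drift check: compare the endpoints a stored workflow depends on against the current spec and flag anything that has disappeared. The function name `find_stale_steps`, the workflow steps, and the spec below are hypothetical, and a real check would also need to compare parameters and auth scopes, not just paths and methods.

```python
def find_stale_steps(workflow_steps, spec):
    """Return the (METHOD, path) steps that no longer exist in the spec."""
    available = {
        (method.upper(), path)
        for path, operations in spec.get("paths", {}).items()
        for method in operations
    }
    return [step for step in workflow_steps if step not in available]


# Hypothetical stored workflow: the endpoints it calls, in order.
refund_workflow = [
    ("GET", "/api/v2/orders/{id}"),
    ("POST", "/api/v2/refunds"),
    ("PUT", "/api/v2/inventory/{sku}"),
]

# Hypothetical current spec in which the inventory update moved from PUT to PATCH.
current_spec = {
    "paths": {
        "/api/v2/orders/{id}": {"get": {}},
        "/api/v2/refunds": {"post": {}},
        "/api/v2/inventory/{sku}": {"patch": {}},
    }
}

stale = find_stale_steps(refund_workflow, current_spec)
print(stale)
```

A non-empty result is a signal to re-validate the workflow in staging before it runs again, not something to patch automatically.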

The toolbox era of MCP was necessary. It solved standardization and proved a universal agent-to-tool protocol is viable. But as MCP and A2A converge into a unified interoperability layer, the workflow knowledge gap only gets wider. Multi-agent coordination multiplies the number of tools and the complexity of sequencing them correctly. The next phase is about what sits on top of the toolbox: the workflow knowledge that turns tool access into task completion.

If you're thinking about building AI into your product as a foundation, this is the infrastructure that actually makes that work. The knowledge of how work gets done, encoded so agents can use it, validated before it touches production, and getting better every time it runs.

If you're interested in early access, reach out at hintas.com.
