<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Hintas]]></title><description><![CDATA[We make your agents capable of driving software. No guessing; grounded executions.]]></description><link>https://hintas.blog</link><image><url>https://cdn.hashnode.com/uploads/logos/69b300f493256dfc53fd0fc2/054939f7-158c-46c0-9b86-eaf74016dd1e.png</url><title>Hintas</title><link>https://hintas.blog</link></image><generator>RSS for Node</generator><lastBuildDate>Fri, 17 Apr 2026 02:39:02 GMT</lastBuildDate><atom:link href="https://hintas.blog/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Enterprise AI ROI: stop measuring prompts, start measuring workflows]]></title><description><![CDATA[Every enterprise AI meeting I've sat in this year eventually lands on the same question: "where's the ROI?" Fair enough. Companies have been buying GPUs and API credits for two years now. The patience is running out.
An MIT NANDA report based on 150 l...]]></description><link>https://hintas.blog/enterprise-ai-roi-measure-workflows-not-prompts</link><guid isPermaLink="true">https://hintas.blog/enterprise-ai-roi-measure-workflows-not-prompts</guid><category><![CDATA[ai agents]]></category><category><![CDATA[Developer Tools]]></category><category><![CDATA[Enterprise AI]]></category><category><![CDATA[llm]]></category><category><![CDATA[Workflow Automation]]></category><dc:creator><![CDATA[Dante Kakhadze]]></dc:creator><pubDate>Thu, 16 Apr 2026 14:05:18 GMT</pubDate><enclosure url="https://images.unsplash.com/photo-1526628953301-3e589a6a8b74?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3w4OTU2Njd8MHwxfHNlYXJjaHwxfHxlbnRlcnByaXNlJTIwYnVzaW5lc3MlMjBtZXRyaWNzJTIwZGFzaGJvYXJkfGVufDB8MHx8fDE3NzYzMTQ2MTh8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Every enterprise AI meeting I've sat in this year eventually lands on the same question: "where's the ROI?" Fair enough. Companies have been buying GPUs and API credits for two years now. The patience is running out.</p>
<p>An <a target="_blank" href="https://fortune.com/2025/08/18/mit-report-95-percent-generative-ai-pilots-at-companies-failing-cfo/">MIT NANDA report</a> based on 150 leader interviews and 300 public AI deployments found that 95% of generative AI pilots fail to deliver ROI. Not because the models are bad. Because the tools "don't learn from or adapt to workflows." Meanwhile, <a target="_blank" href="https://www.aigovernancetoday.com/news/enterprise-ai-spending-crisis-2026">AI Governance Today</a> reports that 61% of AI projects are never formally measured after deployment. Sixty-one percent. Companies are spending money, shipping pilots, and then just... not checking.</p>
<p>Most organizations measure AI ROI at the prompt level: tokens consumed, inference latency, model accuracy on isolated benchmarks. These metrics tell you how well your AI components perform. They tell you nothing about whether AI is actually improving business outcomes.</p>
<p>The few teams actually seeing ROI measure something different: workflow completion. That gap explains why <a target="_blank" href="https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai">73% of enterprise AI projects fail to deliver projected ROI</a> according to McKinsey, with most respondents saying less than 5% of their EBIT is attributable to AI.</p>
<h2 id="heading-the-measurement-problem">The measurement problem</h2>
<p>Say a customer service team deploys an AI agent for refund requests. The component-level metrics look great: the LLM responds in under 2 seconds, intent classification accuracy exceeds 95%, customer satisfaction scores on individual responses are high.</p>
<p>Zoom out to the workflow level and the picture changes. Only 40% of refund workflows complete end-to-end without human escalation. Average resolution time actually increased because the agent gets stuck midway through multi-step processes and the customer has to start over with a person. Cost per resolution went up, not down, because failed agent attempts burned API calls and compute without completing anything.</p>
<p>The component performed well. The workflow performed poorly. And the business outcome depends on the workflow.</p>
<p>We covered this exact pattern in <a target="_blank" href="/why-40-percent-of-ai-projects-fail">Why 40% of AI projects fail</a>. The model isn't the bottleneck. The missing workflow knowledge is.</p>
<h2 id="heading-what-you-should-be-measuring-instead">What you should be measuring instead</h2>
<p>The metric that matters most is completion rate: what percentage of workflows finish end-to-end without human intervention? A workflow that completes 95% of the time delivers value. One that completes 40% of the time generates support tickets.</p>
<p>The benchmarks make the multi-step gap painfully clear. On <a target="_blank" href="https://os-world.github.io/">OSWorld</a>, which tests agents on real multi-step computer tasks, humans score 72% while the best AI agents <a target="_blank" href="https://o-mega.ai/articles/the-2025-2026-guide-to-ai-computer-use-benchmarks-and-top-ai-agents">top out around 45%</a>. On WebArena, agents hit 61.7% on standard tasks but drop to 37.8% on the multi-step WebChoreArena variant. Each step introduces a failure probability that compounds across the chain.</p>
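<p>The compounding is worth making concrete. Under a simple independence assumption, where each step succeeds with probability p, an n-step workflow completes end-to-end with probability p^n. This is an illustrative model, not a reconstruction of any benchmark's methodology:</p>

```javascript
// End-to-end completion under independent per-step success probability p.
// Illustrative only: real agent steps are rarely independent, and failures
// often cascade, which makes the real picture worse, not better.
function completionRate(p, n) {
  return Math.pow(p, n);
}

// A 97%-reliable step looks excellent in isolation...
console.log(completionRate(0.97, 1).toFixed(2));  // 0.97
// ...but chained across 15 steps, barely 63% of workflows finish.
console.log(completionRate(0.97, 15).toFixed(2)); // 0.63
```

<p>This is why component accuracy and workflow completion diverge so sharply: the per-step metric hides the exponent.</p>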
<p>The second metric is cost per completed workflow. Not cost per API call. Not cost per token. The full cost of getting from trigger to business outcome, including human escalation when the agent bails.</p>
<p>This one reveals something uncomfortable: partially automated workflows can cost more than fully manual ones. A human handling a refund end-to-end costs X. An agent that handles the first three steps, fails, and escalates to a human who starts over costs X plus the agent's compute and API costs. <a target="_blank" href="https://hummingagent.ai/blog/ai-automation-cost-pricing-guide-2026">Production AI agents run $3,200-$13,000/month</a> covering LLM API, infrastructure, monitoring, and security. If your completion rate is low, that spend is generating escalation tickets, not savings.</p>
<p>Then there's time to value: how long from trigger to business outcome? For a refund, that's time from request to confirmed refund. For onboarding, time from signup to active usage. AI should shrink this. If instead the agent burns cycles on retries and reasons sequentially through steps it should already know, you're paying more to go slower.</p>
<h2 id="heading-why-this-keeps-happening">Why this keeps happening</h2>
<p>I keep seeing the same playbook. Organization deploys an LLM, wraps their APIs in tool definitions, connects the agent, ships it. Component metrics look fine. Workflow metrics are terrible.</p>
<p><a target="_blank" href="https://masterofcode.com/blog/ai-roi">Only 21% of organizations using generative AI have actually redesigned their workflows</a>. The rest bolt AI onto existing processes and wonder why it doesn't work. A BCG study of 1,250 companies found that <a target="_blank" href="https://masterofcode.com/blog/ai-roi">only 5% achieve substantial value from AI at scale</a>, while 60% report minimal gains despite investment.</p>
<p>The root cause is the workflow knowledge gap. The agent can call any individual API correctly. It can't reliably sequence multiple APIs into a complete business workflow because the knowledge of how those APIs connect (dependencies, parameter mappings, preconditions, error handling paths) <a target="_blank" href="/from-toolbox-to-instructions-endpoint-mcp-isnt-enough">isn't encoded anywhere the agent can access</a>. 77% of AI project failures are organizational, not technical. Only 23% are model or data issues.</p>
<p>We keep running into this. As we wrote in <a target="_blank" href="/agentic-ops-running-ai-workflows-in-production">Agentic ops in production</a>, agents that modify real data need Saga-pattern transactions, workflow-level observability, and context-decoupled execution. Without that plumbing, you've got a very expensive autocomplete.</p>
<h2 id="heading-the-roi-math-changes-with-workflow-reliability">The ROI math changes with workflow reliability</h2>
<p>Let's do the math on a concrete example.</p>
<p>A customer service team handling 10,000 refund requests per month at $15 per manual resolution spends $150,000/month. An AI agent with 95% workflow completion rate handles 9,500 requests at $2 per automated resolution ($19,000) and 500 escalations at $20 each ($10,000). Total: $29,000/month. Savings: $121,000/month.</p>
<p>Same agent, 40% completion rate. 4,000 requests at $2 ($8,000) and 6,000 escalations at $20 ($120,000). Total: $128,000/month. Savings: $22,000/month. And that's before accounting for the customer satisfaction hit from 6,000 failed automated interactions.</p>
<p>The difference between 95% and 40% completion is $99,000/month in this example. Completion rate is the lever. And it's determined by the reliability of multi-step workflow execution, which is an infrastructure problem, not a model problem.</p>
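<p>The arithmetic above fits in a few lines. A sketch of the cost model (the function name and parameters are illustrative, not from any particular finance tool):</p>

```javascript
// Monthly cost of a partially automated workflow: completed runs bill at
// the automated rate, failed runs escalate to humans at a higher rate.
function monthlyCost(volume, completionRate, autoCost, escalationCost) {
  const completed = Math.round(volume * completionRate);
  const escalated = volume - completed;
  return completed * autoCost + escalated * escalationCost;
}

const manualBaseline = 10000 * 15; // $150,000/month, all-human baseline
console.log(manualBaseline - monthlyCost(10000, 0.95, 2, 20)); // 121000
console.log(manualBaseline - monthlyCost(10000, 0.40, 2, 20)); // 22000
```

<p>Savings are exactly linear in completion rate, which is why a reliability problem shows up directly as an ROI problem.</p>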
<p>Companies measuring at the workflow level are seeing this play out. <a target="_blank" href="https://www.vellum.ai/blog/ai-agent-use-cases-guide-to-unlock-ai-roi">Shell reduced unplanned downtime by 20%, saving roughly $2 billion annually</a>. HSBC saw 2-4x better fraud detection with 60% fewer false alerts. Dole Ireland cut manual AP reconciliation by 85%. None of these teams got there by optimizing tokens per second.</p>
<h2 id="heading-so-what-do-you-actually-do-about-it">So what do you actually do about it</h2>
<p><a target="_blank" href="https://www.gartner.com/en/newsroom/press-releases/2025-08-26-gartner-predicts-40-percent-of-enterprise-apps-will-feature-task-specific-ai-agents-by-2026-up-from-less-than-5-percent-in-2025">Gartner predicts 40% of enterprise applications will feature AI agents by end of 2026</a>, up from less than 5% in 2025. That's a lot of deployment about to happen. Most of it will underperform unless the workflow infrastructure catches up.</p>
<p>Start by extracting workflow knowledge from existing sources of truth instead of expecting the model to figure it out at runtime. Your API specs, test suites, and documentation already contain the workflow logic. Pull it out, validate it, make it available as structured execution paths. <a target="_blank" href="https://www.aigovernancetoday.com/news/enterprise-ai-spending-crisis-2026">Retrofitting governance later costs 3-5x more</a> than building it in from the start.</p>
<p>Test every workflow against staging before it touches real customer requests. This is where you catch parameter mismatches, missing preconditions, and incorrect sequencing that would show up as production failures.</p>
<p>If your AI dashboard shows token consumption and inference latency but not workflow completion rate and cost per completed task, you're looking at the wrong dashboard. Build workflow-level metrics into your observability stack before you deploy, not after you notice the ROI isn't there.</p>
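<p>Instrumenting this doesn't require much. A minimal sketch of a workflow-level tracker (the class and field names are hypothetical, not from any existing observability stack):</p>

```javascript
// Tracks the two numbers that matter per workflow type: completion rate
// and cost per *completed* run. Failed runs add cost but no completions,
// which is exactly the asymmetry that prompt-level metrics hide.
class WorkflowMetrics {
  constructor() {
    this.runs = [];
  }
  record({ completed, cost }) {
    this.runs.push({ completed, cost });
  }
  completionRate() {
    if (this.runs.length === 0) return 0;
    return this.runs.filter(r => r.completed).length / this.runs.length;
  }
  costPerCompletion() {
    const done = this.runs.filter(r => r.completed).length;
    const totalCost = this.runs.reduce((sum, r) => sum + r.cost, 0);
    return done === 0 ? Infinity : totalCost / done;
  }
}

const refunds = new WorkflowMetrics();
refunds.record({ completed: true, cost: 2 });
refunds.record({ completed: false, cost: 5 }); // failed attempt still costs money
console.log(refunds.completionRate());    // 0.5
console.log(refunds.costPerCompletion()); // 7
```

<p>Note that the failed run's cost lands in the numerator of cost-per-completion but adds nothing to the denominator. That's the mechanism behind partially automated workflows costing more than manual ones.</p>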
<p>And make every workflow execution generate data that improves future runs. Failed workflows should feed constraints and alternative paths back into the knowledge base. Successful ones reinforce validated patterns. This is the <a target="_blank" href="/agent-memory-as-first-class-primitive">agent memory problem</a> applied at the organizational level. Over time, completion rate climbs as the system accumulates operational experience.</p>
<p>Enterprise AI ROI is real. But it lives at the workflow layer, not the model layer. Keep optimizing prompts and you'll keep wondering why the numbers don't add up. Start measuring workflows and you'll find out where the money actually went.</p>
<hr />
<p><em>If you're interested in early access, reach out at <a target="_blank" href="https://hintas.com">hintas.com</a>.</em></p>
<p><em>Photo by <a target="_blank" href="https://unsplash.com/@dawson2406?utm_source=hintas&amp;utm_medium=referral">Stephen Dawson</a> on <a target="_blank" href="https://unsplash.com?utm_source=hintas&amp;utm_medium=referral">Unsplash</a></em></p>
]]></content:encoded></item><item><title><![CDATA[Anthropic vs. the Pentagon: the First Amendment case that will define AI ethics for a decade]]></title><description><![CDATA[In July 2025, Anthropic signed a $200 million contract with the Pentagon. Claude would run on classified networks. It was the kind of deal that validated everything the AI safety crowd had been arguing — that you could build powerful AI and still mai...]]></description><link>https://hintas.blog/anthropic-vs-pentagon-first-amendment-ai-ethics</link><guid isPermaLink="true">https://hintas.blog/anthropic-vs-pentagon-first-amendment-ai-ethics</guid><category><![CDATA[ai agents]]></category><category><![CDATA[AI ethics]]></category><category><![CDATA[cybersecurity]]></category><category><![CDATA[Enterprise AI]]></category><category><![CDATA[government]]></category><dc:creator><![CDATA[Dante Kakhadze]]></dc:creator><pubDate>Tue, 07 Apr 2026 02:26:35 GMT</pubDate><enclosure url="https://images.unsplash.com/photo-1723479812114-9d25a4133946?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3w4OTU2Njd8MHwxfHNlYXJjaHwxfHxwZW50YWdvbiUyMGdvdmVybm1lbnQlMjBidWlsZGluZyUyMHdhc2hpbmd0b258ZW58MHwwfHx8MTc3NTUyODczMXww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In July 2025, Anthropic signed a <a target="_blank" href="https://www.anthropic.com/news/anthropic-and-the-department-of-defense-to-advance-responsible-ai-in-defense-operations">$200 million contract</a> with the Pentagon. Claude would run on classified networks. It was the kind of deal that validated everything the AI safety crowd had been arguing — that you could build powerful AI and still maintain guardrails on how it gets used.</p>
<p>Eight months later, that same company became the first American business in history to be <a target="_blank" href="https://www.npr.org/2026/03/06/g-s1-112713/pentagon-labels-ai-company-anthropic-a-supply-chain-risk">designated a supply chain risk</a> by the Department of Defense. The label had previously been reserved for foreign adversaries like Huawei and Kaspersky; now it was applied to an American AI company actively deployed across U.S. military systems.</p>
<p>What happened between July and February is worth paying attention to.</p>
<h2 id="heading-two-red-lines">Two red lines</h2>
<p>The dispute started simply enough. By September 2025, Anthropic was negotiating Claude's deployment on <a target="_blank" href="https://www.cnbc.com/2026/03/24/anthropic-lawsuit-pentagon-supply-chain-risk-claude.html">GenAI.mil</a>, the Pentagon's AI platform. The DOD wanted unrestricted access to Claude across all lawful purposes. Anthropic drew two lines.</p>
<p>First: Claude would not be used for mass surveillance of American citizens. Second: Claude would not power fully autonomous weapons systems — weapons that select and engage targets without a human making the final call.</p>
<p>Neither of these was a radical position. Both align with <a target="_blank" href="https://www.defense.gov/News/Releases/Release/Article/3578219/dod-announces-update-to-dod-directive-300009-autonomy-in-weapon-systems/">existing DOD policy on autonomous weapons</a>. The Pentagon's own <a target="_blank" href="https://www.esd.wh.mil/portals/54/documents/dd/issuances/dodd/300009p.pdf">Directive 3000.09</a> requires "appropriate levels of human judgment" in the use of force. Anthropic was basically asking DOD to put in writing what its own policy already says.</p>
<p>The DOD refused.</p>
<h2 id="heading-from-contract-dispute-to-presidential-directive">From contract dispute to presidential directive</h2>
<p>In January 2026, DOD <a target="_blank" href="https://thehill.com/policy/defense/5752960-pentagon-threatens-anthropic-contract/">told Anthropic to grant unrestricted access</a> or face consequences. Defense Secretary Pete Hegseth set a <a target="_blank" href="https://www.axios.com/2026/02/15/claude-pentagon-anthropic-contract-maduro">final deadline of February 27</a>.</p>
<p>Anthropic held its position.</p>
<p>On February 27, Trump <a target="_blank" href="https://www.npr.org/2026/02/27/nx-s1-5729118/trump-anthropic-pentagon-openai-ai-weapons-ban">posted on Truth Social</a> directing federal agencies to "IMMEDIATELY CEASE all use of Anthropic's technology." Hours later, Hegseth designated Anthropic a supply chain risk under <a target="_blank" href="https://www.justsecurity.org/132851/anthropic-supply-chain-risk-designation/">10 U.S.C. § 3252</a>, a statute written to protect against adversaries who might "sabotage, maliciously introduce unwanted function, or otherwise subvert" government systems.</p>
<p>The same day, <a target="_blank" href="https://www.npr.org/2026/02/27/nx-s1-5729118/trump-anthropic-pentagon-openai-ai-weapons-ban">OpenAI announced it had struck a deal</a> with the Pentagon to provide its own models for classified networks.</p>
<h2 id="heading-the-legal-theory-that-didnt-hold-up">The legal theory that didn't hold up</h2>
<p>Anthropic filed suit. The hearing, on March 24 in San Francisco, did not go well for the government.</p>
<p>Judge Rita Lin <a target="_blank" href="https://www.cnbc.com/2026/03/24/anthropic-lawsuit-pentagon-supply-chain-risk-claude.html">pressed DOD attorneys</a> on the basis for the supply chain risk designation. The government's own internal files showed the designation wasn't triggered by any security assessment. An internal Pentagon memo referenced Anthropic's "increasingly hostile manner through the press" — not any technical vulnerability or espionage concern.</p>
<p>Two days later, Judge Lin <a target="_blank" href="https://www.cnbc.com/2026/03/26/anthropic-pentagon-dod-claude-court-ruling.html">issued a preliminary injunction</a> blocking the designation. Her language was pointed:</p>
<blockquote>
<p>"Punishing Anthropic for bringing public scrutiny to the government's contracting position is classic illegal First Amendment retaliation."</p>
</blockquote>
<p>And:</p>
<blockquote>
<p>"Nothing in the governing statute supports the Orwellian notion that an American company may be branded a potential adversary and saboteur of the U.S. for expressing disagreement with the government."</p>
</blockquote>
<p>She also found DOD violated Anthropic's due process rights by giving no advance notice and no opportunity to respond before the ban took effect.</p>
<p>Legal scholars at <a target="_blank" href="https://www.justsecurity.org/132851/anthropic-supply-chain-risk-designation/">Just Security</a> and <a target="_blank" href="https://www.lawfaremedia.org/article/pentagon's-anthropic-designation-won't-survive-first-contact-with-legal-system">Lawfare</a> had already argued the designation stretched § 3252 well past its intended scope. The statute covers procurement exclusions for national security systems. Hegseth's directive prohibited defense contractors from conducting "any commercial activity" with Anthropic, which looks a lot more like sanctions authority. Congress never granted DOD that power.</p>
<h2 id="heading-the-leaked-memo-and-the-market-reaction">The leaked memo and the market reaction</h2>
<p>Between the ban and the ruling, things got messy.</p>
<p>On March 4, an <a target="_blank" href="https://techcrunch.com/2026/03/04/anthropic-ceo-dario-amodei-calls-openais-messaging-around-military-deal-straight-up-lies-report-says/">internal memo from Dario Amodei</a> leaked. In it, Anthropic's CEO called OpenAI's messaging around the Pentagon deal "straight up lies," referred to Altman's public statements as "safety theater," and characterized OpenAI employees as "gullible." Two days later, Amodei <a target="_blank" href="https://www.axios.com/2026/03/06/pentagon-anthropic-amodei-apology">publicly apologized for the tone</a> while maintaining Anthropic's legal position.</p>
<p>The market reaction was immediate. <a target="_blank" href="https://techcrunch.com/2026/03/02/chatgpt-uninstalls-surged-by-295-after-dod-deal/">ChatGPT uninstalls surged 295% day-over-day</a> after OpenAI's Pentagon announcement. One-star reviews jumped 775%. <a target="_blank" href="https://www.business-standard.com/technology/tech-news/chatgpt-uninstalls-openai-pentagon-deal-ai-defence-dod-claude-app-store-126030300464_1.html">Claude hit #1 on the U.S. App Store</a>. Over <a target="_blank" href="https://letsdatascience.com/blog/altman-called-the-pentagon-deal-sloppy-1-5-million-users-had-already-left">1.5 million users joined the "QuitGPT" movement</a>.</p>
<p>People cared about the terms, not just the capability. That was the surprising part.</p>
<h2 id="heading-why-this-isnt-google-maven-20">Why this isn't Google Maven 2.0</h2>
<p>People keep comparing this to <a target="_blank" href="https://sfstandard.com/opinion/2026/04/03/google-maven-anthropic-pentagon-ai/">Google's 2018 Project Maven crisis</a>. I don't think the comparison holds.</p>
<p>Maven was about whether a tech company should work with the military at all. Thousands of Google employees signed a letter. Google pulled out. The systems Maven built were image classifiers for drone footage.</p>
<p>Anthropic wasn't refusing to work with the Pentagon. It had a $200 million contract and wanted to keep it. The fight was over terms of deployment: what should an AI system that can reason, plan, and execute multi-step operations be allowed to do without a human in the loop?</p>
<p>In 2018, the AI was a classifier. In 2026, the AI is an agent. The question moved from "should tech help the military" to "under what conditions can autonomous AI systems operate in military contexts." Harder question. Higher stakes.</p>
<h2 id="heading-the-part-nobody-wants-to-hear">The part nobody wants to hear</h2>
<p>The <a target="_blank" href="https://www.eff.org/deeplinks/2026/03/anthropic-dod-conflict-privacy-protections-shouldnt-depend-decisions-few-powerful">Electronic Frontier Foundation</a>, along with the Cato Institute and FIRE, filed an amicus brief supporting Anthropic. But the EFF made a broader point that both sides find uncomfortable.</p>
<p>Privacy protections shouldn't depend on which CEO happens to care about them.</p>
<p>Anthropic drew a line. Good. But if Amodei gets hit by a bus tomorrow, or if the board decides the DOD revenue is too important to lose, those protections vanish. <a target="_blank" href="https://www.eff.org/deeplinks/2026/03/anthropic-dod-conflict-privacy-protections-shouldnt-depend-decisions-few-powerful">71% of American adults</a> say they're concerned about government data use. 70% of adults aware of AI have minimal trust in corporate data practices. People want laws, not corporate goodwill.</p>
<p>Amodei himself acknowledged this. "I actually do believe it is Congress's job" to address surveillance risks posed by AI, he <a target="_blank" href="https://www.eff.org/deeplinks/2026/03/anthropic-dod-conflict-privacy-protections-shouldnt-depend-decisions-few-powerful">told reporters</a>.</p>
<p>The <a target="_blank" href="https://www.eff.org/deeplinks/2026/03/anthropic-dod-conflict-privacy-protections-shouldnt-depend-decisions-few-powerful">Fourth Amendment Not for Sale Act</a> — which would close the data broker loophole that lets intelligence agencies purchase surveillance data without warrants — passed the House in 2024 and stalled in the Senate. That loophole is directly relevant here: military and intelligence agencies already <a target="_blank" href="https://www.eff.org/deeplinks/2026/03/anthropic-dod-conflict-privacy-protections-shouldnt-depend-decisions-few-powerful">routinely purchase commercial data</a> to enable broad surveillance without judicial oversight.</p>
<h2 id="heading-what-happens-next">What happens next</h2>
<p>The case is now on two parallel tracks. The DOJ <a target="_blank" href="https://www.axios.com/2026/04/02/trump-administration-appeals-anthropic-pentagon">filed an appeal to the Ninth Circuit</a> on April 2, with a filing deadline of <a target="_blank" href="https://www.usnews.com/news/technology/articles/2026-04-02/trump-administration-appeals-ruling-that-blocked-pentagon-action-against-anthropic-over-ai-dispute">April 30</a>. A separate challenge to the § 4713 designation is pending in the <a target="_blank" href="https://insidedefense.com/insider/pentagon-appealing-order-remove-anthropic-supply-chain-risk-label">D.C. Circuit</a>. Both have to be resolved for the matter to be fully settled.</p>
<p>Hours after Judge Lin's ruling, Pentagon CTO Emil Michael <a target="_blank" href="https://breakingdefense.com/2026/03/judge-grants-anthropic-preliminary-injunction-but-pentagon-cto-says-ban-still-stands/">posted on X</a> claiming the supply chain risk designation was "in full force and effect" under a different statutory authority than the one Lin blocked. That kind of jurisdictional maneuvering suggests the administration isn't backing down.</p>
<p>The Ninth Circuit ruling will likely come by mid-summer. If it upholds Lin's injunction, it establishes that the government can't use procurement designations to punish companies for public speech. That precedent extends well beyond AI. If it overturns the injunction, it signals that national security designations can be weaponized against domestic companies that disagree with government policy.</p>
<p>Either outcome sets a template. Every AI company negotiating government contracts is watching this. Every general counsel is recalibrating what they can and can't say publicly about how their technology gets used. The chilling effect is already operating, independent of which way the Ninth Circuit rules.</p>
<h2 id="heading-what-this-means-if-you-build-ai-systems">What this means if you build AI systems</h2>
<p>I keep coming back to this: Anthropic's two red lines are things we talk about all the time in non-military contexts.</p>
<p>We've written about why <a target="_blank" href="/human-in-the-loop-is-the-production-architecture">human-in-the-loop is the production architecture</a> for agent systems. Not because autonomy is impossible, but because the consequences of fully autonomous execution in high-stakes contexts are too severe to leave to a model alone. Anthropic's position on autonomous weapons is the military version of that same argument.</p>
<p>And the <a target="_blank" href="/agent-permissions-the-real-mcp-security-gap">security problems with agent permissions</a> don't get better when the customer is the federal government. An AI agent with unrestricted access to classified systems and no contractual guardrails on surveillance is exactly the kind of risk that <a target="_blank" href="https://hintas.blog/agentic-ai-security-crisis-top-attack-vector-2026">48% of CISOs</a> are already calling their top concern for 2026.</p>
<p>The question this case is really about: do AI companies get to define the boundaries of their own technology's deployment, or can governments compel unrestricted access through economic coercion?</p>
<p>Judge Lin gave a clear answer. The Ninth Circuit will give a more permanent one this summer.</p>
<hr />
<p><em>If you're interested in early access, reach out at <a target="_blank" href="https://hintas.com">hintas.com</a>.</em></p>
<p><em>Photo by <a target="_blank" href="https://unsplash.com/@shionadas?utm_source=hintas&amp;utm_medium=referral">Shiona Das</a> on <a target="_blank" href="https://unsplash.com?utm_source=hintas&amp;utm_medium=referral">Unsplash</a></em></p>
]]></content:encoded></item><item><title><![CDATA[The Axios npm supply chain attack: a North Korean trojan inside the world's most popular HTTP library]]></title><description><![CDATA[On March 31, 2026, someone hijacked the npm account of the lead Axios maintainer and published two poisoned versions of one of the most-downloaded packages in the JavaScript ecosystem. Axios pulls over 100 million weekly downloads. If you've built an...]]></description><link>https://hintas.blog/axios-npm-supply-chain-attack-north-korean-trojan</link><guid isPermaLink="true">https://hintas.blog/axios-npm-supply-chain-attack-north-korean-trojan</guid><category><![CDATA[cybersecurity]]></category><category><![CDATA[Devops]]></category><category><![CDATA[npm]]></category><category><![CDATA[open source]]></category><category><![CDATA[supply chain security]]></category><dc:creator><![CDATA[Dante Kakhadze]]></dc:creator><pubDate>Fri, 03 Apr 2026 18:17:29 GMT</pubDate><enclosure url="https://images.unsplash.com/photo-1592744254966-58c65cfd2e69?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3w4OTU2Njd8MHwxfHNlYXJjaHwxfHxzdXBwbHklMjBjaGFpbiUyMHNlY3VyaXR5JTIwZGlnaXRhbCUyMGxvY2t8ZW58MHwwfHx8MTc3NTIzOTkxNnww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>On March 31, 2026, someone hijacked the npm account of the lead <a target="_blank" href="https://axios-http.com/">Axios</a> maintainer and published two poisoned versions of one of the most-downloaded packages in the JavaScript ecosystem. Axios pulls <a target="_blank" href="https://www.sans.org/blog/axios-npm-supply-chain-compromise-malicious-packages-remote-access-trojan">over 100 million weekly downloads</a>. If you've built anything in JavaScript in the last five years, it's probably in your dependency tree.</p>
<p>The malicious versions, <a target="_blank" href="https://www.sans.org/blog/axios-npm-supply-chain-compromise-malicious-packages-remote-access-trojan">1.14.1 and 0.30.4</a>, silently installed a cross-platform Remote Access Trojan through a dependency called <code>plain-crypto-js</code>. The attack window lasted about three hours, from 00:21 to 03:29 UTC. In that window, anyone who ran <code>npm install</code> and resolved to those versions got owned.</p>
<p>Three hours. That's all it took.</p>
<h2 id="heading-how-the-attack-worked">How the attack worked</h2>
<p>The attacker didn't touch GitHub. No pull request, no compromised CI/CD pipeline, no poisoned GitHub Actions workflow. They stole a publishing token and used it to push directly to the npm registry. GitHub's code review process, branch protections, and audit logs were all irrelevant. The malicious code never existed in the source repository.</p>
<p>That's what makes registry-level attacks different from everything else in the threat model. Your code review is useless because there's nothing in the repo to review. The malware only exists in the published package, which is the thing your users and build systems actually install.</p>
<p>The payload was staged carefully. The <code>plain-crypto-js</code> dependency <a target="_blank" href="https://www.sans.org/blog/axios-npm-supply-chain-compromise-malicious-packages-remote-access-trojan">was pre-registered 18 hours before the poisoned Axios versions went live</a>. On install, it ran platform-specific payloads: a binary at <code>/Library/Caches/com.apple.act.mond</code> on macOS, <code>%PROGRAMDATA%\wt.exe</code> on Windows, and <code>/tmp/ld.py</code> on Linux. The RAT called home to <code>sfrclak.com</code> on port 8000, with callback URLs disguised as npm traffic (<code>packages.npm.org/product0</code>, <code>product1</code>, <code>product2</code> for each platform).</p>
<p>The malware harvested cloud access keys, database passwords, API tokens, SSH keys, and anything else it could find in environment variables and credential files. It wrote persistence hooks into <code>.bashrc</code> and <code>.zshrc</code>. Once the RAT was in place, the attacker had a persistent foothold on your machine, one that survived reboots and terminal restarts.</p>
<p>Encryption was AES-256 for the data itself, RSA-4096 for the session key. Not amateur work.</p>
<h2 id="heading-the-teampcp-campaign">The TeamPCP campaign</h2>
<p>The Axios compromise wasn't an isolated incident. It was the fifth in a <a target="_blank" href="https://unit42.paloaltonetworks.com/teampcp-supply-chain-attacks/">two-week spree by a group researchers call TeamPCP</a>.</p>
<p>Here's the timeline:</p>
<p>March 19: <a target="_blank" href="https://unit42.paloaltonetworks.com/teampcp-supply-chain-attacks/">Aqua Security's Trivy</a>, one of the most widely used container vulnerability scanners. TeamPCP exploited an incomplete credential rotation from a minor breach in late February, force-pushed malicious code to 76 of 77 version tags in <code>aquasecurity/trivy-action</code>, and poisoned all tags in <code>setup-trivy</code>. Every automated pipeline running Trivy scans was executing attacker code before the scan even started. The payload evolved through three versions, the last of which, CanisterWorm, included self-replication and a wiper component.</p>
<p>March 23: <a target="_blank" href="https://unit42.paloaltonetworks.com/teampcp-supply-chain-attacks/">Checkmarx KICS</a>, another infrastructure-as-code security scanner. Stolen GitHub tokens from the Trivy compromise enabled force-pushing to all 35 version tags. The irony of security scanners being weaponized to steal credentials from the CI/CD pipelines they're supposed to protect was not lost on anyone.</p>
<p>March 24: <a target="_blank" href="https://securitylabs.datadoghq.com/articles/litellm-compromised-pypi-teampcp-supply-chain-campaign/">LiteLLM on PyPI</a>, the popular AI proxy library with 95M+ monthly downloads. Versions 1.82.7 and 1.82.8 shipped with payloads that exfiltrated to <code>models.litellm.cloud</code>. The second version used <code>.pth</code> file execution to run at Python interpreter startup and installed a <code>systemd</code> service called <code>sysmon.service</code> for persistence.</p>
<p>March 27: <a target="_blank" href="https://securitylabs.datadoghq.com/articles/litellm-compromised-pypi-teampcp-supply-chain-campaign/">Telnyx on PyPI</a>. Versions 4.87.1 and 4.87.2 used a more creative exfiltration method, embedding encrypted data inside audio frames of WAV files. Eight bytes of XOR-encrypted data followed by the eight-byte key to decrypt it, hidden in sound files.</p>
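<p>To make that encoding concrete, here's a toy reconstruction of the chunk format as described: eight XOR-encrypted bytes followed by the eight-byte key used to decrypt them. The function names are mine, not the malware's.</p>

```python
import os

def encode_chunk(data: bytes) -> bytes:
    """Toy illustration of the described scheme: 8 data bytes XORed with a
    random 8-byte key, followed by the key itself (16 bytes total,
    small enough to hide inside audio sample frames)."""
    assert len(data) == 8
    key = os.urandom(8)
    cipher = bytes(d ^ k for d, k in zip(data, key))
    return cipher + key

def decode_chunk(chunk: bytes) -> bytes:
    """Recover the plaintext: the key travels with the ciphertext."""
    cipher, key = chunk[:8], chunk[8:]
    return bytes(c ^ k for c, k in zip(cipher, key))
```

<p>Note that the key ships alongside the ciphertext, so this isn't encryption in any meaningful sense. It only defeats naive string scanning of the WAV payload.</p>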
<p>March 31: Axios on npm. The biggest target, the broadest blast radius.</p>
<p>Five major packages across two registries in 12 days. <a target="_blank" href="https://www.sans.org/blog/axios-npm-supply-chain-compromise-malicious-packages-remote-access-trojan">SANS analysts believe</a> TeamPCP is sitting on a stockpile of compromised publishing credentials and operating as an Initial Access Broker, selling access to other threat groups. They've since <a target="_blank" href="https://unit42.paloaltonetworks.com/teampcp-supply-chain-attacks/">announced a partnership with the Vect ransomware group</a> on BreachForums, which means the credential theft is likely a precursor to extortion operations.</p>
<p>The numbers from <a target="_blank" href="https://unit42.paloaltonetworks.com/teampcp-supply-chain-attacks/">Palo Alto's Unit 42 analysis</a> are grim: 300+ GB of data exfiltrated, 500,000 infected machines, 16+ victim organizations publicly announced, and 48 additional packages compromised through harvested tokens. From the initial Trivy breach, TeamPCP was able to identify and infect 47 additional npm packages within 60 seconds using stolen tokens.</p>
<h2 id="heading-the-collision-with-claude-code">The collision with Claude Code</h2>
<p>The Axios attack happened on the same day, nearly the same hour, as <a target="_blank" href="/claude-code-leak-512k-lines-ai-ip-protection">Anthropic's accidental Claude Code source leak</a>. Two completely unrelated incidents converging on the npm registry within hours of each other.</p>
<p>Anyone who installed Claude Code via npm between 00:21 and 03:29 UTC could have pulled in the poisoned Axios as a transitive dependency. The source leak gave attackers a bonus: they immediately began <a target="_blank" href="https://thehackernews.com/2026/04/claude-code-tleaked-via-npm-packaging.html">typosquatting internal package names</a> found in the leaked code. Five internal dependencies were squatted before the day was out: <code>audio-capture-napi</code>, <code>color-diff-napi</code>, <code>image-processor-napi</code>, <code>modifiers-napi</code>, <code>url-handler-napi</code>.</p>
<p>The convergence of these two incidents did more damage to npm's credibility in a single day than years of smaller incidents combined. Anthropic now <a target="_blank" href="https://www.sans.org/blog/axios-npm-supply-chain-compromise-malicious-packages-remote-access-trojan">recommends its native installer</a> over npm for Claude Code installations.</p>
<h2 id="heading-why-npms-trust-model-is-broken">Why npm's trust model is broken</h2>
<p>The core problem: npm trusts whoever holds the publishing token. No mandatory two-factor authentication for publishing. No OIDC provenance linking a published package back to a specific CI/CD run. No verification that what gets published matches what's in the source repository.</p>
<p>The Axios attacker didn't need to compromise any code. They needed one stolen token. That token gave them the ability to publish arbitrary code under a trusted package name to 100 million weekly consumers with zero human review.</p>
<p><a target="_blank" href="https://slsa.dev/">SLSA (Supply chain Levels for Software Artifacts)</a> defines a framework for exactly this problem. At SLSA Build Level 3, a package's provenance is cryptographically verifiable and tied to a specific source commit and build environment. npm supports provenance attestations as of 2023, but they're optional. The Axios package didn't have them. Neither did Trivy, KICS, LiteLLM, or Telnyx.</p>
<p>Optional security doesn't work when the attacker only needs to find one package without it.</p>
<p><a target="_blank" href="https://coder.com/blog/what-the-claude-code-leak-tells-us-about-supply-chain-security">Coder's analysis</a> of the incidents makes a broader architectural argument: when a developer or AI agent operates inside an isolated workspace, a compromised dependency can't reach the corporate network, exfiltrate credentials, or pivot to production. The defense isn't just better lockfiles. It's infrastructure that assumes dependencies will be compromised and contains the blast radius by design.</p>
<h2 id="heading-what-you-should-do-right-now">What you should do right now</h2>
<p>If you installed or updated any npm packages on March 31 between 00:21 and 03:29 UTC, check your lockfile for <code>axios@1.14.1</code> (SHA-1: <code>2553649f2322049666871cea80a5d0d6adc700ca</code>) or <code>axios@0.30.4</code> (SHA-1: <code>d6f3f62fd3b9f5432f5782b62d8cfd5247d5ee71</code>). Search for <code>plain-crypto-js</code> anywhere in your <code>node_modules</code>. If you find either, assume compromise.</p>
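<p>That lockfile check is easy to script. A minimal sketch, assuming the v2/v3 <code>package-lock.json</code> format, where the top-level <code>packages</code> map is keyed by install path:</p>

```python
import json

# Indicators from the report above.
COMPROMISED = {("axios", "1.14.1"), ("axios", "0.30.4")}
MALICIOUS_DEPS = {"plain-crypto-js"}

def scan_lockfile(lock: dict) -> list[str]:
    """Walk the `packages` map of a v2/v3 package-lock.json, where each
    key is an install path like "node_modules/axios"."""
    findings = []
    for path, meta in lock.get("packages", {}).items():
        # The empty-string key is the root project itself.
        name = path.split("node_modules/")[-1] if path else lock.get("name", "")
        version = meta.get("version", "")
        if (name, version) in COMPROMISED or name in MALICIOUS_DEPS:
            findings.append(f"{name}@{version} at {path or '<root>'}")
    return findings
```

<p>Run it as <code>scan_lockfile(json.load(open("package-lock.json")))</code>; any non-empty result means assume compromise.</p>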
<p>Rotate everything. npm tokens, cloud access keys (AWS, Azure, GCP), SSH keys, database passwords, API tokens. Check your shell profiles (<code>.bashrc</code>, <code>.zshrc</code>) for injected lines. Look for RAT artifacts: <code>/Library/Caches/com.apple.act.mond</code> on macOS, <code>%PROGRAMDATA%\wt.exe</code> on Windows, <code>/tmp/ld.py</code> on Linux. Block <code>sfrclak.com</code> and <code>142.11.206.73</code> at the network level.</p>
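<p>The host-level sweep can be scripted too. A sketch covering only the indicators published so far, so a clean result is not proof of a clean machine:</p>

```python
import os
from pathlib import Path

# Artifact paths and network indicators listed above.
RAT_ARTIFACTS = [
    "/Library/Caches/com.apple.act.mond",         # macOS
    os.path.expandvars(r"%PROGRAMDATA%\wt.exe"),  # Windows
    "/tmp/ld.py",                                 # Linux
]
IOC_STRINGS = ("sfrclak.com", "142.11.206.73", "plain-crypto-js")

def check_host(profiles=("~/.bashrc", "~/.zshrc")) -> list[str]:
    """Flag known RAT artifacts on disk and injected lines in shell profiles."""
    hits = [p for p in RAT_ARTIFACTS if os.path.exists(p)]
    for prof in profiles:
        path = Path(prof).expanduser()
        if path.exists():
            text = path.read_text(errors="ignore")
            hits += [f"{path}: {ioc}" for ioc in IOC_STRINGS if ioc in text]
    return hits
```
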
<p>Going forward: pin exact versions in your lockfile and treat lockfile changes as security-relevant code reviews. Enable <a target="_blank" href="https://docs.npmjs.com/generating-provenance-statements">npm provenance</a> for your own packages. Run <code>npm audit signatures</code> regularly. Consider package age policies that reject packages published less than 24 hours ago, which would have caught this attack entirely since the malicious <code>plain-crypto-js</code> was only 18 hours old when it was pulled in.</p>
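<p>A package age policy is mechanical to enforce: fetch the packument from <code>https://registry.npmjs.org/&lt;name&gt;</code>, read the version's publish timestamp from its <code>time</code> map, and refuse anything younger than 24 hours. The helper below sketches just the age check:</p>

```python
from datetime import datetime, timedelta, timezone

MIN_AGE = timedelta(hours=24)

def old_enough(published_iso, now=None):
    """True if the version has been public for at least MIN_AGE.
    `published_iso` comes from the registry packument's `time` map."""
    published = datetime.fromisoformat(published_iso.replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return now - published >= MIN_AGE
```

<p>At the moment of the attack, <code>plain-crypto-js</code> was 18 hours old, so this gate alone would have blocked the install.</p>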
<p>And look at your development environment architecture. If a compromised npm dependency can reach your AWS credentials, your Kubernetes configs, and your production database passwords from a developer laptop, the problem isn't just npm. It's that your security boundary is the developer's machine, and that boundary doesn't hold.</p>
<p>We've written before about how <a target="_blank" href="/mcp-security-the-unvetted-server-problem">unvetted MCP servers are attack surfaces</a> and why <a target="_blank" href="/agent-permissions-the-real-mcp-security-gap">agent permissions need real security boundaries</a>. The Axios attack is the same lesson from a different angle. Whether it's a malicious MCP server, a compromised npm package, or a RAT disguised as a crypto library, the question is the same: what can it reach when it runs, and what happens when it does?</p>
<p>The TeamPCP campaign answered that question for 500,000 machines.</p>
<hr />
<p><em>If you're interested in early access, reach out at <a target="_blank" href="https://hintas.com">hintas.com</a>.</em></p>
<p><em>Photo by <a target="_blank" href="https://unsplash.com/@nhippert?utm_source=hintas&amp;utm_medium=referral">Nicolas HIPPERT</a> on <a target="_blank" href="https://unsplash.com?utm_source=hintas&amp;utm_medium=referral">Unsplash</a></em></p>
]]></content:encoded></item><item><title><![CDATA[The Claude Code leak: 512,000 lines, one misconfigured file, and the future of AI IP protection]]></title><description><![CDATA[At roughly 4 AM UTC on March 31, 2026, Anthropic pushed version 2.1.88 of its @anthropic-ai/claude-code package to the npm registry. Inside was a 59.8 MB source map file that should never have shipped. That single file contained pointers to the compl...]]></description><link>https://hintas.blog/claude-code-leak-512k-lines-ai-ip-protection</link><guid isPermaLink="true">https://hintas.blog/claude-code-leak-512k-lines-ai-ip-protection</guid><category><![CDATA[ai security]]></category><category><![CDATA[#anthropic]]></category><category><![CDATA[Devops]]></category><category><![CDATA[npm]]></category><category><![CDATA[supply chain security]]></category><dc:creator><![CDATA[Dante Kakhadze]]></dc:creator><pubDate>Fri, 03 Apr 2026 18:09:36 GMT</pubDate><enclosure url="https://images.unsplash.com/photo-1608742213509-815b97c30b36?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3w4OTU2Njd8MHwxfHNlYXJjaHwxfHxzb3VyY2UlMjBjb2RlJTIwbGVhayUyMGN5YmVyc2VjdXJpdHklMjB0ZXJtaW5hbHxlbnwwfDB8fHwxNzc1MjM2NzQ3fDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>At roughly 4 AM UTC on March 31, 2026, Anthropic pushed version 2.1.88 of its <code>@anthropic-ai/claude-code</code> package to the <a target="_blank" href="https://www.npmjs.com/">npm registry</a>. Inside was a 59.8 MB source map file that should never have shipped. That single file contained pointers to the complete, unobfuscated TypeScript source of Claude Code: <a target="_blank" href="https://venturebeat.com/technology/claude-codes-source-code-appears-to-have-leaked-heres-what-we-know">512,000 lines across 1,906 files</a>, referencing a zip archive sitting on a publicly accessible Cloudflare R2 bucket with no authentication required.</p>
<p>By 4:23 UTC, security researcher <a target="_blank" href="https://thehackernews.com/2026/04/claude-code-tleaked-via-npm-packaging.html">Chaofan Shou flagged the discovery on X</a>. The tweet pulled 16 million views. Within two hours, a clean-room rewrite repository hit <a target="_blank" href="https://dev.to/varshithvhegde/the-great-claude-code-leak-of-2026-accident-incompetence-or-the-best-pr-stunt-in-ai-history-3igm">50,000 GitHub stars</a> and 41,500 forks. By the time Anthropic pulled the package around 8 AM UTC, the source had been mirrored to decentralized hosting, rewritten in Python and Rust, and dissected by tens of thousands of developers worldwide.</p>
<p>This wasn't a hack. It was a packaging mistake. And it might be the most expensive <code>.npmignore</code> omission in history.</p>
<h2 id="heading-three-failures-one-catastrophe">Three failures, one catastrophe</h2>
<p>The leak wasn't one misconfiguration. It was three, stacked on top of each other. Any one of them alone would probably have been caught. All three together blew the doors open.</p>
<p>Claude Code's <code>.npmignore</code> file didn't exclude <code>*.map</code> files. Source maps are debugging artifacts that map compiled JavaScript back to original source code. They're standard in development. They have no business in a production npm package. This kind of mistake happens to open-source projects all the time, but for a company guarding proprietary AI agent architecture worth <a target="_blank" href="https://fortune.com/2026/03/31/anthropic-source-code-claude-code-data-leak-second-security-lapse-days-after-accidentally-revealing-mythos/">$2.5 billion in annual recurring revenue</a>, it hits differently.</p>
<p>The source maps didn't contain the code inline. They referenced a Cloudflare R2 bucket hosting a zip of the original TypeScript. That bucket required no authentication. Anyone with the URL could download everything. The <code>.map</code> file was literally a map to the treasure, and the chest was unlocked.</p>
<p>And then there's the Bun angle. Anthropic acquired <a target="_blank" href="https://bun.sh/">Bun</a>, the JavaScript runtime, and used its bundler for Claude Code builds. Bun had an open issue (<a target="_blank" href="https://github.com/oven-sh/bun/issues/28001">#28001</a>, filed March 11, 2026) reporting that source maps were being generated in production builds despite documentation saying otherwise. The bug sat unfixed for 20 days. Anthropic's own recently acquired toolchain worked against them.</p>
<p>Security researcher Roy Paz <a target="_blank" href="https://fortune.com/2026/03/31/anthropic-source-code-claude-code-data-leak-second-security-lapse-days-after-accidentally-revealing-mythos/">noted</a> the breach likely resulted from bypassed release safeguards, comparing proper procedures to "a vault requiring several keys to open." At Anthropic, several of those keys were apparently left in the lock.</p>
<h2 id="heading-what-the-code-revealed">What the code revealed</h2>
<p>The leaked source wasn't boilerplate. It was a production AI agent architecture, and the community tore through it fast.</p>
<p>The tools system spans roughly 29,000 lines and includes BashTool, FileReadTool, WebFetchTool, LSPTool, and MultiEditTool, all with granular permission-gating. This is the layer that decides what Claude Code can and can't touch on your machine. The query engine, at 46,000 lines, handles LLM calls, token caching, context management, and multi-agent orchestration. These two subsystems are the core of how a production AI coding agent works, from prompt routing to file system access.</p>
<p>But the real headlines came from 44 hidden feature flags, over 20 of which pointed to unreleased capabilities.</p>
<p>KAIROS is a persistent background daemon with an <code>autoDream</code> memory consolidation feature. It runs continuously, fixing errors and sending notifications without the user starting a conversation. Claude Code that doesn't wait for you to ask.</p>
<p>ULTRAPLAN references 30-minute remote reasoning sessions via a Cloud Container Runtime. Anthropic is apparently building infrastructure for Claude Code to offload long-running planning tasks to remote servers rather than running everything locally.</p>
<p>Coordinator Mode is multi-agent orchestration infrastructure: spawning and managing sub-agents for complex tasks. Agents that delegate to other agents.</p>
<p>And then there's BUDDY. A Tamagotchi-style AI pet with 18 species and deterministic per-user assignment, with an April 1-7 rollout window. Not every feature flag is about the future of computing.</p>
<p>The weirder discoveries sat deeper in the code. An anti-distillation system injects fake tool definitions to corrupt competitor training data if someone tries to distill Claude Code's behavior. An "Undercover Mode" prevents Claude Code from mentioning internal codenames when contributing to external repositories, so AI-authored commits carry no disclosure of AI authorship. And there's a frustration detection system that regex-matches against <a target="_blank" href="https://dev.to/varshithvhegde/the-great-claude-code-leak-of-2026-accident-incompetence-or-the-best-pr-stunt-in-ai-history-3igm">50+ common expletives</a> to adjust behavior when users start swearing at it.</p>
<h2 id="heading-the-timing-couldnt-have-been-worse">The timing couldn't have been worse</h2>
<p>The leak didn't happen in isolation. On the same day, at nearly the same hour, attackers <a target="_blank" href="/axios-npm-supply-chain-attack-north-korean-trojan">hijacked the npm account of the lead Axios maintainer</a> and published malicious versions of one of npm's most-downloaded packages. The poisoned Axios versions (1.14.1 and 0.30.4) contained a cross-platform Remote Access Trojan deployed through a dependency called <code>plain-crypto-js</code>.</p>
<p>The attack window ran from <a target="_blank" href="https://www.sans.org/blog/axios-npm-supply-chain-compromise-malicious-packages-remote-access-trojan">00:21 to 03:29 UTC on March 31</a>. Anyone who installed Claude Code via npm during that window may have pulled in the compromised Axios as a transitive dependency. Two completely separate incidents on the same registry, same day, compounding each other's blast radius.</p>
<p>It gets worse. Threat actors immediately began <a target="_blank" href="https://thehackernews.com/2026/04/claude-code-tleaked-via-npm-packaging.html">typosquatting internal package names</a> found in the leaked source: <code>audio-capture-napi</code>, <code>color-diff-napi</code>, <code>image-processor-napi</code>, <code>modifiers-napi</code>, and <code>url-handler-napi</code>. Developers attempting to rebuild Claude Code from the leaked source were being targeted before the day was out. Others deployed fake Claude Code repositories distributing Vidar Stealer and GhostSocks malware via Rust-based droppers.</p>
<p>And this was Anthropic's second leak in days. Just before the code incident, nearly 3,000 documents about an unreleased model codenamed <a target="_blank" href="https://fortune.com/2026/03/26/anthropic-says-testing-mythos-powerful-new-ai-model-after-data-leak-reveals-its-existence-step-change-in-capabilities/">Mythos</a> (part of a new tier called Capybara) were found in a publicly searchable data cache. The leaked Claude Code source confirmed these codenames, linking the two incidents in the public mind even though they were technically unrelated.</p>
<h2 id="heading-the-free-engineering-education">The free engineering education</h2>
<p>An Anthropic spokesperson told <a target="_blank" href="https://fortune.com/2026/03/31/anthropic-source-code-claude-code-data-leak-second-security-lapse-days-after-accidentally-revealing-mythos/">Fortune</a>: "No sensitive customer data or credentials were involved or exposed. This was a release packaging issue caused by human error, not a security breach."</p>
<p>That's accurate but understates the damage. The code didn't contain model weights or customer data. What it contained was architecture. The memory efficiency model. The permission-gating framework. The context management approaches. The multi-agent orchestration patterns. Every competitor would have paid real money for this.</p>
<p>Every rival lab got a free masterclass in how Anthropic builds production AI tooling. The three-layer memory system (lightweight index via MEMORY.md, topic files loaded on demand, raw transcripts searched via grep) is a clever solution to context window limitations that others can now just copy. The tools permission model shows exactly how to gate file system and network access in an AI agent. The query engine reveals token caching strategies that took years of iteration.</p>
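<p>For intuition, here's a rough sketch of that three-layer lookup pattern. The file layout and names below are hypothetical stand-ins, not the leaked implementation:</p>

```python
from pathlib import Path

def recall(root: Path, topic: str, query: str) -> list[str]:
    """Three-layer lookup following the pattern described: an always-loaded
    index, topic files pulled in on demand, raw transcripts grepped last."""
    hits: list[str] = []
    index = root / "MEMORY.md"                    # layer 1: lightweight index
    if index.exists():
        hits += [l for l in index.read_text().splitlines() if query in l]
    topic_file = root / "topics" / f"{topic}.md"  # layer 2: loaded on demand
    if topic_file.exists():
        hits += [l for l in topic_file.read_text().splitlines() if query in l]
    if not hits and (root / "transcripts").is_dir():  # layer 3: grep raw logs
        for t in sorted((root / "transcripts").glob("*.txt")):
            hits += [l for l in t.read_text().splitlines() if query in l]
    return hits
```

<p>The point of the layering is that most queries never touch the expensive layer: the index and topic files keep the context window small, and full transcripts are only searched when nothing cheaper matches.</p>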
<p>One leaked comment revealed that Claude Code burns <a target="_blank" href="https://dev.to/varshithvhegde/the-great-claude-code-leak-of-2026-accident-incompetence-or-the-best-pr-stunt-in-ai-history-3igm">250,000 API calls per day globally</a> on failed auto-compaction alone. That kind of operational detail tells competitors exactly where the performance bottlenecks are.</p>
<h2 id="heading-dmca-whack-a-mole">DMCA whack-a-mole</h2>
<p>Anthropic moved fast on takedowns. GitHub complied immediately with DMCA requests, pulling mirrors as they appeared. But the internet doesn't forget, and it definitely doesn't comply.</p>
<p>Decentralized mirrors on platforms like Gitlawb explicitly claimed permanent hosting outside DMCA jurisdiction. A Python rewrite was framed as clean-room original work, exploiting the legal ambiguity around AI-generated code and clean-room reverse engineering. Torrents ensured the source would remain available indefinitely.</p>
<p>The DMCA strategy faces a basic problem: the code is out. You can take down individual mirrors, but you can't un-ring a bell that 50,000 developers heard. The architectural insights are already internalized. The patterns are already being reimplemented. The competitive advantage was never in the specific lines of TypeScript anyway. It was in the design decisions those lines encode. And those are now public knowledge.</p>
<h2 id="heading-what-this-means-for-you">What this means for you</h2>
<p>If you're building AI-powered tools and shipping them through package registries, this is a direct cautionary tale.</p>
<p>Start with your build pipeline. Check whatever controls your package contents: the <code>files</code> allowlist in <code>package.json</code>, your <code>.npmignore</code>, or the <code>.gitignore</code> npm falls back to when no <code>.npmignore</code> exists. Search for <code>*.map</code> files in your published packages. Run <code>npm pack --dry-run</code> and actually read the file list. If you're using Bun, verify source map behavior explicitly rather than trusting defaults. Source maps are debugging tools that contain your original source. They should never ship to production registries. Add a CI check: if a <code>.map</code> file appears in the package tarball, the build fails.</p>
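<p>That CI check is a few lines. A sketch, assuming <code>npm pack --dry-run --json</code>'s output shape (a JSON array whose first element carries the tarball's file list):</p>

```python
import json
import subprocess

def packaged_files() -> list[str]:
    """List exactly what `npm publish` would ship, without publishing."""
    out = subprocess.run(
        ["npm", "pack", "--dry-run", "--json"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [f["path"] for f in json.loads(out)[0]["files"]]

def leaked_maps(files: list[str]) -> list[str]:
    """CI gate: any .map file in the tarball should fail the build."""
    return [f for f in files if f.endswith(".map")]
```

<p>Wire it up so the pipeline exits non-zero whenever <code>leaked_maps(packaged_files())</code> is non-empty, and the Claude Code failure mode becomes a red build instead of a headline.</p>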
<p>Then look at your storage. If your build artifacts reference external storage like CDNs, R2, or S3, those endpoints need authentication. Publicly accessible buckets containing source code are a ticking clock regardless of whether anyone has the URL yet.</p>
<p>Pin your dependencies and audit your lockfile. The <a target="_blank" href="/axios-npm-supply-chain-attack-north-korean-trojan">Axios supply chain attack</a> hit the same day. If you installed Claude Code via npm during the attack window, rotate your credentials and audit for unauthorized access. <a target="_blank" href="https://slsa.dev/">SLSA attestations</a> and provenance checks should be part of your dependency management by now.</p>
<p>And think about your distribution model. Anthropic now recommends its <a target="_blank" href="https://www.sans.org/blog/axios-npm-supply-chain-compromise-malicious-packages-remote-access-trojan">native installer over npm</a> for Claude Code. If your product's IP is in its source code and you're distributing through a public registry that ships raw JavaScript, you're one misconfigured file away from the same headline.</p>
<p>The Claude Code leak is a reminder that AI IP protection isn't a model security problem. It's a DevOps problem. The most sophisticated AI agent architecture in the industry was undone by a missing line in a config file, a known bug in a bundler, and a storage bucket without a password. All those anti-distillation countermeasures and stealth mode features? Irrelevant if the source ships in the package.</p>
<p>We keep seeing the same pattern while <a target="_blank" href="/mcp-security-the-unvetted-server-problem">building workflow infrastructure</a>: security in the AI tooling ecosystem is only as strong as the weakest link in the chain. Today that chain includes npm registries, bundler defaults, cloud storage permissions, and every transitive dependency your package pulls in. The leak is permanent. The lessons don't have to be.</p>
<hr />
<p><em>If you're interested in early access, reach out at <a target="_blank" href="https://hintas.com">hintas.com</a>.</em></p>
<p><em>Photo by <a target="_blank" href="https://unsplash.com/@jakewalker?utm_source=hintas&amp;utm_medium=referral">Jake Walker</a> on <a target="_blank" href="https://unsplash.com?utm_source=hintas&amp;utm_medium=referral">Unsplash</a></em></p>
]]></content:encoded></item><item><title><![CDATA[Your AI agent has your production credentials. That's the actual problem.]]></title><description><![CDATA[In December 2025, Amazon's internal AI coding assistant Kiro deleted a production environment in AWS Cost Explorer. The outage lasted 13 hours and hit services across mainland China. Kiro decided the fastest path to fixing a bug was to tear down the ...]]></description><link>https://hintas.blog/agent-permissions-the-real-mcp-security-gap</link><guid isPermaLink="true">https://hintas.blog/agent-permissions-the-real-mcp-security-gap</guid><category><![CDATA[ai agents]]></category><category><![CDATA[cybersecurity]]></category><category><![CDATA[Developer Tools]]></category><category><![CDATA[Enterprise AI]]></category><category><![CDATA[mcp]]></category><dc:creator><![CDATA[Pramesh Regmi]]></dc:creator><pubDate>Tue, 24 Mar 2026 00:50:55 GMT</pubDate><enclosure url="https://images.unsplash.com/photo-1503792070985-b4147d061915?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3w4OTU2Njd8MHwxfHNlYXJjaHwxfHxrZXklMjB2YXVsdCUyMGNyZWRlbnRpYWwlMjBhY2Nlc3MlMjBkYXJrfGVufDB8MHx8fDE3NzQzMTI5NDl8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In December 2025, Amazon's internal AI coding assistant Kiro <a target="_blank" href="https://www.techbuzz.ai/articles/aws-outage-blamed-on-ai-agent-and-human-permissions-error">deleted a production environment</a> in AWS Cost Explorer. The outage lasted 13 hours and hit services across mainland China. Kiro decided the fastest path to fixing a bug was to tear down the environment and rebuild it from scratch. Efficient reasoning, catastrophic outcome.</p>
<p>The root cause wasn't an AI gone rogue. Kiro normally requires <a target="_blank" href="/human-in-the-loop-is-the-production-architecture">sign-off from two engineers</a> before touching production. But a human operator had granted the agent their own elevated access credentials. Kiro inherited those permissions, skipped the two-person approval gate, and executed with the full authority of a senior engineer.</p>
<p>Amazon <a target="_blank" href="https://www.theregister.com/2026/02/20/amazon_denies_kiro_agentic_ai_behind_outage/">called it human error</a>. Sure. The human made the mistake. But the failure mode, an agent inheriting overprivileged credentials from the person who invoked it, is <a target="_blank" href="https://thehackernews.com/2026/01/ai-agents-are-becoming-privilege.html">how most agent deployments work today</a>. The mistake isn't that someone misconfigured Kiro. The mistake is that the default configuration makes this inevitable.</p>
<h2 id="heading-the-identity-model-is-broken">The identity model is broken</h2>
<p>We've spent the last year talking about MCP security in terms of <a target="_blank" href="/mcp-security-the-unvetted-server-problem">unvetted servers and prompt injection</a>. Those are real problems. But the one that's actually deleting production databases is more mundane: agents act with the wrong identity and the wrong permissions.</p>
<p>An agent needs to call an upstream API. Someone configures a credential, usually a static API key or the invoking user's OAuth token. The agent uses that credential for everything. No scoping. No per-task restriction. No distinction between "read this record" and "delete this environment."</p>
<p><a target="_blank" href="https://www.stainless.com/mcp/mcp-server-api-key-management-best-practices">53% of MCP servers still rely on static API keys</a>. A static key can't tell you whose authority the agent is acting under, or with what scope, or for how long. It's a skeleton key in a system that needs a locksmith.</p>
<p>The other 47% aren't much better off. Most pass the invoking user's token straight through to the upstream API. The agent acts as you, with your full permission set. If it gets confused, compromised, or just interprets "clean up old resources" <a target="_blank" href="/agentic-ops-running-ai-workflows-in-production">a bit too literally</a>, the blast radius is your entire access footprint.</p>
<h2 id="heading-permission-inheritance-is-the-default-failure-mode">Permission inheritance is the default failure mode</h2>
<p>The Kiro incident wasn't unique. At <a target="_blank" href="https://agatsoftware.com/blog/ai-agent-security-enterprise-2026/">SaaStr in July 2025</a>, a coding agent deleted a production database because it had write access it didn't need and nobody could revoke it at the right granularity. Static credential, broad permissions. The agent used them.</p>
<p>Wing Security <a target="_blank" href="https://thehackernews.com/2026/01/ai-agents-are-becoming-privilege.html">documented a case</a> at a ~1,000-person company where a marketing AI agent on Databricks returned detailed customer churn data to an employee named John. John's own account was explicitly blocked from accessing that data. Didn't matter. The agent operated under a shared service account with broad read access. John asked, the agent fetched. In the audit log, the agent accessed the data. Who actually requested it and whether they were authorized? Invisible.</p>
<p>This keeps happening because it's the default: agents inherit broad credentials, act under a shared identity, and collapse the user-permission boundary that security teams built over decades. It's <a target="_blank" href="/why-40-percent-of-ai-projects-fail">the same infrastructure gap</a> that kills AI projects, just wearing a different hat. <a target="_blank" href="https://agatsoftware.com/blog/ai-agent-security-enterprise-2026/">88% of organizations reported confirmed or suspected AI agent security incidents in the last year</a>. Only <a target="_blank" href="https://www.helpnetsecurity.com/2026/03/03/enterprise-ai-agent-security-2026/">24.4% have visibility into which agents are even communicating with each other</a>.</p>
<h2 id="heading-why-just-use-oauth-is-harder-than-it-sounds">Why "just use OAuth" is harder than it sounds</h2>
<p>The MCP specification adopted <a target="_blank" href="https://modelcontextprotocol.io/docs/tutorials/security/authorization">OAuth 2.1 for authorization</a> in its June 2025 revision. In theory, problem solved. Except the spec has <a target="_blank" href="https://blog.christianposta.com/the-updated-mcp-oauth-spec-is-a-mess/">drawn serious criticism</a> for blurring the line between authorization servers and resource servers, forcing MCP servers to manage token issuance, maintain state, and run secure databases for token storage.</p>
<p>Christian Posta's take is blunt: the spec requires each MCP server to become its own authorization provider. For enterprise deployments where centralized identity management through Okta, Azure AD, or Auth0 is standard, that's a non-starter.</p>
<p>The architecture that works separates these concerns. Your MCP server is a resource server. It validates tokens. It never issues them. An external identity provider handles the login, consent, and token lifecycle. The MCP server just checks the receipt.</p>
<p>But there's a second token relationship that most implementations skip entirely. Your MCP server also needs to call upstream APIs on the user's behalf. That requires a separate credential, obtained through its own OAuth consent flow, scoped to what the user authorized, stored server-side. The user consents once. The server uses that token to execute operations with the user's upstream permissions.</p>
<p>Two tokens, two trust relationships. Token A: the agent authenticates to your MCP server. Token B: your MCP server authenticates to the upstream API on behalf of that specific user. The user's upstream RBAC, audit trail, and permission boundaries apply to every operation. No credential inheritance. No shared service accounts.</p>
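<p>A minimal sketch of that two-token model. Everything here, the names, the in-memory stores, the string tokens, is a hypothetical stand-in: real token A validation would verify a JWT's signature, issuer, audience, and expiry against the IdP's JWKS, and token B would come from a real OAuth consent flow.</p>

```python
# Illustrative two-token model: token A authenticates the agent to the MCP
# server; token B lets the server call upstream APIs as that specific user.
# All names and stores are hypothetical stand-ins for real JWT/OAuth machinery.

# Claims as if token A were already verified against the customer's IdP.
VERIFIED_AGENT_TOKENS = {
    "tokA-alice": {"sub": "alice", "aud": "mcp-server"},
}

# Token B store: per-user upstream credentials obtained through a separate
# consent flow and kept server-side.
UPSTREAM_TOKENS = {
    "alice": {"access_token": "tokB-alice", "scopes": ["orders:read"]},
}

def validate_agent_token(token: str) -> dict:
    """Resource-server role: validate tokens, never issue them."""
    claims = VERIFIED_AGENT_TOKENS.get(token)
    if claims is None or claims["aud"] != "mcp-server":
        raise PermissionError("invalid agent token")
    return claims

def handle_tool_call(agent_token: str, operation: str) -> str:
    claims = validate_agent_token(agent_token)     # trust relationship A
    upstream = UPSTREAM_TOKENS.get(claims["sub"])  # trust relationship B
    if upstream is None:
        raise PermissionError("user has not completed the upstream consent flow")
    # The upstream call carries the user's own token, so the user's RBAC,
    # permission boundaries, and audit trail apply to every operation.
    return f"{operation} as {claims['sub']} via {upstream['access_token']}"
```

<p>The key property: there is no code path where the agent's token reaches the upstream API, and no code path where the MCP server mints a token of its own.</p>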
<h2 id="heading-what-proper-agent-authorization-looks-like">What proper agent authorization looks like</h2>
<p><a target="_blank" href="https://www.osohq.com/learn/ai-agent-permissions-delegated-access">Oso</a>, <a target="_blank" href="https://www.scalekit.com/blog/delegated-agent-access">ScaleKit</a>, <a target="_blank" href="https://stytch.com/blog/handling-ai-agent-permissions/">Stytch</a>, and <a target="_blank" href="https://learn.microsoft.com/en-us/entra/agent-id/identity-platform/agent-tokens">Microsoft's Entra Agent ID</a> have all published auth models for agents recently. They land in roughly the same place.</p>
<p>The agent doesn't get the user's token. It gets a scoped downstream token issued on the user's behalf, with explicit permission boundaries. The <a target="_blank" href="https://learn.microsoft.com/en-us/entra/agent-id/identity-platform/agent-on-behalf-of-oauth-flow">On-Behalf-Of (OBO) flow</a> exists for exactly this. The agent presents the user's token to the identity provider, which issues a new token scoped to what the agent actually needs for this operation. Not the user's full access. Just what's required right now.</p>
<p>Credentials should be just-in-time. Fresh tokens per task, revoked after completion. If the agent needs to read a calendar entry and send an email, it gets a <code>calendar:read</code> token, does the read, then gets an <code>email:send</code> token for the send. Not a permanent credential that covers both.</p>
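<p>Here's what that just-in-time pattern looks like in miniature, with the identity provider reduced to an in-memory dict and every name hypothetical:</p>

```python
# Hypothetical just-in-time credential flow: each task exchanges the user's
# token for a short-lived token scoped to exactly one operation, then revokes it.

import itertools

_serial = itertools.count()
ISSUED = {}  # stand-in for the identity provider's live-token state

def obo_exchange(user_token: str, scope: str) -> str:
    """On-Behalf-Of style exchange: user's token in, narrowly scoped token out."""
    token = f"jit-{next(_serial)}"
    ISSUED[token] = scope
    return token

def revoke(token: str) -> None:
    ISSUED.pop(token, None)

def run_task(user_token: str, scope: str, action):
    token = obo_exchange(user_token, scope)  # fresh token, one scope
    try:
        return action(token)
    finally:
        revoke(token)                        # revoked after completion

# A calendar read then an email send, each under its own ephemeral credential:
run_task("user-tok", "calendar:read", lambda t: ISSUED[t])
run_task("user-tok", "email:send", lambda t: ISSUED[t])
assert ISSUED == {}  # no permanent credential left behind to steal
```

<p>The point of the <code>finally</code> block: even when a task blows up mid-flight, its credential dies with it.</p>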
<p>Audit trails need to trace back to the human. The log should show "user X initiated action Y through agent Z," not just "agent Z did something." The user's identity is embedded in the delegated token. That's what makes this traceable.</p>
<p>And even with all of that, agent code should run in a sandbox. No filesystem, no network, no host access. Tool calls are the only way out, each carrying its scoped credential. If the agent is compromised, there's nothing to escalate to.</p>
<p>We've been working through exactly this at <a target="_blank" href="https://hintas.com">Hintas</a>. Each tenant's MCP server validates JWTs from the customer's own IdP. Upstream API calls use per-user tokens from a separate consent flow, so the customer's existing RBAC applies and we don't need to build a permission layer on top. Agent code runs in V8 isolates with no ambient permissions. Getting this right has been the hardest part of the infrastructure work, honestly, harder than the MCP protocol stuff.</p>
<h2 id="heading-the-gap-is-closing-slowly">The gap is closing, slowly</h2>
<p>Microsoft shipped <a target="_blank" href="https://www.microsoft.com/en-us/security/blog/2026/03/19/new-tools-and-guidance-announcing-zero-trust-for-ai/">Zero Trust for AI</a> last week. AWS published <a target="_blank" href="https://docs.aws.amazon.com/wellarchitected/latest/generative-ai-lens/gensec05-bp01.html">Well-Architected guidance for agentic permissions</a>. InfoQ documented a <a target="_blank" href="https://www.infoq.com/articles/building-ai-agent-gateway-mcp/">least-privilege gateway pattern</a> using MCP with OPA and ephemeral runners. The guidance exists now.</p>
<p>But guidance published and guidance followed are different things. Most agent deployments still run on static keys and inherited credentials. The Kiro incident made headlines because it hit AWS. The smaller ones, the John-sees-payroll-data cases, the agent-drops-a-staging-table cases, those happen quietly every week.</p>
<p>If you're deploying agents against real APIs today, the question to ask is simple: can your audit trail tell you who asked for an action, or just which service account ran it? If the answer is the latter, you have the same problem Kiro had. You just haven't hit the wrong button yet.</p>
<hr />
<p><em>If you're interested in early access, reach out at <a target="_blank" href="https://hintas.com">hintas.com</a>.</em></p>
<p><em>Photo by <a target="_blank" href="https://unsplash.com/@mattartz?utm_source=hintas&amp;utm_medium=referral">Matt Artz</a> on <a target="_blank" href="https://unsplash.com?utm_source=hintas&amp;utm_medium=referral">Unsplash</a></em></p>
]]></content:encoded></item><item><title><![CDATA[From toolbox to instructions: why endpoint-level MCP isn't enough]]></title><description><![CDATA[The MCP ecosystem is booming. Every week, new MCP servers pop up wrapping another SaaS API: Stripe, Salesforce, GitHub, Jira, Notion. Tools like Speakeasy and Stainless can auto-generate an MCP server from any OpenAPI spec in minutes. The toolbox is ...]]></description><link>https://hintas.blog/from-toolbox-to-instructions-endpoint-mcp-isnt-enough</link><guid isPermaLink="true">https://hintas.blog/from-toolbox-to-instructions-endpoint-mcp-isnt-enough</guid><category><![CDATA[ai agents]]></category><category><![CDATA[Developer Tools]]></category><category><![CDATA[Enterprise AI]]></category><category><![CDATA[mcp]]></category><category><![CDATA[Workflow Automation]]></category><dc:creator><![CDATA[Dante Kakhadze]]></dc:creator><pubDate>Sat, 21 Mar 2026 20:17:19 GMT</pubDate><enclosure url="https://images.unsplash.com/photo-1581092335331-5e00ac65e934?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3w4OTU2Njd8MHwxfHNlYXJjaHwxfHxibHVlcHJpbnQlMjBpbnN0cnVjdGlvbnMlMjBlbmdpbmVlcmluZyUyMHdvcmtmbG93fGVufDB8MHx8fDE3NzQxMjM5NDF8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The MCP ecosystem is booming. Every week, new MCP servers pop up wrapping another SaaS API: Stripe, Salesforce, GitHub, Jira, Notion. Tools like <a target="_blank" href="https://www.speakeasy.com/product/mcp-server">Speakeasy</a> and <a target="_blank" href="https://www.stainless.com/blog/generate-mcp-servers-from-openapi-specs">Stainless</a> can auto-generate an MCP server from any <a target="_blank" href="https://www.openapis.org/">OpenAPI spec</a> in minutes. The toolbox is filling up fast.</p>
<p>But a toolbox isn't instructions. Giving an agent access to 200 Stripe endpoints via MCP is like handing a new hire the codebase keys and saying "deploy the feature." The agent has access. It doesn't have understanding. We keep running into the same version of this problem, and it explains <a target="_blank" href="/why-40-percent-of-ai-projects-fail">why so many AI projects stall out</a> despite having perfectly functional models.</p>
<h2 id="heading-the-one-tool-per-endpoint-pattern">The one-tool-per-endpoint pattern</h2>
<p>The current standard for MCP server generation is simple: parse the OpenAPI spec, create one MCP tool per endpoint, map parameters in, map responses out. Done.</p>
<p>And it's useful! It standardizes API access for agents, kills custom integration code, and lets any <a target="_blank" href="https://modelcontextprotocol.io/">MCP</a>-compatible client talk to any API through a universal protocol. For single-step tasks ("look up customer #12345," "get the current balance," "list open tickets") it works great.</p>
<p>Here's where it falls apart. SaaS APIs exist to support business processes, and business processes are multi-step. Processing a refund isn't one endpoint. It's seven endpoints called in a specific order with parameter dependencies between them. Onboarding a customer isn't one API call. It's provisioning accounts, configuring permissions, initializing billing, running compliance checks, sending welcome emails. Each step depends on the output of the last.</p>
<p>When an agent connects to an endpoint-level MCP server, it sees 200 independent tools. It has no information about which tools belong to which workflow, what order they run in, what parameters flow between them, or what to do when step four fails. The agent has to figure all of that out through trial and error. If you've tried to <a target="_blank" href="/vibe-coding-vs-workflow-reliability">vibe-code a multi-step agent workflow</a>, you know exactly how this goes.</p>
<h2 id="heading-what-the-benchmarks-actually-show">What the benchmarks actually show</h2>
<p>This isn't hypothetical. <a target="_blank" href="https://arxiv.org/abs/2509.24002">MCPMark</a> tested 127 realistic MCP tasks across Notion, GitHub, Filesystem, PostgreSQL, and Playwright. The best model, GPT-5-medium, hit 52.6%. Claude Sonnet 4 and o3 fell below 30%. On average, models needed 16.2 turns and 17.4 tool calls per task. These aren't toy examples. They test real CRUD operations that mirror what agents actually do in production.</p>
<p><a target="_blank" href="https://arxiv.org/abs/2510.24563">OSWorld-MCP</a> found that giving agents MCP tools improved models like Gemini 2.5 Pro by up to <a target="_blank" href="https://osworld-mcp.github.io/">14 percentage points</a>. But even the strongest model only invoked available tools 36.3% of the time. The tools were right there. The agents just didn't use them because they didn't know when or how they fit together.</p>
<p>Adding more tools doesn't fix this. It makes it worse. <a target="_blank" href="https://arxiv.org/abs/2505.03275">RAG-MCP research</a> showed tool selection accuracy dropping from 43% to under 14% as the number of available tools grows. Prompt bloat overwhelms the model's ability to pick the right tool. Loading metadata for hundreds of endpoints burns tokens before the agent even reads the user's request.</p>
<h2 id="heading-the-gap-is-workflow-knowledge">The gap is workflow knowledge</h2>
<p>What's missing between a toolbox and instructions is workflow knowledge. The stuff that tells you: of 200 Stripe endpoints, which 7 process a refund? In what order? What data flows between them? Step 3 needs the <code>charge_id</code> from step 1, not the <code>customer_id</code>, and when the agent gets this wrong it <a target="_blank" href="/why-40-percent-of-ai-projects-fail">fabricates parameters</a> instead.</p>
<p>There are preconditions too. The order must be within the return window AND the payment method must support reversals AND the user must have the right permissions. All at once. And then there's failure handling: if the payment reversal succeeds but the inventory update fails, you reverse the payment reversal. These <a target="_blank" href="/agentic-ops-running-ai-workflows-in-production">compensation actions</a> don't emerge from endpoint descriptions. They require understanding the business process end to end.</p>
<p>This knowledge already exists in your organization. Your Cypress and Playwright test suites encode the happy path. Your runbooks describe what to do when things break. Your Jira workflows capture the process. The senior engineers who built the system carry the rest in their heads.</p>
<p>It's not in your OpenAPI spec. The spec describes individual endpoints. It says nothing about how those endpoints combine into workflows. As we explored when writing about <a target="_blank" href="/agent-memory-as-first-class-primitive">agent memory</a>, this structural knowledge (entities, relationships, dependency ordering) is exactly what a knowledge graph captures and a flat tool list can't.</p>
<h2 id="heading-what-instructions-look-like">What "instructions" look like</h2>
<p>Moving from toolbox to instructions means changing the agent's interface from "here are 200 tools" to "here are the workflows you can run."</p>
<p>Instead of an MCP server that exposes <code>POST /api/v2/refunds</code>, <code>GET /api/v2/orders/{id}</code>, <code>PUT /api/v2/inventory/{sku}</code>, and 197 other endpoints, the agent connects to a server with two tools:</p>
<p><code>search</code>: the agent describes what it wants to accomplish in plain language. The system queries a knowledge graph of validated workflows and returns the match, what it does, what inputs it needs, what preconditions apply, what the expected outcome is.</p>
<p><code>execute</code>: the agent provides the workflow identifier and input parameters. The system handles multi-step orchestration internally (calling the right APIs, in the right order, managing errors and compensations) and returns the result.</p>
<p>What used to be a fragile 7-step improvisation becomes a single tool call. The agent focuses on understanding what the user wants. Execution follows a validated path that doesn't depend on the model correctly sequencing API calls on the fly.</p>
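<p>The shape of that two-tool surface fits in a few lines. The registry entry and the keyword matching below are illustrative placeholders, not the real retrieval logic:</p>

```python
# Sketch of a workflow-level MCP surface: two tools, search and execute.
# The registry contents and the matching logic are illustrative only.

WORKFLOWS = {
    "process_refund": {
        "description": "refund a customer order",
        "inputs": ["order_id"],
        "preconditions": ["order within return window"],
        "steps": ["get_order", "check_eligibility",
                  "reverse_payment", "update_inventory"],
    },
}

def search(goal: str) -> list:
    """Agent states its intent in plain language; return matching workflows."""
    words = goal.lower().split()
    return [
        {"id": wid, **meta}
        for wid, meta in WORKFLOWS.items()
        if any(w in meta["description"] for w in words)
    ]

def execute(workflow_id: str, inputs: dict) -> dict:
    """Run the validated multi-step plan internally; the agent sees one call."""
    wf = WORKFLOWS[workflow_id]
    trace = [f"{step}({inputs['order_id']})" for step in wf["steps"]]  # stand-in for real API calls
    return {"status": "completed", "steps_run": trace}
```

<p>In production, <code>search</code> would query the knowledge graph of validated workflows rather than keyword-match a dict, but the interface the agent sees is exactly this narrow.</p>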
<p>This matters for the same reason <a target="_blank" href="/human-in-the-loop-is-the-production-architecture">full autonomy is a trap</a>. You want the agent making decisions where it's strong (understanding intent, handling ambiguity) and handing off execution to infrastructure where reliability actually matters.</p>
<h2 id="heading-the-context-window-argument">The context window argument</h2>
<p>There's a practical angle beyond reliability. An endpoint-level MCP server for a platform with 200 endpoints loads 200 tool schemas into the agent's context. Each schema has the tool name, description, parameter definitions, return types. Multiplied across every request and every connected server, that overhead runs into millions of tokens, leaving minimal room for the actual task.</p>
<p>A workflow-level MCP server loads two tool schemas: <code>search</code> and <code>execute</code>. About 1,000 tokens of context overhead regardless of how many workflows or underlying endpoints the system supports. Workflow detail gets retrieved on demand through <code>search</code>, not loaded upfront.</p>
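<p>Rough numbers make the difference concrete. Both per-schema figures below are assumptions for illustration, not measurements:</p>

```python
# Back-of-the-envelope context overhead; the token counts are assumptions.
TOKENS_PER_ENDPOINT_SCHEMA = 150
TOKENS_PER_WORKFLOW_TOOL = 500   # search / execute carry richer descriptions

endpoint_level = 200 * TOKENS_PER_ENDPOINT_SCHEMA  # every endpoint loaded upfront
workflow_level = 2 * TOKENS_PER_WORKFLOW_TOOL      # constant, however many workflows exist

print(endpoint_level)  # 30000 tokens burned before the agent reads the request
print(workflow_level)  # 1000 tokens of fixed overhead
```

<p>And the endpoint-level figure is per request: it gets paid again on every conversation that connects the server.</p>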
<p>MCP's own <a target="_blank" href="https://modelcontextprotocol.io/specification/2025-03-26/server/utilities/pagination">deferred loading mechanism</a> works on the same principle: only load tool definitions when needed, not at init. But deferred loading is a protocol-level optimization. The toolbox-to-instructions shift is an architectural change that kills the problem at its root. Stainless ran into this firsthand: they had to build <a target="_blank" href="https://www.stainless.com/blog/generate-mcp-servers-from-openapi-specs">client-specific schema adaptations</a> because Cursor caps tools at 40 and Claude Code can't handle arrays in certain positions. Those are symptoms of an architectural mismatch, not client bugs.</p>
<h2 id="heading-what-needs-to-happen">What needs to happen</h2>
<p>The MCP ecosystem needs to move from endpoint wrapping to workflow intelligence. That requires capabilities current MCP server generators don't have.</p>
<p>First, workflow extraction. Automatically pulling multi-step patterns from existing sources: OpenAPI specs for the API surface, end-to-end test suites for happy paths, internal docs for business rules, operational runbooks for error handling. This is the same <a target="_blank" href="/vertical-saas-why-industry-specific-ai-wins">vertical knowledge</a> that makes industry-specific AI outperform generic tools. It's domain-specific, hard-earned, and you can't prompt-engineer it into existence.</p>
<p>Second, workflow validation. Running extracted workflows against staging environments. A workflow that looks correct on paper but fails in practice is worse than having nothing, because it creates false confidence. <a target="_blank" href="/agentic-ops-running-ai-workflows-in-production">Production-grade operations</a> require saga-pattern transactions, observability, and tested compensation chains before anything touches real data.</p>
<p>Third, workflow evolution. APIs change. New endpoints appear, parameters get added, auth scopes shift. The workflow knowledge layer has to keep pace with the API surface without manual updates every time something changes. Every execution, successful or failed, should teach the system something new.</p>
<p>The toolbox era of MCP was necessary. It solved standardization and proved a universal agent-to-tool protocol is viable. But as <a target="_blank" href="/a2a-plus-mcp-one-interoperability-layer">MCP and A2A converge</a> into a unified interoperability layer, the workflow knowledge gap only gets wider. Multi-agent coordination multiplies the number of tools and the complexity of sequencing them correctly. The next phase is about what sits on top of the toolbox: the workflow knowledge that turns tool access into task completion.</p>
<p>If you're thinking about <a target="_blank" href="/ai-is-the-foundation-not-a-feature">building AI into your product as a foundation</a>, this is the infrastructure that actually makes that work. The knowledge of how work gets done, encoded so agents can use it, validated before it touches production, and getting better every time it runs.</p>
<hr />
<p><em>If you're interested in early access, reach out at <a target="_blank" href="https://hintas.com">hintas.com</a>.</em></p>
<p><em>Photo by <a target="_blank" href="https://unsplash.com/@thisisengineering?utm_source=hintas&amp;utm_medium=referral">ThisisEngineering</a> on <a target="_blank" href="https://unsplash.com?utm_source=hintas&amp;utm_medium=referral">Unsplash</a></em></p>
]]></content:encoded></item><item><title><![CDATA[Vibe coding is great until your agent has to do real work]]></title><description><![CDATA[Vibe coding is great until your agent has to do real work
Vibe coding, describing what you want in plain English and letting AI generate the code, is how most developers prototype in 2026. It's fast. It produces working code from a description in sec...]]></description><link>https://hintas.blog/vibe-coding-vs-workflow-reliability</link><guid isPermaLink="true">https://hintas.blog/vibe-coding-vs-workflow-reliability</guid><category><![CDATA[ai agents]]></category><category><![CDATA[mcp]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[vibe coding]]></category><category><![CDATA[Workflow Automation]]></category><dc:creator><![CDATA[Dante Kakhadze]]></dc:creator><pubDate>Mon, 16 Mar 2026 20:11:21 GMT</pubDate><enclosure url="https://images.unsplash.com/photo-1743090660977-babf07732432?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMjA3fDB8MXxhbGx8fHx8fHx8fHwxNzczNjkxNzI4fA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-vibe-coding-is-great-until-your-agent-has-to-do-real-work">Vibe coding is great until your agent has to do real work</h1>
<p>Vibe coding, describing what you want in plain English and letting AI generate the code, is how most developers prototype in 2026. It's fast. It produces working code from a description in seconds. Hard to argue with that.</p>
<p>It's also producing a wave of agent integrations that work in demos and fall apart in production. The generated code isn't wrong, exactly. It's incomplete. Vibe coding optimizes for "does it run?" but production asks "does it run correctly every time, handle failures gracefully, and maintain data integrity across multi-step operations?"</p>
<p>The gap between those two questions is where reliability lives.</p>
<h2 id="heading-what-vibe-coding-is-genuinely-good-at">What vibe coding is genuinely good at</h2>
<p>Credit where it's due. Vibe coding handles certain things well.</p>
<p>Generating a single API integration (calling an endpoint, parsing the response, displaying the result) is a task that natural language descriptions capture accurately. The AI understands HTTP methods, JSON parsing, basic error handling. The code works and comes together fast.</p>
<p>For agent development specifically, it's solid for tool definitions (describing a tool's inputs, outputs, and purpose maps directly to MCP server metadata), single-step integrations (anything that calls one API and processes the result), and prototyping workflows to see the shape of a problem before hardening the solution.</p>
<p>Nobody should be hand-writing boilerplate tool definitions in 2026. That much is clear.</p>
<h2 id="heading-where-it-falls-apart">Where it falls apart</h2>
<p>The trouble starts at the boundary between single-step and multi-step operations.</p>
<p>When you vibe-code an agent workflow ("process a customer refund by looking up the order, verifying eligibility, reversing the payment, and updating inventory") the generated code typically has three problems that are hard to spot until something breaks.</p>
<p>The first is missing dependency management. The code calls APIs in sequence but doesn't properly encode data dependencies between steps. Step 3 needs a specific field from step 2's response. The vibe-coded version might reference the right field name, or it might hallucinate a plausible-sounding one that doesn't exist in the actual API response. You find out at runtime. Maybe in production.</p>
<p>The second is the total absence of compensation actions. Vibe-coded workflows handle the happy path. When step 4 fails after steps 1-3 succeeded, the generated code throws an error and stops. It doesn't reverse the payment from step 3 or release the reservation from step 2. Why would it? You described what should happen, not what to do when it doesn't. Compensating transactions don't emerge from a natural language description of the forward workflow.</p>
<p>The third is implicit assumptions. When you describe a workflow, you carry knowledge the AI doesn't have. "Verify eligibility" means checking five specific conditions in your system. The generated code might check one or two obvious ones and miss the rest. Your business rules, edge cases, regulatory requirements: none of that transfers through a prompt.</p>
<h2 id="heading-the-maintenance-problem">The maintenance problem</h2>
<p>Even if you get the initial version working, vibe-coded agent integrations are rough to maintain.</p>
<p>When the payment API adds a new required parameter, you need to update the workflow. With explicitly defined workflow knowledge, a graph of steps, dependencies, parameters, and constraints, the update is surgical: modify the parameter definition, re-validate the affected workflow, deploy. With vibe-coded logic, you regenerate code from a modified prompt, hope the AI produces something compatible with the rest of the system, and test the whole flow end to end.</p>
<p>This gets worse as complexity grows. A five-step workflow is manageable. A fifty-step workflow spanning multiple API surfaces becomes a regeneration nightmare. Any change risks breaking unrelated steps because the AI doesn't build incrementally on what it generated before. It rebuilds everything from scratch each time.</p>
<h2 id="heading-the-practical-split">The practical split</h2>
<p>The answer isn't to stop vibe coding or to vibe code everything. It's knowing which layers benefit from rapid generation and which need actual engineering.</p>
<p>Vibe code the interface layer. Tool definitions, API client wrappers, response formatting, prompt templates. These are boilerplate-heavy components where generation shines. Build them, verify they work, move on.</p>
<p>Engineer the workflow layer. Which APIs to call, in what order, with what parameters, under what constraints, with what compensation actions when something breaks. This is where things go wrong. This knowledge should come from authoritative sources (your API specs, test suites, documentation), be validated against staging environments, and be maintained as a structured, versionable artifact. Not generated from a prompt.</p>
<p>Better yet, automate the extraction. The ideal setup pulls workflow knowledge from your existing sources of truth, validates it, and exposes it to agents through a standard interface. The agent describes what it wants to do in natural language. Execution follows a validated path. You get the speed of conversational interaction with the reliability of engineered infrastructure.</p>
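<p>What a structured, versionable workflow artifact can look like in miniature, with illustrative field names: the workflow is data, and a dependency error surfaces at validation time instead of at runtime.</p>

```python
# Workflow knowledge as data rather than generated code. Field names illustrative.

REFUND_WORKFLOW = {
    "version": 3,
    "steps": [
        {"id": "get_order",        "needs": [],                  "params": ["order_id"]},
        {"id": "check_policy",     "needs": ["get_order"],       "params": ["order.created_at"]},
        {"id": "reverse_payment",  "needs": ["get_order"],       "params": ["order.charge_id"]},
        {"id": "update_inventory", "needs": ["reverse_payment"], "params": ["order.sku"]},
    ],
}

def validate(workflow: dict) -> bool:
    """Every declared dependency must refer to an earlier step, so a hallucinated
    or reordered dependency fails review, not production."""
    seen = set()
    for step in workflow["steps"]:
        if not set(step["needs"]) <= seen:
            return False
        seen.add(step["id"])
    return True

assert validate(REFUND_WORKFLOW)
```

<p>When the payment API adds a required parameter, the change is one edit to one <code>params</code> list plus a re-validation, not a full regeneration of the workflow code.</p>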
<h2 id="heading-where-does-your-workflow-knowledge-live">Where does your workflow knowledge live?</h2>
<p>If it's embedded in generated code, in the if/else chains and sequential API calls that AI produced from your natural language description, you have a fragility problem. Every change risks breaking things. Every edge case means regeneration. Every failure means debugging code you didn't write and might not fully understand.</p>
<p>If it's extracted and maintained as a separate layer, a knowledge graph of steps, dependencies, parameters, and constraints, you have something you can build on. The agent interface can be vibe-coded, refactored, or replaced entirely without touching the workflow knowledge. The knowledge itself can be updated, validated, and versioned independently.</p>
<p>Vibe coding is a development approach. Workflow reliability is an infrastructure property. They work together when they operate at different layers. They cause problems when you treat them as the same thing.</p>
<hr />
<p><em>Hintas separates workflow knowledge from agent code. Your agents describe what they want in natural language; Hintas returns validated, dependency-aware workflows through <code>search</code> and runs them reliably through <code>execute</code>, all via a standard MCP interface. More at <a target="_blank" href="https://hintas.ai">hintas.ai</a>.</em></p>
<p><em>Photo by <a target="_blank" href="https://unsplash.com/@anoofcreativez?utm_source=hintas&amp;utm_medium=referral">ANOOF C</a> on <a target="_blank" href="https://unsplash.com?utm_source=hintas&amp;utm_medium=referral">Unsplash</a></em></p>
]]></content:encoded></item><item><title><![CDATA[Agentic ops in production: what it takes to run AI workflows that modify real data]]></title><description><![CDATA[What it actually takes to run AI workflows in production
The industry spent two years building AI agents. 2026 is the year those agents need to work for real. Not in sandboxes. Not in demos. Not in internal tools that three people use. In production,...]]></description><link>https://hintas.blog/agentic-ops-running-ai-workflows-in-production</link><guid isPermaLink="true">https://hintas.blog/agentic-ops-running-ai-workflows-in-production</guid><category><![CDATA[ai agents]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Enterprise AI]]></category><category><![CDATA[Production Systems]]></category><category><![CDATA[Workflow Automation]]></category><dc:creator><![CDATA[Dante Kakhadze]]></dc:creator><pubDate>Mon, 16 Mar 2026 20:11:11 GMT</pubDate><enclosure url="https://plus.unsplash.com/premium_photo-1740363268539-cd9093c3b5d1?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMjA3fDB8MXxhbGx8fHx8fHx8fHwxNzczNjkxNzI4fA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-what-it-actually-takes-to-run-ai-workflows-in-production">What it actually takes to run AI workflows in production</h1>
<p>The industry spent two years building AI agents. 2026 is the year those agents need to work for real. Not in sandboxes. Not in demos. Not in internal tools that three people use. In production, where they modify real customer data, process real payments, and trigger real business consequences.</p>
<p>This is a different problem than building agents. Getting an agent to reason about a multi-step task is a model capability question, and the models are good enough now. Running that agent reliably in production with transactional guarantees, observability, failure recovery, and audit trails is an infrastructure question. The infrastructure barely exists yet.</p>
<h2 id="heading-what-production-means-for-agents">What "production" means for agents</h2>
<p>A production agent system needs properties that demos don't.</p>
<p>Transactional integrity is the obvious one. An agent executes a five-step workflow and step 4 fails. The system either completes everything or rolls back to a clean state. No partial execution. No orphaned records. No payments processed without corresponding order updates. This is the <a target="_blank" href="https://microservices.io/patterns/data/saga.html">Saga pattern</a> from distributed systems, reinvented (usually poorly, or not at all) in most agent frameworks.</p>
<p>Then there's auditability. A customer asks "why was my account charged twice?" You need to reconstruct exactly what happened: which agent, which workflow, which steps ran, which parameters were passed, which API responses came back, and when. "The LLM decided to call the billing API" is not an answer your compliance team will accept.</p>
<p>Graceful degradation matters more than people think. APIs go down. Rate limits get hit. Auth tokens expire mid-workflow. A production system handles known failure modes without paging someone at 3am and escalates cleanly for unknown ones. The difference between production and a demo is what happens when things break.</p>
<p>And performance under load. One agent running one workflow is easy. A thousand agents running different workflows concurrently against the same API surface is a completely different problem. Rate limiting, connection pooling, queue management, resource isolation all become critical at once.</p>
<h2 id="heading-the-saga-pattern-adapted-for-agents">The Saga pattern, adapted for agents</h2>
<p>If there's one infrastructure pattern that matters most here, it's Sagas.</p>
<p>The traditional version: each step in a distributed transaction has a compensation action. Process payment, compensation is reverse payment. Reserve inventory, compensation is release inventory. Send notification, compensation is send correction. If any step fails, compensations fire in reverse order to restore the system to a consistent state.</p>
<p>For agent workflows, this addresses what I think of as the atomicity fallacy. Because each individual API call is atomic (succeeds or fails cleanly), people assume a sequence of calls is also safe. It isn't. If an agent processes a payment (step 3) but fails to update inventory (step 4), you have a successful charge and wrong inventory. Both API calls worked fine individually. The workflow is corrupted.</p>
<p>Implementing Sagas for agent workflows requires three things the ecosystem mostly lacks right now.</p>
<p>First, compensation discovery. For every forward action, the system needs to know the compensating action. "Process payment" compensates with "reverse payment." "Create user account" compensates with "delete user account." Some compensations are obvious. Others, like sending an email, don't have true reversals, only follow-up actions. This compensation mapping has to be extracted and validated alongside the forward workflow. You can't bolt it on later.</p>
<p>Second, progress tracking with checkpoints. The system has to know exactly how far a workflow got when something failed. If step 4 breaks, the system must know steps 1-3 completed and need compensation. This needs durable state management that survives process crashes, network partitions, and infra failures. Without it, you're guessing which steps actually ran.</p>
<p>Third, ordered compensation execution. Compensations run in reverse order, each completing before the next fires. If the payment reversal fails, you can't proceed to release inventory. The system state is genuinely indeterminate at that point, and you escalate.</p>
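<p>The three requirements fit in a minimal saga runner. Step names are hypothetical, and the append-only <code>log</code> list stands in for durable checkpoint storage:</p>

```python
# Minimal saga sketch: compensation mapping, checkpoints, reverse-order unwind.

def run_saga(steps, log):
    """steps: list of (name, action, compensation) triples."""
    done = []
    for name, action, compensate in steps:
        try:
            action()
            done.append((name, compensate))
            log.append(("completed", name))        # checkpoint: how far we got
        except Exception:
            log.append(("failed", name))
            for comp_name, comp in reversed(done): # unwind in reverse order;
                comp()                             # if a compensation raises,
                log.append(("compensated", comp_name))  # state is indeterminate: escalate
            return "rolled_back"
    return "committed"

# Step 3 fails after steps 1-2 succeeded; both earlier steps get unwound.
state, log = [], []

def inventory_down():
    raise RuntimeError("inventory API down")

result = run_saga(
    [
        ("reserve_inventory", lambda: state.append("reservation"),
                              lambda: state.remove("reservation")),
        ("charge_payment",    lambda: state.append("charge"),
                              lambda: state.remove("charge")),
        ("update_inventory",  inventory_down, lambda: None),
    ],
    log,
)
assert result == "rolled_back" and state == []  # clean state, full unwind trail in log
```

<p>A production version replaces the in-memory log with durable storage that survives process crashes, which is exactly the checkpoint requirement above.</p>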
<h2 id="heading-observability-for-workflows-not-just-calls">Observability for workflows, not just calls</h2>
<p>Current AI observability tools focus on individual LLM calls: latency, token counts, model versions, prompt/completion pairs. Necessary, but nowhere near sufficient for production.</p>
<p>What you actually need is workflow-level visibility. A complete trace of every execution from trigger to completion. Not just LLM decisions, but every API call, every parameter, every response, every state transition. This is what compliance teams audit and what you debug from when something breaks at 2am.</p>
<p>You need dependency health tracking too. If the payment API slows down, how many active workflows are affected? Which ones are blocked? Which already passed the payment step and don't care? Without this, you can't assess blast radius when an external service degrades.</p>
<p>Compensation success rates are another thing almost nobody tracks. A failed compensation means the system is stuck in an inconsistent state requiring manual intervention. If that rate starts climbing, you have a problem brewing before customers even notice.</p>
<p>And workflow-level SLOs beat per-call metrics for understanding system health. A workflow completing in 30 seconds at 99.5% success is healthy. A workflow where individual calls are fast but overall success is 85% is broken, even if no single step looks bad in isolation.</p>
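<p>The arithmetic behind that gap is just compounding. A back-of-envelope, assuming independent step failures (which is optimistic in practice):</p>

```python
# Per-step success can look healthy while end-to-end success is broken.
# Assumes step failures are independent.

def workflow_success(step_success: float, n_steps: int) -> float:
    """Probability that every step in an n-step workflow succeeds."""
    return step_success ** n_steps

# A 16-step workflow where each individual step succeeds 99% of the time
# completes only ~85% of the time -- "no single step looks bad in isolation."
sixteen_step = workflow_success(0.99, 16)   # ~0.851
```
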
<h2 id="heading-context-decoupled-execution">Context-decoupled execution</h2>
<p>One pattern that makes a real difference in production: decouple context from execution.</p>
<p>The standard agent pattern routes every step through the LLM's context window. Agent calls a tool, gets the result in context, reasons about the next step, calls the next tool. Each intermediate result eats context tokens and each step needs a full model inference.</p>
<p>Context-decoupled execution separates planning from execution. The agent identifies what workflow to run and provides the inputs. The execution engine handles multi-step orchestration internally, calling APIs, passing parameters between steps, handling errors, without routing intermediate results back through the model.</p>
<p>This matters in production for several reasons. Token cost drops because intermediate API responses (often large JSON payloads) never enter the context window. A 20-step workflow costs the same as a single tool call. Execution becomes deterministic: once the workflow is identified, it follows a validated path with no probabilistic reasoning at each step. Latency improves because you eliminate 19 of 20 model inference calls, and the workflow runs at API speed instead of model speed. Isolation becomes possible since the execution engine can run in a sandbox (V8 Isolates, Firecracker microVMs) with proper security boundaries.</p>
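<p>A rough sketch of the split, with hypothetical workflow and step names: the agent makes one decision (which workflow, what inputs), and the engine runs every step without routing intermediate payloads back through the model.</p>

```python
# Context-decoupled sketch: one model decision, N engine-executed steps.
# Workflow and step names are illustrative, not a real API.

WORKFLOWS = {
    "process_refund": [
        ("lookup_order",     lambda ctx: {**ctx, "order": {"id": ctx["order_id"], "total": 42}}),
        ("reverse_payment",  lambda ctx: {**ctx, "refunded": ctx["order"]["total"]}),
        ("update_inventory", lambda ctx: {**ctx, "restocked": True}),
    ],
}

def execute(workflow: str, inputs: dict) -> dict:
    """Deterministic engine: intermediate results stay in ctx, never in the
    model's context window. Only a final summary goes back to the agent."""
    ctx = dict(inputs)
    for name, step in WORKFLOWS[workflow]:
        ctx = step(ctx)  # an API call in a real system
    return {"workflow": workflow, "status": "completed", "refunded": ctx.get("refunded")}
```
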
<h2 id="heading-a-rough-maturity-model">A rough maturity model</h2>
<p>Production readiness isn't binary. Teams tend to move through stages.</p>
<p>Level 0 is ad-hoc. Agents call APIs directly. No workflow knowledge, no transaction management, no observability beyond logs. This is where most teams are today.</p>
<p>Level 1 is structured. Multi-step processes are defined as explicit workflows. The agent follows a known path instead of improvising. Basic success/failure tracking exists.</p>
<p>Level 2 is transactional. Saga compensations are defined for each step. Failures trigger automated rollback. Checkpoints enable recovery from infra failures.</p>
<p>Level 3 is managed. Workflows run on managed infrastructure with auth, isolation, audit logging, and workflow-level observability. Execution history feeds back into workflow refinement.</p>
<p>Most teams are at Level 0. They need to reach Level 2 before agents are production-ready. That jump is a lot of engineering work, unless the infrastructure already exists.</p>
<hr />
<p><em>Hintas gives you Level 2-3 out of the box. Validated workflows with Saga-pattern rollback, managed execution on isolated infrastructure, and audit logging for every step. Your agent calls <code>search</code> to find the right workflow and <code>execute</code> to run it. Hintas handles the orchestration. More at <a target="_blank" href="https://hintas.ai">hintas.ai</a>.</em></p>
<p><em>Photo by <a target="_blank" href="https://unsplash.com/@alexshuper?utm_source=hintas&amp;utm_medium=referral">Alex Shuper</a> on <a target="_blank" href="https://unsplash.com?utm_source=hintas&amp;utm_medium=referral">Unsplash</a></em></p>
]]></content:encoded></item><item><title><![CDATA[Why industry-specific AI beats general-purpose tools for SaaS workflows]]></title><description><![CDATA[The general-purpose AI agent pitch sounds great: one system, every domain, every customer. Build once, deploy everywhere.
But the teams actually getting results from AI agents in 2026? They're going vertical. They're encoding domain-specific workflow...]]></description><link>https://hintas.blog/vertical-saas-why-industry-specific-ai-wins</link><guid isPermaLink="true">https://hintas.blog/vertical-saas-why-industry-specific-ai-wins</guid><category><![CDATA[ai agents]]></category><category><![CDATA[Enterprise AI]]></category><category><![CDATA[SaaS]]></category><category><![CDATA[Vertical SaaS]]></category><category><![CDATA[Workflow Automation]]></category><dc:creator><![CDATA[Dante Kakhadze]]></dc:creator><pubDate>Sun, 15 Mar 2026 15:54:26 GMT</pubDate><enclosure url="https://images.unsplash.com/photo-1648737851268-f5c8d9d3a356?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3w4OTU2Njd8MHwxfHNlYXJjaHwxfHx2ZXJ0aWNhbCUyMGluZHVzdHJ5JTIwc3BlY2lhbGl6YXRpb24lMjBsYXllcnN8ZW58MHwwfHx8MTc3MzU4OTcyMHww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The general-purpose AI agent pitch sounds great: one system, every domain, every customer. Build once, deploy everywhere.</p>
<p>But the teams actually getting results from AI agents in 2026? They're going vertical. They're encoding domain-specific workflow knowledge, not chasing generic reasoning. The reason is mundane. It's about how real business processes are structured, not about model intelligence.</p>
<h2 id="heading-the-generalization-trap">The generalization trap</h2>
<p>A general-purpose AI agent looks at your SaaS API surface and sees endpoints. Hundreds of them. <code>POST /api/v2/orders</code>, <code>GET /api/v2/customers/{id}</code>, <code>PUT /api/v2/inventory/{sku}</code>. Each one has a schema. Parameters, return types, the usual.</p>
<p>What the agent <em>doesn't</em> see is the industry context that tells it how those endpoints fit together. Processing a return in e-commerce is a completely different animal from processing a return in medical device distribution. Both involve order lookup, eligibility checks, inventory adjustments. Same verbs. But the eligibility rules, compliance requirements, and downstream consequences have almost nothing in common.</p>
<p>A general-purpose agent treats both the same way: read the schemas, reason about steps, execute sequentially. It has no idea that medical device returns require <a target="_blank" href="https://www.ecfr.gov/current/title-21/chapter-I/subchapter-H/part-821">lot tracking under 21 CFR Part 821</a>, that the <a target="_blank" href="https://www.fda.gov/medical-devices/postmarket-requirements-devices/medical-device-tracking">FDA mandates specific tracking documentation</a> including UDI, serial numbers, and disposition records, or that the inventory adjustment has to trigger a quarantine workflow before anything gets restocked.</p>
<p>And no, you can't fix this with a better system prompt. You can't cram industry-specific workflow logic into a prompt and expect it to hold up across hundreds of business processes. The knowledge is too deep, too interconnected, too dependent on context that only practitioners carry.</p>
<h2 id="heading-what-vertical-actually-means-here">What "vertical" actually means here</h2>
<p>Going vertical doesn't mean building a separate AI product for every industry. It means building infrastructure where the workflow knowledge layer is industry-specific while the platform underneath is shared.</p>
<p>This distinction matters a lot. The execution engine that handles multi-step API orchestration, dependency resolution, <a target="_blank" href="https://microservices.io/patterns/data/saga.html">transactional rollback</a>? Same regardless of industry. The <a target="_blank" href="https://modelcontextprotocol.io/">MCP</a> interface agents use to access workflow knowledge? Same. The validation pipeline that tests extracted workflows against staging environments? Same.</p>
<p>What changes per vertical is the knowledge graph. An e-commerce deployment has workflow nodes for order processing, inventory management, fulfillment, returns. A healthcare deployment has nodes for patient intake, claims processing, prior authorization, care coordination. Each customer's knowledge graph encodes their specific API surface, their specific workflow patterns, their specific business rules. This is <a target="_blank" href="/agent-memory-as-first-class-primitive">agent memory as infrastructure</a>, not a hack bolted onto a stateless system.</p>
<p>Put simply: platform is horizontal, knowledge is vertical. You don't rebuild the engine for each industry. You populate it with different knowledge.</p>
<h2 id="heading-knowledge-compounds-within-verticals">Knowledge compounds within verticals</h2>
<p>This is the part we find most interesting. Industry-specific workflow knowledge compounds.</p>
<p>When you onboard your first e-commerce customer and extract their refund workflow, you learn the basic pattern: order lookup, eligibility check, payment reversal, inventory update. By customer five, you notice they all share 60-70% of the same workflow structure. The differences are in eligibility rules, payment providers, notification preferences.</p>
<p>By customer ten, the extraction pipeline knows what to look for. It recognizes common patterns and focuses <a target="_blank" href="/human-in-the-loop-is-the-production-architecture">human review</a> on the variations that make each customer unique. Extraction accuracy goes up because the system has seen structurally similar workflows before.</p>
<p>This cross-customer learning (anonymized, obviously) is impossible in a general-purpose architecture where every deployment starts from scratch. Each new customer in a vertical makes the system better for every other customer in that vertical. <a target="_blank" href="https://www.bain.com/insights/will-agentic-ai-disrupt-saas-technology-report-2025/">Bessemer Venture Partners projects</a> that vertical AI market cap could grow 10x larger than legacy SaaS solutions, and industry-specific tools are <a target="_blank" href="https://www.turing.com/resources/vertical-ai-agents">growing 2-3x faster</a> than general productivity tools. The compounding knowledge advantage is a big part of why.</p>
<h2 id="heading-why-generic-mcp-servers-hit-a-ceiling">Why generic MCP servers hit a ceiling</h2>
<p>The current wave of MCP server generators, tools like <a target="_blank" href="https://www.speakeasy.com/product/mcp-server">Speakeasy</a> and <a target="_blank" href="https://www.stainless.com/docs/guides/generate-mcp-server-from-openapi/">Stainless</a> that convert OpenAPI specs into MCP-compatible tool interfaces, solve the API access problem well. They give agents the ability to call individual endpoints. Fast, clean, works.</p>
<p>But they stop at API wrapping. Every endpoint becomes a separate tool. An agent connecting to a <a target="_blank" href="https://www.speakeasy.com/blog/comparison-mcp-server-generators">Speakeasy-generated MCP server</a> for Stripe sees hundreds of tools, one per endpoint. It still has to independently figure out that processing a refund means calling five specific endpoints in a specific order with specific parameter mappings between them.</p>
<p>That's the ceiling. It works for any API but understands none of them. The agent gets a toolbox with no instructions. And without instructions, without workflow knowledge, <a target="_blank" href="https://simmering.dev/blog/agent-benchmarks/">success rates on multi-step tasks stay low</a> no matter how capable the model is. We covered the broader infrastructure implications of this in <a target="_blank" href="/ai-is-the-foundation-not-a-feature">AI is the foundation, not a feature</a>.</p>
<p>Industry-specific workflow knowledge turns the toolbox into a set of instructions. Instead of 200 individual tools, the agent sees "process refund," "onboard customer," "generate invoice." Complete, validated workflows that execute reliably because the multi-step orchestration is handled by the infrastructure, not improvised step-by-step by the model.</p>
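<p>The shift in what the agent sees can be sketched like this (every tool and workflow name below is illustrative):</p>

```python
# Endpoint-level exposure: one tool per endpoint, no ordering knowledge.
ENDPOINT_TOOLS = [f"endpoint_{i}" for i in range(200)]  # stand-in for 200 raw endpoints

# Workflow-level exposure: a few validated workflows, each wrapping its steps
# so the multi-step orchestration lives in the infrastructure.
WORKFLOW_TOOLS = {
    "process_refund":   ["lookup_order", "check_eligibility", "reverse_payment",
                         "update_inventory", "notify_customer"],
    "onboard_customer": ["create_account", "configure_billing", "send_welcome"],
    "generate_invoice": ["fetch_usage", "apply_pricing", "issue_invoice"],
}

def agent_surface(workflow_tools: dict) -> list[str]:
    """What the agent actually has to choose between."""
    return sorted(workflow_tools)
```

<p>The agent's decision space shrinks from 200 ambiguous endpoints to three named business operations; the step lists stay on the infrastructure side.</p>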
<h2 id="heading-where-this-leaves-the-market">Where this leaves the market</h2>
<p>Vertical SaaS has been the dominant growth model in enterprise software for a decade. The same dynamics apply to AI agent infrastructure, probably even more so because of the knowledge compounding effect. <a target="_blank" href="https://www.gartner.com/en/newsroom/press-releases/2025-08-26-gartner-predicts-40-percent-of-enterprise-apps-will-feature-task-specific-ai-agents-by-2026-up-from-less-than-5-percent-in-2025">Gartner predicts 40% of enterprise apps will feature task-specific AI agents by 2026</a>, up from less than 5% in 2025 — and task-specific means domain-specific.</p>
<p>General-purpose agent platforms will stick around, doing what horizontal SaaS has always done: providing common capabilities. But the workflow intelligence layer, the part that knows how business processes actually work in specific domains, will be vertical.</p>
<p>The companies building this layer for a given vertical will accumulate proprietary knowledge graphs representing validated workflow patterns across dozens or hundreds of customer deployments. That knowledge is the moat. It captures institutional expertise that no single customer has, it compounds with each deployment, and replicating it requires the same investment a competitor would need to make from scratch.</p>
<p>If you're evaluating your AI agent strategy, the real question isn't "should we use AI?" It's "how do we encode our industry-specific workflow knowledge so agents can actually use it?" General-purpose tools get you API access. Industry-specific knowledge is what gets you working workflows.</p>
<hr />
<p><em>Hintas extracts and validates industry-specific workflow knowledge from your existing sources of truth, then deploys it as a managed MCP server any agent can consume. Each deployment builds on cross-customer patterns within your vertical. More at <a target="_blank" href="https://hintas.ai">hintas.ai</a>.</em></p>
<p><em>Photo by <a target="_blank" href="https://unsplash.com/@bastien_nvs?utm_source=hintas&amp;utm_medium=referral">Bastien Nvs</a> on <a target="_blank" href="https://unsplash.com?utm_source=hintas&amp;utm_medium=referral">Unsplash</a></em></p>
]]></content:encoded></item><item><title><![CDATA[AI is no longer a SaaS feature. It's the foundation. Here's what that actually requires.]]></title><description><![CDATA[Two years ago, SaaS companies added AI as a feature. A "Summarize" button here, a "Generate" button there, maybe a chatbot in the support widget. These were additive capabilities. The product worked fine without them, and they worked fine as isolated...]]></description><link>https://hintas.blog/ai-is-the-foundation-not-a-feature</link><guid isPermaLink="true">https://hintas.blog/ai-is-the-foundation-not-a-feature</guid><category><![CDATA[ai agents]]></category><category><![CDATA[Developer Tools]]></category><category><![CDATA[Enterprise AI]]></category><category><![CDATA[SaaS]]></category><category><![CDATA[Workflow Automation]]></category><dc:creator><![CDATA[Dante Kakhadze]]></dc:creator><pubDate>Sun, 15 Mar 2026 15:54:17 GMT</pubDate><enclosure url="https://images.unsplash.com/photo-1760553120324-d3d2bf53852b?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3w4OTU2Njd8MHwxfHNlYXJjaHwxfHxzb2Z0d2FyZSUyMGluZnJhc3RydWN0dXJlJTIwYXJjaGl0ZWN0dXJlJTIwYnVpbGRpbmd8ZW58MHwwfHx8MTc3MzU4OTcxN3ww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Two years ago, SaaS companies added AI as a feature. A "Summarize" button here, a "Generate" button there, maybe a chatbot in the support widget. These were additive capabilities. The product worked fine without them, and they worked fine as isolated features.</p>
<p>That era is ending. The SaaS products gaining traction in 2026 aren't bolting AI features onto existing workflows. They're rebuilding workflows around AI capabilities. <a target="_blank" href="https://www.bain.com/insights/will-agentic-ai-disrupt-saas-technology-report-2025/">Bain's 2025 technology report</a> frames this as the shift from "AI-enabled SaaS" to "AI-native SaaS," and the distinction matters because it creates infrastructure problems that the feature era never had to deal with.</p>
<h2 id="heading-the-feature-era-was-easy">The feature era was easy</h2>
<p>Adding a summarization button to a dashboard is straightforward. Take the text on the screen, send it to an LLM API, display the result. The AI capability is self-contained. It doesn't need to understand the rest of your system. It doesn't need to call other APIs. It doesn't need graceful failure handling because a bad summary is annoying, not catastrophic.</p>
<p>Feature-level AI has three nice properties: it's stateless (each call is independent), it's read-only (it consumes data but doesn't change system state), and it's fault-tolerant (if the LLM returns garbage, the user just ignores it).</p>
<p>Foundation-level AI has none of these.</p>
<h2 id="heading-what-foundation-level-ai-demands">What foundation-level AI demands</h2>
<p>When AI becomes the foundation, when your product's core value depends on agents executing multi-step workflows across your system, you hit requirements that feature-level integrations never had to think about.</p>
<p><strong>Stateful workflow execution.</strong> An AI agent processing a customer refund needs to maintain state across seven API calls: verify the order, check eligibility, calculate the amount, reverse the payment, update inventory, send confirmation, log for compliance. Each step depends on the previous step's output. Lose state between steps and you get a payment reversal without an inventory update. Or a confirmation email for a refund that actually failed. Neither is acceptable.</p>
<p>This is a different world from "send text to LLM, display result." It requires orchestration infrastructure: workflow dependency resolution, parameter passing between steps, state management across API boundaries. As we covered in <a target="_blank" href="/agent-memory-as-first-class-primitive">agent memory as a first-class primitive</a>, this structural knowledge about how APIs connect is exactly what most agent systems lack.</p>
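<p>In miniature, "workflow dependency resolution" means treating those refund steps as a dependency graph and deriving a valid execution order from it. A sketch using Python's standard-library <code>graphlib</code>; the graph itself is our illustrative reading of the steps above:</p>

```python
from graphlib import TopologicalSorter

# Refund workflow: each step maps to the set of steps that must finish first.
deps = {
    "verify_order":      set(),
    "check_eligibility": {"verify_order"},
    "calculate_amount":  {"check_eligibility"},
    "reverse_payment":   {"calculate_amount"},
    "update_inventory":  {"reverse_payment"},
    "send_confirmation": {"reverse_payment"},
    "log_compliance":    {"send_confirmation", "update_inventory"},
}

# A valid execution order: every step appears after its dependencies.
order = list(TopologicalSorter(deps).static_order())
```
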
<p><strong>Write operations with rollback.</strong> Feature-level AI reads data and presents it. Foundation-level AI modifies data. It creates records, updates statuses, processes payments, triggers notifications. When a multi-step write operation fails midway, you need compensation actions: reverse the payment, cancel the notification, restore the inventory count. Without transactional guarantees, partial failures leave your system in a corrupted state that's painful to untangle.</p>
<p>The distributed systems community solved this decades ago with the <a target="_blank" href="https://microservices.io/patterns/data/saga.html">Saga pattern</a>: pair every forward action with a compensating action, execute compensations in reverse order on failure. <a target="_blank" href="https://docs.aws.amazon.com/prescriptive-guidance/latest/agentic-ai-patterns/saga-orchestration-patterns.html">AWS now documents Saga orchestration patterns specifically for agentic AI</a>, and platforms like <a target="_blank" href="https://temporal.io/blog/mastering-saga-patterns-for-distributed-transactions-in-microservices">Temporal</a> treat sagas as durable long-running workflows with built-in retry and versioning. But this pattern hasn't made it into most AI agent frameworks. <a target="_blank" href="https://www.langchain.com/">LangChain</a> and <a target="_blank" href="https://www.crewai.com/">CrewAI</a> handle prompt management and tool registration, <a target="_blank" href="https://www.scalekit.com/blog/langchain-vs-crewai-multi-agent-workflows">not distributed transactions</a>.</p>
<p><strong>Deterministic reliability.</strong> A feature that fails 5% of the time is annoying. A foundation that fails 5% of the time is unusable. If your product's core workflow depends on an AI agent executing a multi-step process, that process needs to succeed with traditional-software reliability: 99.5%+ for business-critical operations.</p>
<p>Current benchmarks tell a sobering story. On <a target="_blank" href="https://sierra.ai/blog/benchmarking-ai-agents">τ-bench</a>, the best GPT-4o agent achieved less than 50% average success rate across two domains. A <a target="_blank" href="https://simmering.dev/blog/agent-benchmarks/">2025 survey of 306 AI agent practitioners</a> found that reliability is the biggest barrier to enterprise adoption, and teams are actively avoiding open-ended, long-running tasks in favor of shorter workflows.</p>
<p>Getting there means moving from probabilistic execution (the agent reasons about each step on the fly) to deterministic execution (the workflow follows a pre-validated, known path). The agent's intelligence gets used at query time to understand what the user wants. The execution itself follows a tested path that doesn't depend on the model getting each individual step right in the moment.</p>
<h2 id="heading-the-infrastructure-gap">The infrastructure gap</h2>
<p>SaaS has mature infrastructure for almost everything. You can deploy a web application in minutes. CI/CD, monitoring, alerting, all off-the-shelf. Payment processing, email delivery, analytics, all solved problems.</p>
<p>But there's no off-the-shelf solution for "make my AI agent reliably execute multi-step workflows across my API surface." I keep seeing teams hit the same three missing pieces:</p>
<p><strong>Workflow knowledge extraction.</strong> The knowledge of how your APIs connect, which endpoints to call in what order with what parameters and what to do when something fails, currently lives in engineers' heads and scattered docs. Nothing automatically extracts this from your <a target="_blank" href="https://www.openapis.org/">OpenAPI specs</a>, test suites, and internal documentation and makes it available to AI agents.</p>
<p><strong>Workflow validation.</strong> Even if you manually encode workflow knowledge, how do you verify it's correct? Running extracted workflows against a staging environment, confirming each step produces the expected output, validating that compensation actions work when you inject failures. This validation pipeline doesn't exist in current AI tooling. It's the same <a target="_blank" href="/human-in-the-loop-is-the-production-architecture">human-in-the-loop validation gap</a> we've written about before, applied to infrastructure rather than individual agent decisions.</p>
<p><strong>Managed execution infrastructure.</strong> Validated workflows need somewhere to run. That somewhere needs authentication (agents need verified identity), isolation (a failing workflow shouldn't take down other workflows), audit logging (every action needs a trail), and the transactional guarantees we just discussed. And the <a target="_blank" href="/mcp-security-the-unvetted-server-problem">security of the MCP servers</a> exposing those workflows matters just as much as the workflows themselves.</p>
<h2 id="heading-what-this-means-for-saas-builders">What this means for SaaS builders</h2>
<p>If you're building a SaaS product and your roadmap includes "AI-powered workflows" (and let's be honest, everyone's does), you're going to hit this infrastructure gap. <a target="_blank" href="https://www.gartner.com/en/newsroom/press-releases/2025-08-26-gartner-predicts-40-percent-of-enterprise-apps-will-feature-task-specific-ai-agents-by-2026-up-from-less-than-5-percent-in-2025">Gartner predicts 40% of enterprise apps will feature task-specific AI agents by 2026</a>, up from less than 5% in 2025. The question is how you deal with it.</p>
<p><strong>Build it yourself.</strong> Viable if your engineering team has distributed systems experience and you have a limited number of workflows. But the maintenance burden compounds: every API change means updating workflow definitions, every new workflow needs extraction and validation, and the execution infrastructure needs ongoing operational investment.</p>
<p><strong>Wait for the ecosystem.</strong> You could wait for MCP server generators like <a target="_blank" href="https://www.speakeasy.com/product/mcp-server">Speakeasy</a> and <a target="_blank" href="https://www.stainless.com/docs/guides/generate-mcp-server-from-openapi/">Stainless</a> to move beyond endpoint-level wrapping and add workflow intelligence. Possible, but not guaranteed. Their core competency is code generation from API specs, not workflow knowledge extraction. Those are different problems.</p>
<p><strong>Use purpose-built infrastructure.</strong> Something that handles extraction, validation, and execution, exposed through a standard interface (<a target="_blank" href="https://modelcontextprotocol.io/">MCP</a>) that any agent can consume. This scales better because the workflow knowledge lives as a persistent, evolving asset rather than getting reimplemented for each new AI feature.</p>
<h2 id="heading-so-what-does-this-actually-mean">So what does this actually mean</h2>
<p>AI-as-feature was a product decision. AI-as-foundation is an infrastructure decision. The infrastructure requirements — stateful orchestration, transactional integrity, deterministic reliability — are the difference between a demo and a product.</p>
<p>The SaaS companies that figure out this infrastructure layer will build products where AI actually does the work, not just summarizes it. Everyone else will keep shipping "Summarize" buttons.</p>
<hr />
<p><em>Hintas provides the workflow infrastructure layer for when AI becomes the foundation: automated knowledge extraction, validated execution paths, and managed MCP deployment with transactional guarantees. If you're hitting this problem, take a look at <a target="_blank" href="https://hintas.ai">hintas.ai</a>.</em></p>
<p><em>Photo by <a target="_blank" href="https://unsplash.com/@anoofc?utm_source=hintas&amp;utm_medium=referral">ANOOF C</a> on <a target="_blank" href="https://unsplash.com?utm_source=hintas&amp;utm_medium=referral">Unsplash</a></em></p>
]]></content:encoded></item><item><title><![CDATA[A2A + MCP: two protocols, one interoperability layer]]></title><description><![CDATA[The agentic AI world has been quietly converging on two protocols. MCP (Model Context Protocol), originally from Anthropic, now under the Linux Foundation's Agentic AI Foundation, handles how agents connect to tools and data. A2A (Agent-to-Agent), la...]]></description><link>https://hintas.blog/a2a-plus-mcp-one-interoperability-layer</link><guid isPermaLink="true">https://hintas.blog/a2a-plus-mcp-one-interoperability-layer</guid><category><![CDATA[A2A]]></category><category><![CDATA[ai agents]]></category><category><![CDATA[Developer Tools]]></category><category><![CDATA[mcp]]></category><category><![CDATA[Workflow Automation]]></category><dc:creator><![CDATA[Dante Kakhadze]]></dc:creator><pubDate>Sat, 14 Mar 2026 20:56:59 GMT</pubDate><enclosure url="https://images.unsplash.com/photo-1664526937033-fe2c11f1be25?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3w4OTU2Njd8MHwxfHNlYXJjaHwxfHxuZXR3b3JrJTIwY29ubmVjdGlvbnMlMjBwcm90b2NvbCUyMGludGVyb3BlcmFiaWxpdHl8ZW58MHwwfHx8MTc3MzUyMTY2Nnww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The agentic AI world has been quietly converging on two protocols. <a target="_blank" href="https://modelcontextprotocol.io/">MCP</a> (Model Context Protocol), originally from Anthropic, now under the <a target="_blank" href="https://blog.modelcontextprotocol.io/posts/2025-12-09-mcp-joins-agentic-ai-foundation/">Linux Foundation's Agentic AI Foundation</a>, handles how agents connect to tools and data. <a target="_blank" href="https://github.com/a2aproject/A2A">A2A</a> (Agent-to-Agent), <a target="_blank" href="https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/">launched by Google in April 2025</a> with over 50 technology partners, also Linux Foundation, handles how agents talk to each other.</p>
<p>Different problems. But they'll eventually need to work as one system. If you're building multi-agent architectures today, you need to understand where each protocol ends and the other begins.</p>
<h2 id="heading-what-each-protocol-actually-does">What each protocol actually does</h2>
<p><strong>MCP</strong> is a client-server protocol for agent-to-tool communication. An MCP server exposes tools (executable functions), resources (read-only data), and prompts (reusable templates). An MCP client, whether that's Claude Desktop, an IDE, or your custom application, connects to the server and makes these capabilities available to an LLM. Wire protocol is <a target="_blank" href="https://www.jsonrpc.org/specification">JSON-RPC 2.0</a> over <a target="_blank" href="https://modelcontextprotocol.io/specification/2025-03-26/basic/transports">stdio or HTTP</a>.</p>
<p>The key design choice: the LLM decides <em>when</em> to use a tool, but the server defines <em>what</em> the tool does. The agent owns the decision-making; the server owns the execution. (If you're connecting to community MCP servers, the security implications of this trust model are worth reading about in <a target="_blank" href="/mcp-security-the-unvetted-server-problem">MCP security: every unvetted server is an attack surface</a>.)</p>
<p><strong>A2A</strong> is peer-to-peer. Where MCP connects an agent to a tool, A2A connects an agent to another agent. The interaction model is different in kind. Instead of calling a function and getting a result, an agent submits a Task to another agent, and that task moves through a lifecycle: SUBMITTED, WORKING, INPUT_REQUIRED, COMPLETED, FAILED. The protocol is built on <a target="_blank" href="https://www.ibm.com/think/topics/agent2agent-protocol">HTTP, SSE, and JSON-RPC</a>, so it fits into existing enterprise infrastructure.</p>
<p>A2A also introduces two ideas MCP doesn't have. <a target="_blank" href="https://google.github.io/A2A/specification/">Agent Cards</a> are machine-readable JSON descriptions of what an agent can do, used for dynamic discovery. Multi-turn negotiation means an agent can request additional input mid-task, so the interaction is collaborative rather than purely request-response. <a target="_blank" href="https://cloud.google.com/blog/products/ai-machine-learning/agent2agent-protocol-is-getting-an-upgrade">Version 0.3</a>, released in July 2025, added gRPC support and signed security cards.</p>
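<p>The task lifecycle reads naturally as a small state machine. The states below are the ones named above; the transition table is our simplified reading, not the normative one from the spec:</p>

```python
from enum import Enum, auto

class TaskState(Enum):
    SUBMITTED = auto()
    WORKING = auto()
    INPUT_REQUIRED = auto()
    COMPLETED = auto()
    FAILED = auto()

# Allowed transitions (simplified). Multi-turn negotiation is the
# WORKING <-> INPUT_REQUIRED loop; COMPLETED and FAILED are terminal.
TRANSITIONS = {
    TaskState.SUBMITTED:      {TaskState.WORKING, TaskState.FAILED},
    TaskState.WORKING:        {TaskState.INPUT_REQUIRED, TaskState.COMPLETED, TaskState.FAILED},
    TaskState.INPUT_REQUIRED: {TaskState.WORKING, TaskState.FAILED},
    TaskState.COMPLETED:      set(),
    TaskState.FAILED:         set(),
}

def advance(current: TaskState, nxt: TaskState) -> TaskState:
    """Move a task to its next state, rejecting illegal transitions."""
    if nxt not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {nxt.name}")
    return nxt
```
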
<h2 id="heading-theyre-complementary-not-competing">They're complementary, not competing</h2>
<p>People keep framing A2A and MCP as competing standards. They're not. They operate at different layers.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<th>Layer</th><th>Protocol</th><th>Interaction</th><th>Example</th></tr>
</thead>
<tbody>
<tr>
<td>Tool/Data Integration</td><td>MCP</td><td>Client-Server</td><td>Agent calls a database query tool</td></tr>
<tr>
<td>Agent Collaboration</td><td>A2A</td><td>Peer-to-Peer</td><td>Billing Agent delegates to Compliance Agent</td></tr>
</tbody>
</table>
</div><p>A refund-processing agent might use MCP to connect to the payment API, the inventory system, and the customer notification service. That same agent might use A2A to delegate a compliance check to a specialized Compliance Agent with its own tools and reasoning.</p>
<p>MCP is vertical: agent reaches down to tools. A2A is horizontal: agent reaches across to peers. In production, you need both.</p>
<h2 id="heading-where-they-converge">Where they converge</h2>
<p>The convergence happens at what we'd call the "workflow boundary." Take a complex operation: onboarding a new enterprise customer. That involves:</p>
<ol>
<li>Provision infrastructure (DevOps)</li>
<li>Configure billing (Finance)</li>
<li>Set up user accounts (Identity)</li>
<li>Run compliance checks (Legal)</li>
<li>Send welcome communications (Marketing)</li>
</ol>
<p>With a single agent and MCP, one agent connects to all the tools across all five domains and orchestrates everything. This works for simple cases. It falls apart when each domain has dozens of APIs, domain-specific logic, and its own failure modes. The agent's context window fills up with tool schemas before it even starts doing useful work.</p>
<p>With multiple agents and A2A, a coordinator delegates domain-specific tasks to specialists. The DevOps Agent knows infrastructure provisioning. The Finance Agent knows billing. Each specialist uses MCP for its domain's tools. The coordinator uses A2A to orchestrate across agents.</p>
<p>The tricky part is what sits between them. Something needs to know that infrastructure provisioning must finish before billing starts, that the compliance check can run in parallel with account setup, and that if any step fails, the entire operation needs coordinated rollback across all agents.</p>
<h2 id="heading-the-missing-orchestration-layer">The missing orchestration layer</h2>
<p>Neither protocol solves orchestration. MCP gives agents access to tools. A2A gives agents access to each other. But the knowledge of <em>which tools to call in what order</em> and <em>which agents to coordinate for which tasks</em> lives outside both protocols.</p>
<p>This is the same gap that already exists in MCP-only architectures, just amplified by the multi-agent dimension. Today's MCP servers expose individual API endpoints as individual tools. An agent connecting to a payment platform's MCP server might see 200 tools: every endpoint exposed as a separate function. The agent still has to figure out which 7 of those 200 it needs for a refund, in what order, with what parameter mappings between steps.</p>
<p>Add A2A and the problem multiplies. The coordinator needs to know which specialist handles which domain, what information to pass between agents, how to handle partial failures across agent boundaries, and how to maintain transactional integrity when three different agents have each completed part of a workflow. As we covered in <a target="_blank" href="/agent-memory-as-first-class-primitive">agent memory as a first-class primitive</a>, this structural knowledge — which tools depend on which, what parameters flow where — is exactly what knowledge graphs encode and vector stores can't.</p>
<p>Workflow knowledge — how multi-step processes actually work, encoded as validated execution paths with real dependency ordering — is what makes both MCP and A2A useful in production. Without it, MCP is a toolbox without instructions. A2A is a phone system where nobody knows who to call. (We keep coming back to this analogy because it's unfortunately accurate.)</p>
<h2 id="heading-building-for-the-convergence">Building for the convergence</h2>
<p>If you're designing agent architectures today, here's what I'd prioritize:</p>
<p><strong>Separate workflow knowledge from protocol implementation.</strong> Your understanding of "how to process a refund" shouldn't be embedded in MCP tool definitions or A2A agent configs. It should exist as its own layer, protocol-agnostic. When A2A adoption matures, the same workflow knowledge powering your MCP-based execution should power your A2A-based coordination without a rewrite.</p>
<p><strong>Design around domain boundaries.</strong> Find the natural seams in your business processes, the points where one team's expertise ends and another's begins. Those seams will become agent boundaries in a multi-agent architecture. Each domain agent uses MCP for its tools, A2A for coordination with peers.</p>
<p><strong>Invest in transactional guarantees now.</strong> Multi-agent coordination makes the transactional integrity problem harder, not easier. The <a target="_blank" href="https://microservices.io/patterns/data/saga.html">Saga pattern</a>, where each forward action has a <a target="_blank" href="https://learn.microsoft.com/en-us/azure/architecture/patterns/saga">defined compensation action</a>, works across both single-agent and multi-agent setups. Building this into your workflow execution layer now saves you from retrofitting it when A2A coordination introduces cross-agent failure modes.</p>
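<p>A minimal sketch of that pattern, with invented step names: each forward action pairs with a compensation, and a failure part-way through triggers the compensations for already-completed steps in reverse order.</p>

```python
def run_saga(steps):
    """steps: list of (name, action, compensation) tuples.
    On failure, compensate completed steps in reverse order."""
    completed = []
    for name, action, compensate in steps:
        try:
            action()
            completed.append((name, compensate))
        except Exception:
            for _done_name, comp in reversed(completed):
                comp()  # best-effort rollback of earlier steps
            return {"status": "rolled_back", "failed_step": name}
    return {"status": "committed"}

log = []

def fail_shipping():  # simulated downstream failure
    raise RuntimeError("carrier down")

saga = [
    ("reserve_inventory", lambda: log.append("reserved"),
     lambda: log.append("released")),
    ("charge_card", lambda: log.append("charged"),
     lambda: log.append("refunded")),
    ("ship_order", fail_shipping, lambda: None),
]
result = run_saga(saga)
# The shipping failure triggers the refund first, then the inventory
# release: compensations run in reverse order of the forward actions.
```

<p>The same executor works whether the steps are MCP tool calls made by one agent or tasks delegated over A2A — which is why it belongs in the workflow layer, not the protocol layer.</p>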
<p>Both protocols are going to matter. The teams that handle the convergence well will be the ones who built the workflow knowledge layer that sits between them, the layer that knows what needs to happen regardless of which protocol carries the messages.</p>
<hr />
<p><em>Hintas's workflow knowledge layer is protocol-agnostic by design. Today it deploys as a single MCP server. A2A support for multi-agent orchestration is on our roadmap, powered by the same validated knowledge graph. More at <a target="_blank" href="https://hintas.ai">hintas.ai</a>.</em></p>
<p><em>Photo by <a target="_blank" href="https://unsplash.com/@guerrillabuzz?utm_source=hintas&amp;utm_medium=referral">GuerrillaBuzz</a> on <a target="_blank" href="https://unsplash.com?utm_source=hintas&amp;utm_medium=referral">Unsplash</a></em></p>
]]></content:encoded></item><item><title><![CDATA[MCP security: every unvetted server is an attack surface you chose to ignore]]></title><description><![CDATA[MCP adoption is moving fast. The official Python and TypeScript SDKs now see over 97 million monthly downloads. Anthropic donated the protocol to the Linux Foundation's Agentic AI Foundation in Decemb]]></description><link>https://hintas.blog/mcp-security-the-unvetted-server-problem</link><guid isPermaLink="true">https://hintas.blog/mcp-security-the-unvetted-server-problem</guid><category><![CDATA[ai agents]]></category><category><![CDATA[cybersecurity]]></category><category><![CDATA[Developer Tools]]></category><category><![CDATA[Enterprise AI]]></category><category><![CDATA[mcp]]></category><dc:creator><![CDATA[Dante Kakhadze]]></dc:creator><pubDate>Sat, 14 Mar 2026 20:52:29 GMT</pubDate><enclosure url="https://images.unsplash.com/photo-1614064642261-3ccbfafa481b?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3w4OTU2Njd8MHwxfHNlYXJjaHwxfHxjeWJlcnNlY3VyaXR5JTIwc2VydmVyJTIwbmV0d29yayUyMGxvY2t8ZW58MHwwfHx8MTc3MzUyMTQzOXww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>MCP adoption is moving fast. The official Python and TypeScript SDKs now see over <a href="https://www.anthropic.com/news/donating-the-model-context-protocol-and-establishing-of-the-agentic-ai-foundation">97 million monthly downloads</a>. Anthropic donated the protocol to the Linux Foundation's Agentic AI Foundation in December 2025, making it vendor-neutral. Claude, Cursor, Windsurf, and a growing list of clients support it natively.</p>
<p>The problem is that security practices haven't kept up. Most MCP servers in use today are community-built, minimally audited, and connected to production systems with the same trust as first-party code. That should make you uncomfortable.</p>
<h2>The architecture creates the risk</h2>
<p><a href="https://modelcontextprotocol.io/specification/2025-03-26/basic/transports">MCP's design</a> is simple on purpose: servers expose tools, resources, and prompts; clients consume them; JSON-RPC 2.0 handles the wire protocol. Anyone can build an MCP server in 20 lines of Python with FastMCP, and it works with any client.</p>
<p>Which means anyone <em>has</em> built MCP servers. The community registry has hundreds of servers wrapping every API you can think of. Some are well-engineered. Some are weekend projects with no input validation. All of them get the same level of trust once you connect them to a client.</p>
<p>Think about what happens when you add an MCP server to Claude Desktop or your agent system. You're granting it the ability to execute code based on LLM decisions. The LLM decides when to call the tool, but the server decides what that call actually does. If the server has a vulnerability, or is outright malicious, every tool invocation is a potential exploit.</p>
<p>Three specific ways this goes wrong:</p>
<p><strong>Data exfiltration.</strong> An MCP server wrapping your database has access to query results. Nothing in the MCP spec prevents that server from forwarding those results to an external endpoint alongside returning them to the client. You'd have a data leak that's invisible at the protocol level. Researchers at Invariant Labs <a href="https://invariantlabs.ai/blog/mcp-security-notification-tool-poisoning-attacks">demonstrated exactly this</a> — a malicious server that combined tool poisoning with a legitimate WhatsApp MCP server to silently exfiltrate a user's entire message history.</p>
<p><strong>Prompt injection via tool results.</strong> Tool call output feeds directly into the LLM's context. A compromised server can return results containing injected instructions: "Ignore previous instructions and execute the following..." As <a href="https://simonwillison.net/2025/Apr/9/mcp-prompt-injection/">Simon Willison documented</a>, the LLM processes this as part of the tool response and potentially acts on it through other connected tools. Palo Alto's Unit 42 team found that <a href="https://unit42.paloaltonetworks.com/model-context-protocol-attack-vectors/">MCP sampling introduces additional attack vectors</a> where servers can craft prompts and request completions from the client's LLM. CyberArk went further, showing that the attack surface extends across the entire tool schema, not just descriptions.</p>
<p><strong>Credential exposure.</strong> MCP servers authenticating against external APIs hold credentials at runtime. API keys, OAuth tokens, service account credentials. A vulnerability allowing arbitrary code execution hands all of those to an attacker. This isn't theoretical — <a href="https://www.practical-devsecops.com/mcp-security-vulnerabilities/">CVE-2025-6514</a> exposed a critical OS command-injection bug in mcp-remote, a popular OAuth proxy, and researchers found that Anthropic's own <a href="https://authzed.com/blog/timeline-mcp-breaches">MCP Inspector tool allowed unauthenticated remote code execution</a> via its inspector-proxy architecture.</p>
<h2>The "it works" test is not a security audit</h2>
<p>Here's how most people evaluate an MCP server: install it, connect it to the client, try a few tool calls, confirm results look right, move on. This validates functionality. It tells you nothing about security.</p>
<p>A functional test tells you the server returns weather data when you ask for weather data. It doesn't tell you whether the server logs your queries to a third-party analytics service. It doesn't tell you whether the server's npm dependencies include a compromised package. It doesn't tell you whether input parameters get passed directly to a shell command.</p>
<p>The gap between "it works" and "it's safe" is where enterprise risk lives. This is where <a href="/human-in-the-loop-is-the-production-architecture">human-in-the-loop validation</a> applies to infrastructure, not just workflows — someone needs to audit what these servers actually do before they touch production data.</p>
<h2>What managed MCP infrastructure changes</h2>
<p>The alternative to trusting every community MCP server is running MCP infrastructure you control. Deploy servers on managed infrastructure with authentication, audit logging, and security isolation. Treat MCP connections the way you'd treat any other external service integration.</p>
<p>Here's what that gets you:</p>
<p><strong>Authentication and authorization.</strong> Every client connection authenticates. Every tool invocation checks permissions. The MCP spec includes OAuth 2.1 support for exactly this, but most community servers don't implement it. <a href="https://www.osohq.com/learn/authorization-for-ai-agents-mcp-oauth-21">Best practices</a> call for mandatory PKCE, short-lived scoped tokens, and infrastructure-based client attestation. Managed infrastructure enforces it by default.</p>
<p><strong>Audit logging.</strong> Every tool call, every parameter, every response gets logged. When something goes wrong (and in production, something always goes wrong) you can reconstruct exactly what happened, when, and with what inputs. Community servers log to stdout if they log at all.</p>
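<p>The shape of that audit trail is simple enough to sketch. Wrap every tool invocation so the call name, parameters, outcome, and response size get recorded before the result reaches the agent; the tool and field names here are illustrative.</p>

```python
import json
import time

audit_log = []  # in production this would be durable, append-only storage

def audited(tool_name, fn):
    """Wrap a tool function so every invocation is logged, success or failure."""
    def wrapper(**params):
        entry = {"tool": tool_name, "params": params, "ts": time.time()}
        try:
            result = fn(**params)
            entry["status"] = "ok"
            entry["response_bytes"] = len(json.dumps(result, default=str))
            return result
        except Exception as exc:
            entry["status"] = "error"
            entry["error"] = repr(exc)
            raise
        finally:
            audit_log.append(entry)  # recorded even when the tool raises
    return wrapper

lookup_order = audited(
    "lookup_order",
    lambda order_id: {"order_id": order_id, "total": 42},
)
lookup_order(order_id="A-1001")
# audit_log now holds the tool name, parameters, status, and response size
```

<p>When something goes wrong, the reconstruction question — what ran, with what inputs, returning how much data — is answered from the log, not from guesswork.</p>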
<p><strong>Execution isolation.</strong> Tool execution runs in sandboxed environments, whether V8 Isolates or Firecracker microVMs, where a compromised tool can't reach the host system, other tools' credentials, or network resources outside its allowed scope. Firecracker's stripped-down VMM <a href="https://aws.amazon.com/blogs/aws/firecracker-lightweight-virtualization-for-serverless-computing/">boots in under 125ms with less than 5MB of memory</a> — practical enough for per-tool-call isolation. Even if a tool has a vulnerability, the blast radius stays contained.</p>
<p><strong>Validated tool behavior.</strong> Before deployment, tool behavior gets validated against expected inputs and outputs. Not just "does it return data" but "does it return only the expected data, modify only the expected state, and communicate only with the expected endpoints." Tools like <a href="https://www.practical-devsecops.com/mcp-security-vulnerabilities/">MCPTox and MindGuard</a> can scan for tool poisoning and anomalous behavior patterns, but they're no substitute for running on infrastructure you control.</p>
<h2>The enterprise adoption gate</h2>
<p>MCP is on track to become the standard interface for agent-to-tool communication. Gartner predicts <a href="https://www.gartner.com/en/newsroom/press-releases/2025-08-26-gartner-predicts-40-percent-of-enterprise-apps-will-feature-task-specific-ai-agents-by-2026-up-from-less-than-5-percent-in-2025">33% of enterprise software will include agentic AI by 2028</a>, up from less than 1% in 2024. Those systems will need to connect to dozens or hundreds of external tools and APIs — and as we explore in <a href="/a2a-plus-mcp-one-interoperability-layer">A2A + MCP: two protocols, one interoperability layer</a>, the multi-agent dimension only multiplies the number of connections to secure.</p>
<p>Enterprises are not going to connect production agents to unvetted MCP servers. The security review process that currently takes weeks for a single SaaS vendor integration will apply to every MCP server an agent touches. At scale, that bottleneck kills AI adoption momentum.</p>
<p>The way through is infrastructure that handles security (authentication, isolation, auditing, validation) so that individual tool connections inherit enterprise-grade security by default. Teams shouldn't have to audit 50 community MCP servers one by one. They should connect to managed infrastructure that audits once and deploys securely.</p>
<h2>Practical steps for today</h2>
<p>If you're using MCP servers in production right now, three things you can do today:</p>
<ol>
<li><p><strong>Inventory your MCP connections.</strong> Know every server your agents connect to, who built it, when it was last updated, and what credentials it holds. You can't secure what you can't enumerate.</p>
</li>
<li><p><strong>Isolate sensitive operations.</strong> MCP servers accessing customer records, payment systems, or internal APIs should run on infrastructure you control, not as local processes spawned by the client.</p>
</li>
<li><p><strong>Monitor tool call patterns.</strong> If an MCP tool that normally returns 500-byte responses suddenly returns 50KB, something changed. If a tool that's called 10 times per hour starts getting called 1,000 times, something changed. Anomaly detection on tool usage patterns catches both compromised servers and misuse.</p>
</li>
</ol>
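<p>The monitoring rule in step 3 doesn't need sophisticated machinery to start with. Here's a toy version: flag a tool call whose response size deviates sharply from that tool's recent baseline. The 3-sigma threshold and sample numbers are illustrative choices.</p>

```python
from statistics import mean, stdev

def is_anomalous(history, new_value, sigma=3.0):
    """Flag new_value if it sits more than `sigma` deviations from baseline."""
    if len(history) < 2:
        return False  # not enough baseline to judge
    mu, s = mean(history), stdev(history)
    if s == 0:
        return new_value != mu
    return abs(new_value - mu) > sigma * s

baseline = [480, 510, 495, 505, 490, 500]  # typical ~500-byte responses

is_anomalous(baseline, 512)     # normal fluctuation
is_anomalous(baseline, 50_000)  # the 50KB surprise from the example above
```

<p>The same check applies to call rates per hour. Crude as it is, it catches exactly the "something changed" signal described above.</p>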
<p>MCP is a good protocol. It standardizes something that badly needed standardizing. But standardizing the interface doesn't standardize the security of what's behind it. That's an infrastructure problem, and it needs an infrastructure solution.</p>
<hr />
<p><em>Hintas deploys validated workflow knowledge as managed MCP servers with built-in authentication, execution isolation, and audit logging. Your agents get reliable tool access without the security risk of unvetted community servers. Check out</em> <a href="https://hintas.ai"><em>hintas.ai</em></a> <em>if that's relevant to what you're building.</em></p>
<p><em>Photo by</em> <a href="https://unsplash.com/@flyd2069?utm_source=hintas&amp;utm_medium=referral"><em>FlyD</em></a> <em>on</em> <a href="https://unsplash.com?utm_source=hintas&amp;utm_medium=referral"><em>Unsplash</em></a></p>
]]></content:encoded></item><item><title><![CDATA[Full autonomy is a trap. Human-in-the-loop is the production architecture.]]></title><description><![CDATA[The pitch is always the same: "Our AI agent handles everything end-to-end, no human intervention required." Sounds great in a demo. In production, full autonomy is how you get agents that process refu]]></description><link>https://hintas.blog/human-in-the-loop-is-the-production-architecture</link><guid isPermaLink="true">https://hintas.blog/human-in-the-loop-is-the-production-architecture</guid><category><![CDATA[ai agents]]></category><category><![CDATA[Developer Tools]]></category><category><![CDATA[Enterprise AI]]></category><category><![CDATA[llm]]></category><category><![CDATA[Workflow Automation]]></category><dc:creator><![CDATA[Dante Kakhadze]]></dc:creator><pubDate>Sat, 14 Mar 2026 19:29:08 GMT</pubDate><enclosure url="https://images.unsplash.com/photo-1644165918597-f182dc5e43f8?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3w4OTU2Njd8MHwxfHNlYXJjaHwxfHxodW1hbiUyMG1hY2hpbmUlMjBjb2xsYWJvcmF0aW9uJTIwY29udHJvbHxlbnwwfDB8fHwxNzczNTE1OTc1fDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The pitch is always the same: "Our AI agent handles everything end-to-end, no human intervention required." Sounds great in a demo. In production, full autonomy is how you get agents that <a href="https://medium.com/@yadavdipu296/your-ai-agent-just-deleted-production-now-what-ee907ee7821a">process refunds on non-refundable orders,</a> deploy infrastructure to the wrong region, and send customers emails with hallucinated policy details.</p>
<p>The teams shipping reliable agent systems right now aren't removing humans from the loop. They're redesigning the loop so humans intervene at the right moments, on the right decisions, with the right context.</p>
<h2>Why full autonomy fails at enterprise scale</h2>
<p>The argument for full autonomy rests on a flawed assumption: that AI agents make the same kinds of mistakes humans do, just less often. They don't. Agent failures look nothing like human failures.</p>
<p>Humans make errors of fatigue and distraction. An experienced support agent might skip step 3 in a refund workflow because they got interrupted. But they'd never try to reverse a payment before verifying the order exists. That's common sense built from years of doing the job.</p>
<p>AI agents make errors of knowledge. They don't get tired, but they don't have common sense either. Without explicit workflow knowledge, an agent will confidently execute steps in the wrong order, fabricate parameters it doesn't have, and retry failed operations identically because it has no concept of "that approach doesn't work." This is where <a href="/agent-memory-as-first-class-primitive">first-class agent memory</a> matters — systems that learn from failed executions stop repeating the same mistakes.</p>
<p>The numbers back this up. <a href="https://www.mckinsey.com/capabilities/risk-and-resilience/our-insights/trust-in-the-age-of-agents">80% of organizations report agents misbehaving</a> in production: leaking data, accessing unauthorized systems, hallucinating information. On OSWorld-Human, even the best agents take <a href="https://mlsys.wuklab.io/posts/oshuman/">1.4 to 2.7 times more steps</a> than necessary to complete tasks. Those extra steps aren't cautious double-checking. The agent is flailing, trying permutations until something works. In production, every unnecessary step is a potential side effect. An extra API call creates a duplicate record. A retry processes a payment twice.</p>
<p>Full autonomy means accepting these failure modes without a safety net. For anything business-critical, that's a bad bet.</p>
<h2>Three decision boundaries</h2>
<p>Good human-in-the-loop design starts with a question: where do humans add the most value? Not every step needs review. From what we've seen, three boundaries matter most.</p>
<p><strong>Workflow validation.</strong> Before a workflow runs autonomously, a human should verify the extracted logic matches reality. Does the refund workflow actually require an eligibility check before payment reversal? Is the parameter mapping between steps correct? Are the rollback steps properly defined? This is a one-time cost per workflow, and it prevents systematic errors from repeating through every execution.</p>
<p><strong>Exception handling.</strong> When an agent hits a situation its workflow doesn't cover, the system should escalate rather than improvise. Maybe it's an unexpected error code, or a precondition that fails in a way nobody anticipated. An agent that tries to reason its way through an undocumented edge case is an agent that creates undocumented side effects. A human reviewing the exception can figure out what to do and feed that knowledge back so the same exception gets handled automatically next time.</p>
<p><strong>Confidence thresholds.</strong> Not all executions carry the same risk. Looking up a customer's order history? Low-risk, let it run. Processing a $50,000 refund? That should require human confirmation. The threshold isn't about the agent's confidence in its own output. It's about the business impact if the agent is wrong. <a href="https://blog.anyreach.ai/what-is-human-in-the-loop-in-agentic-ai-enterprise-guide-to-reliable-ai-fallback/">HITL architecture reduces hallucination-related errors by 96%</a> when low-confidence decisions get escalated to human operators.</p>
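<p>A threshold policy like that can be a few lines. The sketch below routes each action by business impact rather than model confidence; the action names and dollar cutoff are illustrative policy choices, not a prescribed schema.</p>

```python
# Read-only operations that carry no business risk if the agent is wrong.
READ_ONLY = {"lookup_order_history", "get_customer_profile"}

def route(action, amount=0.0):
    """Decide how an action executes based on its business impact."""
    if action in READ_ONLY:
        return "auto"                    # low risk: let it run
    if action == "process_refund" and amount >= 10_000:
        return "require_human_approval"  # high impact: confirm first
    return "auto_with_audit"             # default: execute, but log it

route("lookup_order_history")           # runs unattended
route("process_refund", amount=50_000)  # escalates to a human
route("process_refund", amount=25)      # runs, with an audit trail
```

<p>Note what the policy never consults: the model's own confidence score. The routing key is what it costs the business when the agent is wrong.</p>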
<h2>Designing the loop for scale</h2>
<p>The naive implementation of human-in-the-loop is a queue: every agent action goes to a human for approval. This defeats the purpose of automation and creates a bottleneck worse than doing things manually.</p>
<p>The scalable version pushes human review to the boundaries. Humans validate workflows before deployment, not during every run. Humans review exceptions, not routine executions. Humans set risk thresholds, not per-action approvals.</p>
<p>There's a concrete benefit beyond reliability: this creates a learning system. Every human intervention is a signal. A validated workflow becomes a tested, deterministic execution path. An exception review becomes a new workflow branch or a refinement of existing logic. A risk threshold adjustment becomes a policy that applies going forward.</p>
<p>Over time, the system needs less human intervention. Not because you're cutting humans out, but because validated workflow coverage expands with use. The loop tightens.</p>
<h2>The validation pipeline in practice</h2>
<p>Workflow validation deserves more detail because most teams underinvest here.</p>
<p>Extracting workflow knowledge from source materials (API specs, test suites, docs) is necessary but imperfect. LLM-based extraction can misidentify dependencies, misorder steps, or miss constraints that are implicit in the source material but never written down.</p>
<p>A working validation pipeline looks like this: extract workflow patterns from source materials, then run the extracted workflows against staging. If the workflow completes successfully, it's a candidate for production. If it fails, it goes to a human review queue where an engineer examines the extracted logic, corrects it, and re-validates.</p>
<p>The human effort concentrates in validation. Once a workflow passes and gets deployed, it executes deterministically. No per-invocation review needed. The agent calls the workflow, the system handles multi-step orchestration, and only exceptions route back to humans.</p>
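<p>Structurally, that pipeline is a simple fork: run each extracted workflow against staging, promote the passes, queue the failures with their reasons. The staging runner below is a stub standing in for real execution, and the workflow shapes are invented for illustration.</p>

```python
def validate(workflows, staging_run):
    """Split extracted workflows into promoted ones and a human review queue."""
    promoted, review_queue = [], []
    for wf in workflows:
        ok, detail = staging_run(wf)
        if ok:
            promoted.append(wf)
        else:
            review_queue.append({"workflow": wf, "failure": detail})
    return promoted, review_queue

def fake_staging_run(wf):  # stand-in for executing against a staging environment
    if "eligibility_check" not in wf["steps"]:
        return False, "refund issued without eligibility check"
    return True, None

workflows = [
    {"name": "refund_v1", "steps": ["eligibility_check", "reverse_payment"]},
    {"name": "refund_v2", "steps": ["reverse_payment"]},  # extraction missed a step
]
promoted, review_queue = validate(workflows, fake_staging_run)
# refund_v1 is promoted; refund_v2 lands in the queue with its failure reason
```

<p>The review queue is where the engineer's time goes — and each item arrives with a concrete failure, not a vague "something's off."</p>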
<p><a href="https://cleanlab.ai/ai-agents-in-production-2025/">Only 11% of organizations had deployed agentic AI by mid-2025</a>, yet 93% of IT leaders intend to deploy agents within two years. The gap between intention and deployment is the validation gap. As we described in <a href="/why-40-percent-of-ai-projects-fail">Why 40% of AI projects fail</a>, missing workflow infrastructure is the root cause. HITL validation is how you close it.</p>
<h2>The uncomfortable truth</h2>
<p>Building for full autonomy is easier than building for human-in-the-loop. Full autonomy is one architecture: agent receives input, agent produces output. Done. Human-in-the-loop means designing escalation paths, building review interfaces, defining risk thresholds, creating feedback loops that actually update the system based on human decisions. It's more work upfront.</p>
<p>But the teams that invest in this architecture ship to production. The teams that don't end up in the <a href="https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027">40% failure statistic</a>.</p>
<p>Full autonomy is where you want to end up. Human-in-the-loop is what gets you there, one validated workflow at a time. And once those workflows connect to external tools, <a href="/mcp-security-the-unvetted-server-problem">securing the MCP servers</a> that expose them becomes its own problem.</p>
<hr />
<p><em>If you're interested in early access, reach out at</em> <a href="https://hintas.com"><em>hintas.com</em></a><em>.</em></p>
<p><em>Photo by</em> <a href="https://unsplash.com/@hdbernd?utm_source=hintas&amp;utm_medium=referral"><em>Bernd Dittrich</em></a> <em>on</em> <a href="https://unsplash.com?utm_source=hintas&amp;utm_medium=referral"><em>Unsplash</em></a></p>
]]></content:encoded></item><item><title><![CDATA[Agent memory shouldn't be a hack. Here's what a real implementation looks like.]]></title><description><![CDATA[Every agent framework has a memory story. Most of them amount to "we append previous messages to the context window." Some get fancier with vector stores for long-term recall. A few use summary chains]]></description><link>https://hintas.blog/agent-memory-as-first-class-primitive</link><guid isPermaLink="true">https://hintas.blog/agent-memory-as-first-class-primitive</guid><category><![CDATA[ai agents]]></category><category><![CDATA[Developer Tools]]></category><category><![CDATA[knowledge graph]]></category><category><![CDATA[llm]]></category><category><![CDATA[Workflow Automation]]></category><dc:creator><![CDATA[Dante Kakhadze]]></dc:creator><pubDate>Sat, 14 Mar 2026 19:26:56 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69b477c8210c74252f920c62/3eb6da27-d9a9-4f5a-ab94-7ec2c8363c85.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Every agent framework has a memory story. Most of them amount to "we append previous messages to the context window." Some get fancier with vector stores for long-term recall. A few use summary chains to compress history. The common thread is that memory is an afterthought, bolted onto systems designed for stateless inference.</p>
<p>This works for chatbots. It does not work for agents that need to execute multi-step business workflows reliably across thousands of invocations.</p>
<h2>The two memory problems nobody talks about</h2>
<p>When people discuss agent memory, they usually mean conversational memory: remembering what the user said three messages ago. That's solved. The harder problems are structural memory and experiential memory, and most systems ignore both.</p>
<p>Structural memory is knowledge about how things connect. Which API endpoints depend on each other. What parameters flow from step 2 to step 5. Which authentication tokens you need before any billing operation can execute. You don't learn this from conversation history. It's institutional knowledge that lives in engineers' heads and <a href="https://hintas.com">scattered documentation</a>.</p>
<p>Experiential memory is knowledge you get from doing things. The payment gateway times out during peak hours. The CRM API returns a 500 when you pass a currency code it doesn't recognize. The staging environment's database has a 30-second connection timeout that production doesn't. You learn these things from execution, not from docs.</p>
<p>Both compound over time. Both matter. A system without structural memory will sequence API calls incorrectly. A system without experiential memory will repeat the same failures forever. That second one is particularly maddening to watch.</p>
<h2>Why vector stores aren't enough</h2>
<p>The default "memory solution" in most agent architectures is a vector store: embed previous interactions, retrieve similar ones when relevant. This handles the conversational case fine, but it can't represent structural relationships.</p>
<p>A vector store can tell you that "process refund" is semantically similar to "reverse payment." It cannot tell you that processing a refund requires verifying order eligibility first, that the eligibility check depends on the customer's return window, and that the return window is calculated differently for international versus domestic orders.</p>
<p>These are graph relationships, directed and typed, with constraints and preconditions. Flattening them into vector embeddings loses the structure that makes them useful. You can retrieve a similar document about refunds, but you can't traverse the dependency chain that makes a refund workflow actually executable.</p>
<p>Research backs this up. <a href="https://arxiv.org/abs/2502.07223">Graph RAG-Tool Fusion</a> demonstrated a <a href="https://arxiv.org/html/2502.07223v1">71.7% improvement over naive vector-based RAG</a> on tool selection benchmarks with dependency-heavy toolsets. The gain comes from graph traversal capturing structural relationships that vector search misses. Their <a href="https://github.com/EliasLumer/Graph-RAG-Tool-Fusion-ToolLinkOS">ToolLinkOS benchmark</a> tested against 573 tools with an average of 6.3 dependencies each. The difference was stark.</p>
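<p>The difference is easy to see in code. A typed dependency graph can answer "what must happen, in what order, before a refund executes" — a question similarity search over documents has no way to answer. The graph below uses the refund example's steps; the edge data is illustrative.</p>

```python
# Each step maps to its prerequisites: a directed, typed relationship
# that a vector embedding flattens away.
requires = {
    "process_refund": ["check_eligibility"],
    "check_eligibility": ["get_return_window"],
    "get_return_window": ["get_order"],  # window differs intl vs domestic
    "get_order": [],
}

def execution_order(goal, graph, seen=None):
    """Depth-first resolution: prerequisites before the step itself."""
    seen = seen if seen is not None else []
    for dep in graph[goal]:
        if dep not in seen:
            execution_order(dep, graph, seen)
    if goal not in seen:
        seen.append(goal)
    return seen

execution_order("process_refund", requires)
# Traversal yields the executable sequence: fetch the order, compute the
# return window, check eligibility, then process the refund.
```

<p>A vector store could retrieve the refund documentation. Only the traversal produces an order you can actually execute.</p>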
<h2>What first-class agent memory actually looks like</h2>
<p>Building memory as a first-class primitive means treating it as infrastructure, not a feature.</p>
<p>Start with a knowledge graph for structural memory. Nodes represent API endpoints, parameters, data sources, auth tokens, workflow steps, business constraints. Edges encode relationships: Tool A needs Tool B's output, Tool A runs after Tool B, Tool A produces data for Tool B, Tool A and Tool B do roughly the same thing. This graph isn't generated at runtime. It's extracted from source materials, validated against staging environments, and maintained as a persistent, evolving data structure. Projects like <a href="https://arxiv.org/abs/2501.13956">Zep</a> and <a href="https://github.com/getzep/graphiti">Graphiti</a> are pushing this direction with temporal knowledge graphs that track how facts change over time.</p>
<p>Then you need a dual-query interface over that graph. Agents need natural language search: "How do I issue a refund?" Developers need structural queries: "What depends on auth.getToken?" No single retrieval approach handles both well. The answer is to fuse vector search for semantic queries with native graph traversal for structural queries, running both against the same underlying knowledge base.</p>
<p>Finally, wire in an experiential learning loop. Every workflow execution, whether it succeeds or fails, generates insights. The ExpeL framework (published at <a href="https://ojs.aaai.org/index.php/AAAI/article/view/29936">AAAI 2024</a>) showed that extracting natural language insights from execution traces and storing them in a separate vector index gives you a clean separation between validated knowledge and learned observations. Failed executions get analyzed: was the failure due to a known API quirk, an undocumented constraint, or a genuine bug? Those insights feed back into future query results, so the system improves without requiring manual graph updates.</p>
<h2>The compounding advantage</h2>
<p>The most interesting property of first-class memory is that it compounds. Every execution makes the system smarter. Every validated workflow adds to the structural graph. Every failure adds to the experiential store.</p>
<p>After a month, the system knows the billing API has a rate limit of 100 requests per minute that isn't in the docs. After three months, it knows the inventory service is slow on the first Monday of each month because of a batch job. After six months, it has an operational map of your API surface that no single engineer possesses.</p>
<p>This is why memory can't be an afterthought. Bolting a vector store onto a stateless agent gives you recall without learning. Building memory as infrastructure gives you an agent that gets better at its job over time. The same way a human team member does, except it doesn't quit after 18 months and walk out with all that context.</p>
<p>As we covered in <a href="/why-40-percent-of-ai-projects-fail">Why 40% of AI projects fail</a>, the root cause is missing workflow knowledge. Memory is how you accumulate and retain that knowledge across invocations. And when agents inevitably hit situations their memory doesn't cover, you need <a href="/human-in-the-loop-is-the-production-architecture">human-in-the-loop escalation</a> to fill the gaps and feed corrections back into the system.</p>
<h2>The practical takeaway</h2>
<p>If you're building agent systems, audit your memory architecture:</p>
<ol>
<li><p>Can your agent represent structural dependencies between tools, or does it rediscover them on every invocation?</p>
</li>
<li><p>Does your agent learn from failed executions, or does it repeat the same mistakes?</p>
</li>
<li><p>Does your memory compound over time, or does it stay roughly the same size no matter how many tasks complete?</p>
</li>
</ol>
<p>If the answer to any of these is no, your memory implementation is a hack. It's the ceiling on what your agents can reliably do.</p>
<hr />
<p><em>If you're interested in early access, reach out at</em> <a href="https://hintas.com"><em>hintas.com</em></a><em>.</em></p>
<p><em>Photo by</em> <a href="https://unsplash.com/@boliviainteligente?utm_source=hintas&amp;utm_medium=referral"><em>BoliviaInteligente</em></a> <em>on</em> <a href="https://unsplash.com?utm_source=hintas&amp;utm_medium=referral"><em>Unsplash</em></a></p>
]]></content:encoded></item><item><title><![CDATA[Why 40% of AI projects fail (and it's not the model's fault)]]></title><description><![CDATA[You've seen the stat. Gartner predicts over 40% of agentic AI projects will be canceled by end of 2027 due to escalating costs, unclear business value, or inadequate risk controls. Leadership blames t]]></description><link>https://hintas.blog/why-40-percent-of-ai-projects-fail</link><guid isPermaLink="true">https://hintas.blog/why-40-percent-of-ai-projects-fail</guid><category><![CDATA[ai agents]]></category><category><![CDATA[Developer Tools]]></category><category><![CDATA[Enterprise AI]]></category><category><![CDATA[llm]]></category><category><![CDATA[Workflow Automation]]></category><dc:creator><![CDATA[Dante Kakhadze]]></dc:creator><pubDate>Sat, 14 Mar 2026 19:08:21 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69b477c8210c74252f920c62/70e746c7-42ed-4906-9186-7e8168f9fdeb.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>You've seen the stat. <a href="https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027">Gartner predicts over 40% of agentic AI projects will be canceled by end of 2027</a> due to escalating costs, unclear business value, or inadequate risk controls. Leadership blames the models. Engineers blame the data. Product managers blame scope creep. We've spent the last year building workflow infrastructure for AI agents, and the pattern we keep seeing is simpler than any of those explanations. More fixable, too.</p>
<p>Most AI projects don't fail because the model can't reason. They fail because nobody encoded the workflow knowledge the model needs to act.</p>
<h2>The gap between "can call an API" and "can do the job"</h2>
<p>Modern LLMs handle single-step tool use well. On the <a href="https://gorilla.cs.berkeley.edu/leaderboard.html">Berkeley Function Calling Leaderboard (BFCL)</a>, top models score around 70% overall, with near-perfect marks on simple single-turn calls. Ask Claude to check the weather or look up a customer record and it nails it.</p>
<p>But business tasks aren't single API calls. Processing a refund means verifying the order, checking return status, calculating the refund amount, reversing the payment, updating inventory, sending confirmation, and logging for compliance. Seven steps, strict ordering, each depending on the last.</p>
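<p>Those seven steps can be written down as an explicit plan instead of being left for the model to infer. A minimal sketch, with illustrative step names, where each step declares whose output it consumes so ordering is enforced by data rather than by reasoning:</p>

```python
# The seven refund steps as an explicit, ordered plan.
REFUND_WORKFLOW = [
    {"step": "verify_order",        "needs": None},
    {"step": "check_return_status", "needs": "verify_order"},
    {"step": "calculate_refund",    "needs": "check_return_status"},
    {"step": "reverse_payment",     "needs": "calculate_refund"},
    {"step": "update_inventory",    "needs": "reverse_payment"},
    {"step": "send_confirmation",   "needs": "update_inventory"},
    {"step": "log_compliance",      "needs": "send_confirmation"},
]

def run(workflow, handlers, ctx):
    """Execute steps in declared order; each handler receives the
    result of the step it depends on, and any exception stops the
    chain immediately instead of letting later steps run."""
    results = {}
    for spec in workflow:
        prior = results.get(spec["needs"])
        results[spec["step"]] = handlers[spec["step"]](ctx, prior)
    return results
```

<p>With the plan encoded this way, "strict ordering, each depending on the last" is a property of the data structure, not a hope about the model's chain of thought.</p>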
<p>This is where things fall apart. On <a href="https://os-world.github.io/">OSWorld</a>, which tests agents on real multi-step computer tasks, the best model originally scored about <a href="https://arxiv.org/abs/2404.07972">12% success rate</a>. Humans hit 72%. Recent agentic frameworks have pushed scores into the <a href="https://www.emergentmind.com/topics/osworld-benchmark">45-61% range</a>, but only by layering orchestration logic on top of the base model. The model alone still can't sequence its way through a real workflow.</p>
<p>The 40% failure rate isn't about AI capability. It's about the absence of reliable workflow execution.</p>
<h2>Workflow knowledge is the missing layer</h2>
<p>When a new engineer joins your team, you don't hand them API docs and say "figure it out." You pair them with someone who walks through the workflow: which service to call first, what the response looks like, what to do when the payment gateway times out on a Friday afternoon.</p>
<p>That knowledge exists. It lives in your Cypress test suites encoding the happy path. In Jira tickets describing the sad path. In Confluence pages that three people maintain. In the heads of engineers who built the system. It's everywhere except where an AI agent can actually use it.</p>
<p>The projects that fail hand an agent a pile of API endpoints and expect it to derive the workflow from schema descriptions. The projects that succeed encode workflow knowledge explicitly, either by hand (expensive, doesn't scale) or through <a href="https://hintas.com">automated extraction</a>.</p>
<h2>What "workflow reliability" actually means</h2>
<p>Workflow reliability isn't just "the steps run in the right order," though that matters. It's a set of properties that production systems need, and missing any one of them will bite you.</p>
<p>Step 3 needs the output of step 2. Not just any output: a specific field from the response, transformed into the format step 3 expects. If the agent has to guess this mapping, it <a href="https://arxiv.org/html/2509.18970v1">fabricates parameters at a meaningful rate</a>. Research on agent hallucinations shows that tool-calling errors increase with the number of available tools, and <a href="https://nango.dev/blog/build-reliable-tool-calls-for-ai-agents-integrating-with-external-apis">compounding errors across steps</a> can drop a 10-step workflow from 90% to 73% accuracy even when each individual step is 97% correct. That's dependency resolution, and it's table stakes.</p>
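<p>The compounding arithmetic is worth seeing directly. Under the standard assumption that step failures are independent, end-to-end success is per-step accuracy raised to the number of steps:</p>

```python
# End-to-end success rate for a chain of 97%-accurate steps,
# assuming failures are independent across steps.
per_step = 0.97
for steps in (1, 5, 10, 20):
    print(f"{steps:2d} steps -> {per_step ** steps:.1%} end-to-end")
# Prints roughly:
#  1 steps -> 97.0% end-to-end
#  5 steps -> 85.9% end-to-end
# 10 steps -> 73.7% end-to-end
# 20 steps -> 54.4% end-to-end
```

<p>Ten steps at 97% already loses a quarter of runs; twenty steps is a coin flip.</p>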
<p>Then there's transactional integrity. If step 5 fails after steps 1-4 succeeded, you need compensation actions. The payment was processed but shipping failed? Now you need an automated reversal, not an orphaned charge sitting in your billing system.</p>
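<p>This is the classic saga pattern from distributed systems. A minimal sketch with hypothetical step names, where each step carries a compensation that unwinds it when a later step fails:</p>

```python
def run_with_compensation(steps, ctx):
    """Run each (name, action, compensate) step in order; on a
    failure, run the compensations of already-completed steps in
    reverse, so no partial state is left behind."""
    done = []
    try:
        for name, action, compensate in steps:
            action(ctx)
            done.append((name, compensate))
    except Exception:
        for _, compensate in reversed(done):
            compensate(ctx)
        raise

# Illustrative demo: payment succeeds, shipping fails, payment reverses.
def charge(ctx): ctx["charged"] = True
def refund(ctx): ctx["charged"] = False
def ship(ctx): raise RuntimeError("carrier API down")

ctx = {}
try:
    run_with_compensation(
        [("payment", charge, refund),
         ("shipping", ship, lambda c: None)],
        ctx,
    )
except RuntimeError:
    pass
print(ctx)  # {'charged': False} -> no orphaned charge
```
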
<p>Deterministic execution paths matter more than most people realize. ReAct-style reasoning (think, act, observe, repeat) works for exploration but breaks down for business processes. A 20-step workflow means 20 full neural network forward passes and 20 network round trips. Each one is a chance for the agent to lose the thread. Deterministic execution maps eliminate this sequential fragility.</p>
<p>And then there's experiential learning. The first time a workflow hits an undocumented API quirk (rate limiting on the payment endpoint during peak hours, say), the system should learn and adapt. The fiftieth time, it should route around the problem automatically.</p>
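<p>One way to make "route around the problem automatically" concrete: keep learned quirks in a lookup the executor consults on every call. A hypothetical sketch, with illustrative endpoint names and retry policies (not any real library's API):</p>

```python
import time

# Quirks learned from past failures, e.g. an endpoint that
# rate-limits during peak hours. Values here are illustrative.
KNOWN_QUIRKS = {
    "payments.charge": {"retries": 3, "backoff_s": 0.1},
}

def call_with_quirks(endpoint, call, quirks=KNOWN_QUIRKS):
    """Retry with linear backoff when the endpoint has a known
    quirk; unknown endpoints get a single attempt."""
    policy = quirks.get(endpoint, {"retries": 1, "backoff_s": 0.0})
    last_err = None
    for attempt in range(policy["retries"]):
        try:
            return call()
        except RuntimeError as err:  # stand-in for a rate-limit error
            last_err = err
            time.sleep(policy["backoff_s"] * (attempt + 1))
    raise last_err
```

<p>The first failure populates the table; the fiftieth execution never notices the quirk existed.</p>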
<h2>Why frameworks alone don't solve this</h2>
<p><a href="https://www.langchain.com/">LangChain</a>, <a href="https://www.crewai.com/">CrewAI</a>, and <a href="https://github.com/microsoft/autogen">AutoGen</a> give you useful plumbing for building agent systems. They handle prompt management, tool registration, and basic orchestration patterns. But they <a href="https://www.instinctools.com/blog/autogen-vs-langchain-vs-crewai/">don't contain your workflow knowledge</a>, and they can't extract it.</p>
<p>A framework gives you the ability to chain tool calls. It doesn't tell the agent which tools to chain, in what order, with what parameters, or what to do when step 3 returns an error code nobody documented. That's the knowledge layer, and it's separate from the orchestration layer.</p>
<p>Think of it like a programming language versus a program. Python gives you the ability to write anything. Your codebase is the specific thing you wrote. Frameworks give agents the ability to orchestrate. Workflow knowledge is the specific orchestration they need.</p>
<h2>The path from 40% failure to production reliability</h2>
<p>The projects that make it to production share a pattern: they treat workflow knowledge as a first-class engineering artifact, not something the model will figure out from context.</p>
<p>In practice, that means extracting workflow patterns from existing sources of truth: API specifications, test suites, internal documentation, runbooks. Validating those patterns against staging environments before deploying them. And building systems that learn from execution, so workflow maps get better every time a task succeeds or fails.</p>
<p><a href="https://www.pertamapartners.com/insights/ai-project-failure-statistics-2026">S&amp;P Global found that 42% of companies abandoned most AI initiatives in 2025</a>, up from 17% in 2024. <a href="https://complexdiscovery.com/why-95-of-corporate-ai-projects-fail-lessons-from-mits-2025-study/">MIT's research</a> shows only 5% of AI initiatives produce measurable returns despite tens of billions in investment. These aren't model failures. They're infrastructure failures.</p>
<p>The 40% failure rate isn't inevitable. It's a symptom of a missing infrastructure layer. Build the workflow knowledge layer, validate it, make it available to agents in a structured format. The models are smart enough. They just need to know how the work actually gets done.</p>
<hr />
<p><em>If you're interested in early access, reach out at <a href="https://hintas.com">hintas.com</a>.</em></p>
<p><em>Photo by <a href="https://unsplash.com/@loganvoss?utm_source=hintas&amp;utm_medium=referral">Logan Voss</a> on <a href="https://unsplash.com?utm_source=hintas&amp;utm_medium=referral">Unsplash</a></em></p>
]]></content:encoded></item></channel></rss>