When LLMs Hallucinate Architecture: Real Enterprise Integration Nightmares

Last month, a developer I mentor showed me an architecture diagram that Claude had generated for his team's new microservices migration. It was beautiful. Clean boxes, well-labeled arrows, proper separation of concerns. It included an event mesh using "Azure Service Fabric Reliable Actors with built-in saga orchestration" for distributed transactions across seven services.

There's just one problem. Azure Service Fabric Reliable Actors don't have built-in saga orchestration. That's not a feature. It has never been a feature. The LLM invented it from whole cloth, and it sounded so plausible that an entire team spent two weeks building toward an architecture that fundamentally cannot work as described.

This is what I call an architectural hallucination, and it's becoming one of the most dangerous patterns in modern software development. Not because the AI is malicious — it's not — but because architectural hallucinations are uniquely expensive. A hallucinated code snippet wastes ten minutes. A hallucinated architecture wastes months.

I've been collecting these stories for the past year, from my own projects, from teams I advise, and from the broader engineering community. The patterns are disturbingly consistent.


The Anatomy of an Architectural Hallucination

Before I walk through the war stories, it helps to understand why LLMs hallucinate architecture differently than they hallucinate code.

When an LLM generates incorrect code, you usually find out fast. You run it. It either works or it throws an error. The feedback loop is tight — minutes or hours at most.

Architecture doesn't work that way. You don't "run" an architecture. You build toward it over weeks or months, and you don't discover the fundamental flaw until you're deep enough in that changing course is painful and expensive. By the time you realize the integration pattern the AI suggested doesn't actually exist, you've already built three services around that assumption.

The other problem is that architectural hallucinations are much harder to verify through search. If an LLM tells you Array.prototype.flatMap() accepts three arguments, you can check MDN in ten seconds. If it tells you that Apache Kafka supports native exactly-once delivery across multiple consumer groups without any additional coordination, you might spend hours reading documentation before you're confident enough to say "no, that's not quite right" — and even then, you might second-guess yourself because the explanation sounded so authoritative.

LLMs hallucinate architecture because they're pattern-matching across millions of blog posts, documentation pages, and Stack Overflow answers. Real architectural patterns get blended together. Feature sets from competing products get merged. Capabilities from one version of a tool get attributed to another. And the result reads like documentation from a parallel universe where everything works the way you wish it did.


Nightmare #1: The Non-Existent API

A team I consulted for was building a real-time analytics dashboard. They asked GPT-4 to design the data pipeline, and it recommended using "Elasticsearch's native Change Data Capture (CDC) streaming API" to push index updates to the frontend via WebSockets.

Elasticsearch does not have a native CDC streaming API. A changes feed has been one of the most requested Elasticsearch features for years, but it has never shipped. What the LLM described was a mashup of Elasticsearch's percolate queries, Debezium's CDC capabilities, and maybe a bit of MongoDB's change streams thrown in for flavor.

The team spent three weeks building the WebSocket layer and the frontend real-time update system before a senior engineer finally sat down to actually implement the Elasticsearch integration and discovered that the core assumption was fiction.

The fix required rearchitecting around a proper CDC pipeline using Debezium reading from the source database, not from Elasticsearch. The three weeks of WebSocket and frontend work was mostly salvageable, but the data flow was completely different from what they'd planned.
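For reference, the entry point of a pipeline like that is a Debezium PostgreSQL source connector registered with Kafka Connect. The sketch below uses standard Debezium property names, but they vary between Debezium versions (fittingly, given Nightmare #4 below), and every hostname, credential, and table name is a placeholder:

```json
{
  "name": "orders-cdc",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "pgoutput",
    "database.hostname": "source-db.internal",
    "database.port": "5432",
    "database.user": "cdc_user",
    "database.password": "change-me",
    "database.dbname": "orders",
    "topic.prefix": "orders",
    "table.include.list": "public.orders,public.order_events"
  }
}
```

Note that the connector reads from the source database's write-ahead log, not from Elasticsearch. Elasticsearch sits at the end of the pipeline as a consumer, which is the data flow the team ended up with.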

What went wrong: Nobody verified the core integration assumption before building around it. The AI's suggestion sounded like a real product feature because it was described with the right vocabulary and the right level of specificity.


Nightmare #2: Fantasy Middleware

This one happened to me personally. I was prototyping a system that needed to bridge messages between RabbitMQ and Apache Kafka. Different teams owned different services, and we needed bidirectional flow between the two messaging systems.

I asked an LLM for integration options, and it confidently described a "RabbitMQ-Kafka Bridge Connector" that was supposedly part of the Kafka Connect ecosystem. It provided configuration examples. It described the connector's properties, including rabbitmq.queue.name, rabbitmq.exchange, and kafka.topic.mapping. It even mentioned specific version compatibility.

None of it was real. There is no official RabbitMQ-Kafka Bridge Connector in the Kafka Connect ecosystem. There are community-developed options and various third-party connectors, but the specific connector the LLM described — with those exact configuration properties — doesn't exist. It was a plausible fabrication assembled from the naming conventions of real Kafka Connect connectors and RabbitMQ concepts.

I caught this one in about twenty minutes because I went to look for the connector's documentation before writing any code. But I can easily see how a less experienced engineer, or an experienced engineer in a hurry, would have started building around this assumption.

What I actually ended up using was a custom bridge service — about 200 lines of Node.js that consumed from RabbitMQ and produced to Kafka. Simple, boring, and real.

const amqp = require("amqplib");
const { Kafka } = require("kafkajs");

const kafkaClient = new Kafka({
  clientId: "rmq-bridge",
  brokers: [process.env.KAFKA_BROKER]
});

const producer = kafkaClient.producer();

async function startBridge() {
  const connection = await amqp.connect(process.env.RABBITMQ_URL);
  const channel = await connection.createChannel();
  await producer.connect();

  await channel.assertQueue("events", { durable: true });

  channel.consume("events", async (msg) => {
    if (msg === null) return;
    try {
      await producer.send({
        topic: "bridged-events",
        messages: [{ value: msg.content.toString() }]
      });
      channel.ack(msg); // ack only after Kafka has accepted the write
    } catch (err) {
      console.error("Kafka produce failed:", err);
      channel.nack(msg); // requeue so the message is retried, not lost
    }
  });

  console.log("Bridge running");
}

startBridge().catch(console.error);

Boring code that works beats impressive architecture that doesn't exist.


Nightmare #3: The Impossible Integration

A startup founder asked an LLM to design the payment architecture for their marketplace. The response described using "Stripe Connect's built-in escrow system" to hold funds until both buyer and seller confirmed delivery. It included detailed flow diagrams and even pseudo-code for the webhook handlers.

Stripe Connect does not have a built-in escrow system. Stripe has payment intents, transfers, and the ability to build escrow-like flows using a combination of these primitives. But "Stripe Connect Escrow" as a single feature you can just turn on? That's a hallucination.
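To make that gap concrete, here is a minimal sketch of the part you actually have to build: the confirmation state machine that decides when held funds may move. Nothing in it is a Stripe feature. The function and field names are invented for illustration, and the actual charge and transfer calls are left as comments.

```javascript
// Hypothetical sketch of "escrow on top of primitives": the platform
// holds the funds and tracks confirmation state itself. The state
// machine is the part you have to build and get right.

function createEscrow(orderId, amountCents) {
  return {
    orderId,
    amountCents,
    buyerConfirmed: false,
    sellerConfirmed: false,
    released: false
  };
}

function confirm(escrow, party) {
  if (party === "buyer") escrow.buyerConfirmed = true;
  if (party === "seller") escrow.sellerConfirmed = true;

  // Funds move only when BOTH sides have confirmed. In a real system,
  // this is where you would create the transfer to the seller's account.
  if (escrow.buyerConfirmed && escrow.sellerConfirmed && !escrow.released) {
    escrow.released = true;
  }
  return escrow;
}

const escrow = createEscrow("order-42", 5000);
confirm(escrow, "buyer");
console.log(escrow.released); // still false: seller hasn't confirmed
confirm(escrow, "seller");
console.log(escrow.released); // true: funds can now be transferred
```

And this sketch ignores disputes, timeouts, partial refunds, and regulatory questions about holding funds at all, which is exactly why "escrow-like flows on Stripe primitives" is months of work rather than a checkbox.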

The founder had built their entire pitch deck, financial model, and MVP timeline around this assumption. When they brought in a payment engineer to implement it, three months of planning had to be revised.

This is maybe the most insidious type of hallucination: when the LLM takes something that's technically possible with significant custom development and presents it as a turnkey feature. The gap between "you can build escrow on top of Stripe" and "Stripe has escrow" is the gap between two weeks of work and two months of work. That gap kills startups.


Nightmare #4: Version-Shifted Capabilities

I see this one constantly. The LLM describes a real feature that exists — but attributes it to the wrong version of the tool, or describes the API as it existed three years ago before breaking changes.

One team was told they could use "Redis Streams consumer groups with exactly-once processing semantics" for their event sourcing system. Redis Streams are real. Consumer groups are real. But exactly-once processing semantics? That's not how Redis Streams work. You get at-least-once delivery with consumer groups, and you need to handle idempotency yourself. The LLM was apparently conflating Redis Streams with Kafka's exactly-once semantics.
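What "handle idempotency yourself" means in practice is a small but unavoidable piece of code like the following sketch. The names are illustrative, and the in-memory set stands in for the durable storage you would need in production, since the whole point is surviving redelivery after a crash.

```javascript
// At-least-once delivery means the same stream entry can arrive twice.
// A minimal idempotency guard: remember processed entry IDs and skip
// duplicates. In production, this set would live somewhere durable
// (for instance, back in Redis) so a restart doesn't replay work.

const processed = new Set();

function handleOnce(entryId, handler) {
  if (processed.has(entryId)) {
    return false; // duplicate delivery from the consumer group: skip it
  }
  handler();
  processed.add(entryId); // mark done only after the handler succeeds
  return true;
}

let charges = 0;
handleOnce("1700000000000-0", () => { charges++; });
handleOnce("1700000000000-0", () => { charges++; }); // redelivery: no-op
console.log(charges); // 1
```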

Another team was told to use pg_logical_slot_get_changes() for real-time replication in PostgreSQL 11. That function exists, but the specific behavior described matched PostgreSQL 15's logical replication improvements, not version 11. The team was running version 11 in production and couldn't upgrade due to a legacy extension dependency.

These version-shift hallucinations are particularly nasty because partial verification confirms them. You search for the feature name and find real documentation. You just don't notice that the documentation is for a different version than what you're running.


Common Hallucination Patterns

After collecting these stories for a year, I see clear patterns in how LLMs hallucinate architecture:

Feature merging. The LLM combines features from two or more competing products into one. "PostgreSQL's native full-text search with vector similarity ranking" blends PostgreSQL's full-text search with pgvector's capabilities as if they were a single unified feature.

Capability inflation. A tool that can do something with significant custom work gets described as having that capability built in. The Stripe escrow example above is a perfect case.

API invention. The LLM generates plausible-sounding API endpoints, configuration properties, or method signatures that don't exist. These follow the naming conventions of the real tool well enough to pass casual inspection.

Version conflation. Features from the latest version get attributed to older versions, or deprecated features get described as current.

Documentation from a parallel universe. The LLM generates what the documentation would say if the feature existed, complete with parameter descriptions, return values, and usage examples. This is the most convincing type because it reads exactly like real docs.


Why Experienced Engineers Catch These (And Juniors Don't)

This is the part that concerns me most about the current moment in software engineering.

When I read the "Azure Service Fabric saga orchestration" suggestion, something immediately felt off. Not because I've memorized Service Fabric's feature set — I haven't. But because I've worked with distributed transaction patterns for fifteen years, and I know that built-in saga orchestration in a general-purpose actor framework would be a major, widely-discussed feature. If it existed, I would have heard about it. The absence of that awareness was itself a signal.

That's what thirty years of experience buys you. Not perfect knowledge, but a well-calibrated sense of what's plausible. When someone tells me a tool can do something I've never heard of, I don't assume I missed it — I assume it might not exist and I go verify.

Junior engineers don't have that calibration yet. They don't know what they don't know. When an LLM describes a feature with confidence and specificity, they take it at face value because they have no baseline to compare against. They've never wrestled with building escrow on top of payment primitives, so they don't know enough to be suspicious when someone says it's a built-in feature.

This isn't a knock on junior engineers. They're learning. That's fine. The problem is that LLMs short-circuit the learning process by providing answers that feel authoritative before the engineer has developed enough experience to evaluate them. It's like giving a first-year medical student access to an AI that confidently recommends treatments — some will be right, some will be plausible-sounding nonsense, and the student doesn't yet have the clinical experience to tell the difference.


How to Validate AI Architectural Suggestions

Here's the verification framework I use now. It adds maybe thirty minutes to the beginning of a project, and it's saved me weeks of wasted effort.

Step 1: Verify the Core Integration Exists

Before building anything, verify that the specific integration, API, or feature the AI described actually exists. Not "does something like this exist" — does this specific thing exist.

Go to the official documentation. Not a blog post, not a tutorial, not Stack Overflow — the vendor's actual documentation. Search for the exact feature name the AI used. If you can't find it in official docs, it probably doesn't exist.

Step 2: Check the Version

If you find the feature in documentation, verify it exists in the version you're actually running. Check the changelog or release notes. Features get added and removed between versions more often than you'd think.
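Even the runtime itself is worth checking this way. Here is a trivial guard, using Node's own version as a stand-in for whatever tool your architecture actually depends on:

```javascript
// Minimal version guard: confirm what you are actually running before
// trusting docs written for a newer release. Fail loudly at startup
// rather than discovering the mismatch three weeks into the build.

function assertMinNodeVersion(minMajor) {
  const major = Number(process.versions.node.split(".")[0]);
  if (major < minMajor) {
    throw new Error(`Need Node >= ${minMajor}, but running ${process.version}`);
  }
}

assertMinNodeVersion(10); // throws immediately if the runtime is too old
```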

Step 3: Build a Spike

Before committing to an architecture, build a minimal spike that tests the core integration assumption. Not the whole system — just the part the AI described that you haven't used before. Can you actually configure that connector? Does that API endpoint actually return what the AI said it returns? Does that library actually support that pattern?

// Spike: verify that the integration works as described
// Time budget: 2 hours max. If it doesn't work by then,
// the AI's suggestion was wrong.

const client = require("the-library-ai-recommended");

async function verifyIntegration() {
  try {
    const result = await client.theMethodAIDescribed({
      theParameter: "the-value-ai-suggested"
    });
    console.log("It works:", result);
  } catch (err) {
    console.error("AI was wrong:", err.message);
    // This is where we save ourselves weeks of wasted work
  }
}

verifyIntegration();

Two hours on a spike versus two months building on a hallucination. The math is obvious.

Step 4: Cross-Reference with Experience

Ask yourself: have I ever heard of this feature? Has anyone I know used it? Is it mentioned in conference talks, mature blog posts, or production case studies? A genuinely useful feature of a popular tool leaves a footprint in the community. If you can't find that footprint, be suspicious.

Step 5: Ask the AI to Cite Sources

This is imperfect but useful. Ask the LLM: "Can you link me to the official documentation for this feature?" If it generates a URL that 404s, that's a strong signal the feature doesn't exist. If it hedges and says "I'm not sure of the exact URL," that's also a signal.


What This Means for the Industry

I'm not an AI doomer. I use LLMs every day. Claude Code is genuinely one of the most valuable tools I've added to my workflow in the past five years. But we need to be honest about the failure modes.

The architecture hallucination problem is going to get worse before it gets better, for two reasons:

First, AI tools are being marketed to exactly the people least equipped to catch hallucinations — junior developers and non-technical founders. "Just ask the AI to design your system" is becoming conventional wisdom in startup circles. And it works often enough to be dangerous, because when it fails, it fails expensively and silently.

Second, as LLMs get better at generating plausible text, the hallucinations get harder to detect. Early GPT models would sometimes suggest libraries that obviously didn't exist. Current models generate suggestions that read exactly like real documentation. The sophistication of the hallucinations is increasing faster than most people's ability to detect them.

The answer isn't to stop using AI for architecture. The answer is to treat AI architectural suggestions like you'd treat advice from a very well-read but sometimes confused colleague. Valuable input that requires verification before you build on it.

The engineers who thrive in this new world won't be the ones who use AI the most or the ones who refuse to use it at all. They'll be the ones who know when to trust it and when to verify. And right now, for architecture, the answer is: always verify.


The Simple Rule That Would Have Prevented Every Nightmare in This Article

Every single story I shared above — every wasted week, every failed integration, every architecture built on fantasy — could have been prevented by one rule:

Never build on an AI-suggested integration you haven't personally verified with a working spike.

Not "verified by reading about it." Not "verified by asking the AI to confirm." Verified by writing code that actually calls the API, configures the connector, or uses the feature. If it works, build on it. If it doesn't, you just saved yourself from a nightmare.

It takes two hours. It's boring. It's unsexy. It doesn't fit into a tweet about the future of AI-driven development.

But it works. And that's the only thing that matters.


Shane Larson is a software engineer and technical writer based in Caswell Lakes, Alaska. He runs Grizzly Peak Software and has been building enterprise integrations since before REST was a thing. For more no-nonsense engineering content, visit the Grizzly Peak Software articles.
