Chain-of-Thought vs. Agent Loops: Speed Tests on Real SaaS Features
I've been building SaaS features two different ways lately, and the performance gap between them is wider than I expected.
On one side: chain-of-thought prompting, where you craft a single detailed prompt that walks the model through reasoning step by step. On the other: agentic loops, where you give an AI agent a goal and let it iterate — writing code, running tests, fixing errors, and repeating until it gets there.
Both approaches work. But they work very differently depending on the task, and I've got the timing data to prove it.
Defining the Two Approaches
Before I get into numbers, let me be precise about what I mean, because these terms get thrown around loosely.
Chain-of-thought (CoT) prompting is a single API call (or a small, fixed number of calls) where you instruct the model to reason through a problem step by step before giving a final answer. You design the prompt, you control the reasoning structure, and you get one shot at the output. If the output is wrong, you redesign the prompt and try again manually.
```javascript
var Anthropic = require("@anthropic-ai/sdk");

var client = new Anthropic();

// Single-shot CoT: one API call with the reasoning steps baked into the prompt.
function cotApproach(task) {
  return client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 4000,
    messages: [{
      role: "user",
      content: "I need you to build the following feature. " +
        "Think through it step by step before writing code.\n\n" +
        "Step 1: Identify the data model needed\n" +
        "Step 2: Design the API endpoints\n" +
        "Step 3: Write the implementation\n" +
        "Step 4: Add error handling\n" +
        "Step 5: Write the final, complete code\n\n" +
        "Feature: " + task
    }]
  });
}
```
Agentic loops are iterative systems where the model takes an action, observes the result, and decides what to do next. Think Claude Code, Cursor's agent mode, or a custom loop you build yourself. The model might write code, execute it, see an error, fix the code, run it again, and keep going until the task is complete.
```javascript
var Anthropic = require("@anthropic-ai/sdk");

var client = new Anthropic();

// Minimal agent loop: ask the model for the next action, apply it, repeat.
function agentLoop(task, maxIterations) {
  var context = "Task: " + task;
  var iteration = 0;

  function iterate() {
    if (iteration >= maxIterations) {
      return Promise.resolve({ status: "max_iterations", context: context });
    }
    iteration++;
    return client.messages.create({
      model: "claude-sonnet-4-20250514",
      max_tokens: 4000,
      messages: [{
        role: "user",
        content: context + "\n\nDecide what to do next. " +
          "If the task is complete, respond with DONE. " +
          "Otherwise, provide the next action."
      }]
    }).then(function (response) {
      var output = response.content[0].text;
      if (output.indexOf("DONE") !== -1) {
        return { status: "complete", context: context, iterations: iteration };
      }
      // In a real loop you'd execute the proposed action here (run the code,
      // apply the file edit) and append the observed result to the context.
      context += "\n\nIteration " + iteration + ": " + output;
      return iterate();
    });
  }

  return iterate();
}
```
The fundamental difference: CoT gives you predictable cost and latency but caps the model's ability to recover from mistakes. Agent loops give you better outcomes on complex tasks but at unpredictable cost and time.
The Test Setup
I picked five real SaaS features — things I've actually built for my projects — and implemented each one using both approaches. I timed everything, tracked token usage, and evaluated the output quality.
The features tested:
- REST API endpoint — A CRUD endpoint for managing newsletter subscribers with validation
- Database migration — Add a new column with a default value and update existing records
- Search with filtering — Full-text search across articles with category and date filters
- Email notification system — Send templated emails on specific events with retry logic
- Admin dashboard widget — A statistics card showing key metrics with data aggregation
Environment:
- Model: Claude Sonnet 4 for both approaches
- Agent loop: Custom implementation with file write and Node.js execution capabilities
- CoT: Single prompt with structured reasoning instructions
- Each test was run three times, median values reported
The Results
Here's the raw timing data. These numbers include the full cycle from prompt submission to having working, tested code.
Feature 1: REST API Endpoint
| Metric | Chain-of-Thought | Agent Loop |
|--------|-----------------|------------|
| Time to first output | 12 seconds | 8 seconds |
| Time to working code | 12 seconds | 45 seconds |
| Total tokens used | 3,200 | 11,400 |
| Estimated cost | $0.03 | $0.10 |
| Manual fixes needed | 1 | 0 |
Winner: CoT for speed, Agent Loop for correctness
The CoT approach produced a complete CRUD endpoint in one shot. It was 90% correct — I had to fix one validation edge case where it wasn't checking for duplicate email addresses. The agent loop took nearly four times longer but caught that edge case on its own because it actually ran the code and tested it.
For a feature this straightforward, I'd use CoT every time. Fixing one small bug manually takes 30 seconds. Waiting an extra 33 seconds for the agent to find it automatically isn't a good trade.
Feature 2: Database Migration
| Metric | Chain-of-Thought | Agent Loop |
|--------|-----------------|------------|
| Time to first output | 8 seconds | 6 seconds |
| Time to working code | 8 seconds | 28 seconds |
| Total tokens used | 1,800 | 7,200 |
| Estimated cost | $0.02 | $0.06 |
| Manual fixes needed | 0 | 0 |
Winner: CoT decisively
Database migrations are well-defined, atomic operations. The CoT prompt produced a perfect migration script on the first try. The agent loop worked too, but it wasted time creating a test database, running the migration, verifying the schema change, and rolling it back — all unnecessary for something this simple.
Feature 3: Search with Filtering
| Metric | Chain-of-Thought | Agent Loop |
|--------|-----------------|------------|
| Time to first output | 18 seconds | 10 seconds |
| Time to working code | 18 seconds | 2 minutes 15 seconds |
| Total tokens used | 4,100 | 22,300 |
| Estimated cost | $0.04 | $0.19 |
| Manual fixes needed | 3 | 0 |
Winner: Agent Loop
This is where it got interesting. The CoT approach produced a search implementation that looked right but had three subtle bugs: it didn't properly escape user input in the search query, it failed to handle the case where no filters were applied, and the date range filter used the wrong comparison operator.
The agent loop caught all three because it actually tested the code with various inputs. It wrote test cases, ran them, saw failures, and fixed them iteratively. Two minutes and fifteen seconds is a lot longer than eighteen seconds, but those three bugs would have cost me an hour of debugging if they'd made it to production.
For features with meaningful state interactions and edge cases, the agent loop earns its time premium.
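To make those three fixes concrete, here's a hedged sketch of what corrected filter-building code might look like: parameterized placeholders instead of string interpolation, an explicit no-filter path, and inclusive date comparisons. The table and column names (`articles`, `category`, `published_at`) are my illustrative assumptions, not the code from the test.

```javascript
// Build a parameterized search query with optional filters.
// Schema names are hypothetical, for illustration only.
function buildSearchQuery(term, category, fromDate, toDate) {
  var conditions = [];
  var params = [];
  // Placeholders ($1, $2, ...) instead of interpolation: no escaping bugs.
  if (term) {
    params.push("%" + term + "%");
    conditions.push("title ILIKE $" + params.length);
  }
  if (category) {
    params.push(category);
    conditions.push("category = $" + params.length);
  }
  if (fromDate) {
    params.push(fromDate);
    conditions.push("published_at >= $" + params.length); // >=, not >
  }
  if (toDate) {
    params.push(toDate);
    conditions.push("published_at <= $" + params.length);
  }
  // Handle "no filters applied" explicitly instead of emitting a bare WHERE.
  var where = conditions.length ? " WHERE " + conditions.join(" AND ") : "";
  return { sql: "SELECT * FROM articles" + where, params: params };
}
```

The point isn't this particular query builder; it's that all three bug classes only surface when the code runs against real inputs, which is exactly what the agent loop did.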
Feature 4: Email Notification System
| Metric | Chain-of-Thought | Agent Loop |
|--------|-----------------|------------|
| Time to first output | 15 seconds | 9 seconds |
| Time to working code | 15 seconds | 3 minutes 40 seconds |
| Total tokens used | 3,800 | 28,600 |
| Estimated cost | $0.03 | $0.24 |
| Manual fixes needed | 2 | 1 |
Winner: Neither — both had issues
Email systems are inherently hard to test without actually sending emails. The CoT approach produced clean code but missed retry logic for transient failures and didn't handle the case where the SMTP connection times out. The agent loop built a more robust implementation with retry logic, but it still had one issue — it used a synchronous sleep in the retry backoff instead of a proper async delay.
This is a case where neither approach was sufficient on its own. The agent loop got closer, but the unpredictable nature of external service integration means you're going to need human review regardless.
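The fix for that last bug is small but easy to get wrong. Here's a minimal sketch of async retry with exponential backoff, where `sendEmail` is a stand-in for whatever actually talks to SMTP (a hypothetical function, not the tested code):

```javascript
// Promise-based delay: yields to the event loop instead of blocking it.
function delay(ms) {
  return new Promise(function (resolve) { setTimeout(resolve, ms); });
}

// Retry a promise-returning function with exponential backoff.
// sendEmail is hypothetical; any function returning a Promise works.
function sendWithRetry(sendEmail, maxAttempts) {
  var attempt = 0;
  function tryOnce() {
    attempt++;
    return sendEmail().catch(function (err) {
      if (attempt >= maxAttempts) { throw err; }
      // Backoff doubles each attempt: 500ms, 1s, 2s, ...
      return delay(500 * Math.pow(2, attempt - 1)).then(tryOnce);
    });
  }
  return tryOnce();
}
```

A synchronous sleep in the same spot would freeze the entire Node.js process during each backoff window, which is exactly the issue the agent's version had.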
Feature 5: Admin Dashboard Widget
| Metric | Chain-of-Thought | Agent Loop |
|--------|-----------------|------------|
| Time to first output | 14 seconds | 8 seconds |
| Time to working code | 14 seconds | 1 minute 50 seconds |
| Total tokens used | 3,500 | 15,800 |
| Estimated cost | $0.03 | $0.13 |
| Manual fixes needed | 1 | 0 |
Winner: Agent Loop for quality, CoT for speed
The CoT version produced a working widget but the SQL aggregation query was inefficient — it used multiple subqueries instead of a single query with conditional aggregation. The agent loop produced the optimized version because it ran the queries and noticed the performance difference.
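For readers unfamiliar with the pattern, here's the shape of the difference, with a made-up `subscribers` table standing in for the real schema (the actual widget's queries aren't shown here):

```javascript
// Illustrative only — hypothetical schema, not the widget's actual SQL.
// Multiple subqueries: the table is scanned once per metric.
var subqueryVersion =
  "SELECT " +
  "  (SELECT COUNT(*) FROM subscribers WHERE status = 'active') AS active, " +
  "  (SELECT COUNT(*) FROM subscribers WHERE status = 'pending') AS pending, " +
  "  (SELECT COUNT(*) FROM subscribers) AS total";

// Conditional aggregation: one scan computes every metric.
var conditionalVersion =
  "SELECT " +
  "  SUM(CASE WHEN status = 'active' THEN 1 ELSE 0 END) AS active, " +
  "  SUM(CASE WHEN status = 'pending' THEN 1 ELSE 0 END) AS pending, " +
  "  COUNT(*) AS total " +
  "FROM subscribers";
```

Both return the same numbers; only the second does it in a single pass, which is the kind of difference that only shows up when you actually run the queries.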
The Pattern That Emerged
After running all five tests, a clear pattern showed up:
CoT wins when:
- The feature is well-defined with clear inputs and outputs
- There are few edge cases or state interactions
- Speed matters more than perfection
- The feature is something the model has seen thousands of times in training data (CRUD endpoints, migrations, simple utilities)
- You're an experienced developer who can spot and fix issues quickly
Agent loops win when:
- The feature has complex state interactions or subtle edge cases
- Correctness matters more than speed
- The feature involves integrating multiple systems
- You'd spend more time debugging CoT output than the agent loop costs in extra time
- You're exploring an unfamiliar domain where you might miss issues yourself
Here's my rule of thumb: if I can fully specify the feature in under 200 words, CoT is faster. If the spec requires caveats, conditions, and "but also handle the case where…" clauses, the agent loop pays for itself.
Cost Analysis: It's Not Just Token Prices
The token cost difference is real but misleading. Let's do the actual math for a realistic week of building SaaS features.
Say I build 15 features in a week (mix of small and medium complexity):
Pure CoT approach:
- 10 simple features × $0.03 = $0.30
- 5 complex features × $0.04 = $0.20
- Manual debugging time: ~3 hours (fixing CoT mistakes on complex features)
- Total API cost: $0.50
- Total time: ~4 hours (including debugging)
Pure Agent Loop approach:
- 10 simple features × $0.10 = $1.00
- 5 complex features × $0.20 = $1.00
- Manual debugging time: ~30 minutes
- Total API cost: $2.00
- Total time: ~3 hours (more API time, less debugging)
Hybrid approach (what I actually do):
- 10 simple features via CoT × $0.03 = $0.30
- 5 complex features via Agent Loop × $0.20 = $1.00
- Manual debugging time: ~45 minutes
- Total API cost: $1.30
- Total time: ~2.5 hours
The hybrid approach costs $0.80 more per week than pure CoT but saves roughly 1.5 hours of debugging time. If your time is worth more than $0.53 per hour — and I really hope it is — the hybrid approach wins.
How I Decide in Practice
When I sit down to build a feature, I run through a quick mental checklist:
Is the feature a standard pattern? (CRUD, auth, simple query)
→ Yes: CoT. One prompt, one output, move on.
Does it involve data transformations with edge cases?
→ Yes: Agent loop. Let it find the edge cases for me.
Am I integrating with an external API I haven't used before?
→ Agent loop. Let it read docs and iterate.
Is it a UI component with no backend logic?
→ CoT. UI components are well-represented in training data.
Does it need to handle concurrent access or race conditions?
→ Agent loop. These bugs are nearly invisible in static code review.
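For what it's worth, the checklist collapses to a few lines of code. This is a toy heuristic with field names I've made up, not a real classifier:

```javascript
// Toy encoding of the decision checklist. All fields are hypothetical flags.
function chooseApproach(feature) {
  if (feature.concurrency || feature.unfamiliarApi) { return "agent"; }
  if (feature.edgeCaseHeavy) { return "agent"; }
  if (feature.standardPattern || feature.uiOnly) { return "cot"; }
  // Fall back to the 200-word rule of thumb from earlier.
  return feature.specWords < 200 ? "cot" : "agent";
}
```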
For my work on Grizzly Peak Software, building out the library feature with search, filtering, and pagination was firmly in agent loop territory. Multiple database queries, URL parameter handling, pagination edge cases, empty state handling — there were a dozen ways it could break silently. The agent found issues I wouldn't have caught until users hit them.
For the ad rotation system, where I needed a simple endpoint to serve a random ad and track impressions? Pure CoT. Write the prompt, get the code, ship it.
The Tooling Gap
One thing that struck me during these tests: the tooling for CoT prompting is mature and well-understood. You write a prompt, you get a response, you evaluate it. Straightforward.
The tooling for agent loops is still rough. I've used Claude Code extensively, and it's excellent, but building your own custom agent loops requires significant infrastructure:
```javascript
var fs = require("fs");
var path = require("path");
var child_process = require("child_process");
var Anthropic = require("@anthropic-ai/sdk");

var client = new Anthropic();

function buildFeature(spec, workDir, maxTurns) {
  var messages = [];
  var turn = 0;
  var tools = [
    {
      name: "write_file",
      description: "Write content to a file",
      input_schema: {
        type: "object",
        properties: {
          path: { type: "string" },
          content: { type: "string" }
        },
        required: ["path", "content"]
      }
    },
    {
      name: "run_command",
      description: "Run a shell command and return output",
      input_schema: {
        type: "object",
        properties: {
          command: { type: "string" }
        },
        required: ["command"]
      }
    }
  ];

  messages.push({
    role: "user",
    content: "Build this feature in " + workDir + ": " + spec +
      "\nWrite the code, run tests, and fix any issues."
  });

  function callModel() {
    return client.messages.create({
      model: "claude-sonnet-4-20250514",
      max_tokens: 4000,
      messages: messages,
      tools: tools
    }).then(processResponse);
  }

  function processResponse(response) {
    if (response.stop_reason === "end_turn") {
      return { status: "complete", messages: messages };
    }
    // Safety guard: cap the number of tool-use turns.
    turn++;
    if (turn >= maxTurns) {
      return { status: "max_turns", messages: messages };
    }
    var toolResults = [];
    response.content.forEach(function (block) {
      if (block.type !== "tool_use") { return; }
      var result;
      if (block.name === "write_file") {
        // Resolve relative paths against workDir, matching run_command's cwd.
        fs.writeFileSync(
          path.resolve(workDir, block.input.path),
          block.input.content
        );
        result = "File written successfully";
      } else if (block.name === "run_command") {
        try {
          result = child_process.execSync(
            block.input.command,
            { cwd: workDir, timeout: 30000 }
          ).toString();
        } catch (err) {
          // err.stderr can be undefined (e.g. on timeout), so fall back.
          result = "Error: " + (err.stderr ? err.stderr.toString() : err.message);
        }
      }
      toolResults.push({
        type: "tool_result",
        tool_use_id: block.id,
        content: result
      });
    });
    messages.push({ role: "assistant", content: response.content });
    messages.push({ role: "user", content: toolResults });
    return callModel();
  }

  return callModel();
}
```
That's a non-trivial amount of infrastructure code. You need file I/O, command execution, error handling, iteration limits, and safety guards. For most indie hackers, using an existing agent tool like Claude Code is more practical than building custom loops.
What I Got Wrong Initially
When I first started using agent loops heavily, I made the mistake of using them for everything. Every feature, every bug fix, every small change. The result was predictable: I spent more time waiting for the agent to iterate through simple problems than it would have taken me to just write the code directly.
The other mistake: using CoT prompts that were too vague. "Build a user authentication system" is not a good CoT prompt. "Build a session-based authentication system using Express.js with bcrypt password hashing, a PostgreSQL users table with email and hashed_password columns, login/logout/register routes, and middleware that checks for a valid session on protected routes" — that's a CoT prompt that actually produces usable output on the first try.
The quality of your CoT output is directly proportional to the specificity of your prompt. With agent loops, you can be vaguer because the agent will ask clarifying questions or discover requirements through iteration. That's a genuine advantage when you're still figuring out what you want.
The Bottom Line
Neither approach is universally better. Anyone telling you to "just use agents for everything" is selling you something. Anyone telling you that careful prompt engineering makes agents unnecessary hasn't tried building a feature with fifteen edge cases using a single prompt.
The winning strategy for indie hackers building real SaaS products in 2026:
- Default to CoT for well-understood, standard features. It's faster, cheaper, and predictable.
- Switch to agent loops when the feature has complex interactions, unfamiliar APIs, or edge cases you can't enumerate upfront.
- Always review the output regardless of approach. Neither CoT nor agent loops produce code you should ship without reading.
- Track your time honestly. If you're spending 30 minutes debugging CoT output for a feature that an agent loop would have nailed in 2 minutes, you're optimizing for the wrong metric.
The models are getting better at both approaches with every release. But right now, in early 2026, knowing when to use which approach is a genuine competitive advantage. It's the kind of meta-skill that separates developers who ship from developers who tinker.
Shane Larson is a software engineer with 30+ years of experience, writing from a cabin in Caswell Lakes, Alaska. He runs Grizzly Peak Software and AutoDetective.ai, and has published a book on training large language models. His agent loops have a 10-iteration safety limit because he learned the hard way what happens without one.