OS Part 1 — Why Servers Slow Down: The Real Cost of Concurrency
Published: June 15, 2026
In 2025, a production server started receiving 10,000 requests per minute. Response times shot from 200ms to over 40 seconds. The engineering team increased the connection pool from 50 to 100, then to 150. Nothing improved. CPU utilization sat at just 20%, yet a thread dump revealed that most threads were stuck in TIMED_WAITING — doing absolutely nothing.
The same pattern had been documented more than 25 years earlier. In 1999, Dan Kegel posed the question that would define a generation of server architecture: “Can a single server handle 10,000 concurrent connections?” At the time, Apache assigned one process per connection. That worked fine for a few hundred connections. But once connections reached the thousands, the CPU began spending more time switching between processes than serving actual requests.
Reference: Thread Pool Bug: 200ms to 40s Timeouts (2025)
Reference: Dan Kegel, “The C10K Problem” (1999)
Reference: Cloudflare: Why We Chose NGINX
A quarter-century separates these two incidents, but they hit the same wall. In both, the intuition that more workers means more output was the very thing that brought the system down.
In the database series, we traced disk I/O, indexes, transactions, and distributed systems along a single thread: costs never disappear. Operating systems take that principle and push it further: they hide costs, defer them, and act as though they don’t exist. One CPU appears to juggle thousands of tasks at once. Behind that act is an enormous coordination effort, and a price for sustaining it.
This article is about the point where that coordination cost overtakes the work itself.
Why Servers Actually Slow Down
When a factory falls behind on orders, hiring more workers is common sense. But in both incidents above, that common sense collapsed. More workers made things worse, not better.
And the strangest part: CPU utilization was only 20%. The workers weren’t overloaded. They weren’t even busy. So what was actually causing the slowdown?
Computing vs. Waiting
A computer only ever does two things with its time: compute, or wait.
The CPU wasn’t refusing to work. It was spending most of its time waiting — with nothing to do.
If the workload is constant computation — video encoding, encryption, ML training — then CPU speed is the bottleneck. Adding more workers hits a physical wall. This is a CPU Bound workload.
But most servers look nothing like that. A request arrives, the server queries a database, calls an external API, reads from disk. During all of that, the CPU sits idle. The bottleneck isn’t computation. It’s waiting. This is an I/O Bound workload.
[ CPU Bound ]
CPU ████████████████████████████ (computing 100%)
I/O                              (waiting 0%)
→ CPU is the bottleneck. A faster processor is the answer.
---
[ I/O Bound ]
CPU ██░░░░░░██░░░░░░██░░░░░░██ (computing 20%)
I/O   ██████  ██████  ██████   (waiting 80%)
→ Waiting is the bottleneck. The CPU is mostly idle.

Most real-world web servers fall into the second category. The CPU computes for a tiny fraction of each request. The rest is waiting.
That’s what happened to the production server from the 2025 incident. Its threads were locked in TIMED_WAITING. The CPU wasn’t idle by choice — it had no work to pick up.
Ironically, this is also what makes the operating system’s trick possible. Because the CPU spends most of its time waiting, the OS can fill those idle gaps with other work and make it look like everything runs at once.
Is the CPU computing, or waiting? That question determines every design decision that follows.
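One rough way to answer that question for a piece of code is to compare wall-clock time with CPU time. The sketch below is only an illustration: `measure`, `cpu_bound_task`, and `io_bound_task` are hypothetical names, and `time.sleep` stands in for a database or network call.

```python
import time


def measure(task):
    """Return (wall_seconds, cpu_seconds) spent running task()."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    task()
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return wall_seconds, cpu_seconds


def cpu_bound_task():
    # Pure computation: the CPU is busy for the whole duration.
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total


def io_bound_task():
    # Assumption: sleep stands in for waiting on a database or remote API.
    # The thread is blocked and consumes almost no CPU time.
    time.sleep(0.2)


if __name__ == "__main__":
    for name, task in [("cpu-bound", cpu_bound_task), ("io-bound", io_bound_task)]:
        wall, cpu = measure(task)
        print(f"{name}: wall={wall:.3f}s cpu={cpu:.3f}s busy={cpu / wall:.0%}")
```

When CPU time is close to wall time, the work is compute-heavy. When CPU time is a small fraction of wall time, the work is mostly waiting.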
The Paradox of Adding Workers
When a server slows down, adding threads feels like the obvious fix. More workers, more throughput.
But as the number of workers grows, the system becomes less productive, not more. Eventually, more resources are spent coordinating work than performing it.
The First Cost: The Weight of Shift Changes
Physical cores are finite. Pack thousands of threads onto a handful of cores, and each worker gets a vanishingly small slice of time at the workbench. The OS must constantly rotate workers in and out.
This is where the first cost appears. Before a worker can hand over the workstation, they must record exactly where they left off. The incoming worker must then reload that state and figure out where to continue.
As more workers are added, a growing share of the factory’s resources is spent preparing for handovers rather than doing actual work. Eventually, the workers stop producing and spend most of their time updating paperwork.
The operating system calls this context switching.
[ Context Switch ]
Thread A running ──▶ [save state] ──▶ [restore state] ──▶ Thread B running
                     │                              │
                     └──────── this window ─────────┘
                          "zero useful work done"

The Second Cost: Contention Over Shared Resources
But the cost doesn’t stop at shift changes. When workers share the same workspace, a second overhead stacks on top.
If two workers modify the same data at the same time, they can corrupt it. To prevent this, the system enforces exclusive access: one worker at a time. This is a lock.
[ Lock Contention ]
Thread 1 ──▶ [lock acquired] ──▶ working ...
Thread 2 ──▶ [waiting ─────────────────▶] ──▶ lock acquired
Thread 3 ──▶ [waiting ──────────────────────────────▶] ──▶ lock acquired
Thread 4 ──▶ [waiting ─────────────────────────────────────────▶]

More threads ↑ → longer queues ↑ → less concurrency benefit ↓

More workers, longer lines. The workers hired to boost throughput end up standing in queues. This gridlock stacks on top of context switching.
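The queueing effect can be reproduced in a few lines. This is a minimal sketch, not a benchmark: `run_with_lock` is a hypothetical helper, and the sleep inside the lock is an assumed stand-in for work on shared data.

```python
import threading
import time


def run_with_lock(num_threads, hold_seconds=0.01):
    """Start num_threads workers that each take the same lock once.

    Returns (final_counter, elapsed_seconds).
    """
    lock = threading.Lock()
    counter = 0

    def worker():
        nonlocal counter
        with lock:                    # only one thread may be inside at a time
            time.sleep(hold_seconds)  # assumed stand-in for work on shared data
            counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter, time.perf_counter() - start


if __name__ == "__main__":
    for n in (1, 4, 16):
        count, elapsed = run_with_lock(n)
        print(f"{n:>2} threads: counter={count} elapsed={elapsed:.3f}s")
```

The counter is always correct, but elapsed time grows roughly in proportion to the thread count. Because every worker needs the same lock, the threads run one after another no matter how many are started.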
The Third Cost: Spilling Into Hardware
The first two costs are software-level coordination problems. But when worker counts grow large enough, the damage reaches hardware.
CPUs keep frequently used data in high-speed caches near the core. When workers rotate too quickly, each new worker finds the cache filled with the previous worker’s data — useless. They must reload from main memory, which is far slower. As threads multiply, this reloading compounds. The CPU spends more time fetching data than computing. This is what cache misses look like at scale.
The Inversion Point
At first, more threads means more throughput. But past a certain point, each new thread costs more to coordinate than the work it produces.
Throughput
▲
│          ┌─── peak
│         /│
│        / │
│       /  │
│      /   │ \
│     /    │  \
│    /     │   \   ← coordination cost > actual work
│   /      │    \
│  /       │
└─────────────────────────▶ Thread count
  Beyond the optimum: more threads = less throughput

Past the peak, context switching, lock contention, and cache pollution overwhelm useful work. CPU utilization hits 100%. The server looks maximally busy. But very little useful work is actually getting done.
A busy CPU and a productive CPU are not the same thing.
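The shape of that curve can be reproduced with a toy model. The sketch below uses Gunther's Universal Scalability Law, which is a modeling choice made here and not something the incidents above measured. The `contention` and `coherency` coefficients are invented for illustration, and both function names are hypothetical.

```python
def relative_throughput(n, contention=0.05, coherency=0.001):
    """Universal Scalability Law: throughput of n workers relative to one.

    contention: share of work that is serialized (for example, waiting on a lock)
    coherency:  pairwise cost of keeping workers in sync, which grows with n*(n-1)
    Both default values are made up for illustration.
    """
    return n / (1 + contention * (n - 1) + coherency * n * (n - 1))


def best_thread_count(max_threads, **params):
    """Return the thread count in 1..max_threads with the highest modeled throughput."""
    return max(range(1, max_threads + 1),
               key=lambda n: relative_throughput(n, **params))


if __name__ == "__main__":
    for n in (1, 4, 16, 64, 256):
        print(f"{n:>3} threads -> {relative_throughput(n):.2f}x")
    print("modeled peak at", best_thread_count(256), "threads")
```

With these made-up coefficients, the model climbs, peaks, and then falls as the pairwise term takes over. Real coefficients have to be fitted from load-test measurements of the actual system.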
Does Adding People Make Things Faster?
In 1975, Frederick Brooks analyzed why the most intuitive fix for a late software project always backfires. When deadlines slip, managers add people. But adding people makes the project later.
The reason: as the team grows, the time spent aligning with each other rises faster than the time spent working. With n people, the number of communication paths grows as n(n-1)/2, roughly with the square of the team size.
Reference: Frederick Brooks, “The Mythical Man-Month” (1975)
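The pair-counting formula is easy to check directly. The helper name `communication_paths` is hypothetical and exists only for this sketch.

```python
def communication_paths(n):
    """Number of distinct pairs among n people: n(n-1)/2."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    for people in (3, 5, 10, 100):
        print(f"{people:>3} people -> {communication_paths(people)} paths")
```

Going from 10 people to 100 multiplies the headcount by 10 but the number of paths by 110 (from 45 to 4,950).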
[ The Explosion of Communication Paths ]
3 people → 3 paths
A────B
 ╲   │
  ╲  │
   ╲ │
     C

5 people → 10 paths
A───B
│╲ ╱│
│ ╳ │
│╱ ╲│
C───D
 ╲ ╱
  E

10 people → 45 paths ...

The Mythical Man-Month, OS Edition
This organizational law operates at the operating system level just the same. Adding a thread gains one worker. But the moment that worker competes for shared resources, coordination cost rises with it.
[ The Mythical Man-Month, OS Edition ]

| Threads | Productive Work | Coordination Overhead |
|---|---|---|
| 4 | 85% | 15% |
| 16 | 60% | 40% |
| 64 | 30% | 70% |
| 256 | 10% | 90% |

Adding threads to a struggling server is the equivalent of throwing developers at a late project. Past the tipping point, more threads means less throughput.
The Constraint That Never Goes Away
In the database series, we saw how individually rational transactions can collectively paralyze a system: the fallacy of composition. Thread pools follow the same law.
Every strategy that increases concurrency by adding workers carries a cost that scales with the number of workers: coordination cost. Costs never disappear. Execution units have grown lighter over the decades. The coordination cost itself has never been eliminated.
Three Responses to the Same Constraint
Coordination cost cannot be eliminated. So how has the industry tried to contain it?
If the CPU is idle, fill the gap with other work. That’s the premise of concurrency. The real question is how to do it without letting coordination costs devour the gains. The answer came in three stages — each sacrificing something different in exchange for lower overhead.
Processes — Safety at Maximum Cost
The first priority was safety. One task crashes, the rest must survive. That demands hard boundaries.
A process gives each task its own memory, its own file table — a completely independent workspace.
[ Process ]
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│  Process A   │ │  Process B   │ │  Process C   │
│ ┌────────┐   │ │ ┌────────┐   │ │ ┌────────┐   │
│ │ Memory │   │ │ │ Memory │   │ │ │ Memory │   │
│ └────────┘   │ │ └────────┘   │ │ └────────┘   │
│  Code        │ │  Code        │ │  Code        │
│  Stack       │ │  Stack       │ │  Stack       │
│  File Table  │ │  File Table  │ │  File Table  │
└──────────────┘ └──────────────┘ └──────────────┘
 Full isolation   Full isolation   Full isolation

This isolation is how Nginx survives a worker crash without losing the entire service.
But isolation is expensive. Independent memory spaces, the heaviest context switches of any execution unit. Maximum safety. Maximum coordination cost.
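Crash isolation between processes can be seen with nothing but the standard library. This is a minimal sketch in which `run_isolated` is a hypothetical helper and the one-line child programs are invented for illustration. It shows the general property of process boundaries, not how Nginx manages its workers.

```python
import subprocess
import sys


def run_isolated(source):
    """Run Python source in a separate OS process and return its exit code."""
    result = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,  # keep the child's output out of the parent's
    )
    return result.returncode


if __name__ == "__main__":
    crashed = run_isolated("raise RuntimeError('worker crashed')")
    healthy = run_isolated("print('worker finished')")
    # The parent is still running and can inspect both outcomes.
    print("crashed worker exit code:", crashed)
    print("healthy worker exit code:", healthy)
```

The child's failure reaches the parent only as an exit code. The price is visible too: every call starts a whole new interpreter process with its own memory.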
Threads — Trading Safety for Speed
If maintaining separate workspaces costs too much, what if workers shared one? That’s the tradeoff behind the thread.
Same heap, same code — only the stack is separate.
[ Thread ]
┌──────────────────────────────────────┐
│              Process A               │
│                                      │
│  ┌──────────────────────────────┐    │
│  │     Shared Memory (Heap)     │    │
│  └──────────────────────────────┘    │
│                                      │
│  ┌────────┐  ┌────────┐  ┌────────┐  │
│  │Thread 1│  │Thread 2│  │Thread 3│  │
│  │ Stack  │  │ Stack  │  │ Stack  │  │
│  └────────┘  └────────┘  └────────┘  │
└──────────────────────────────────────┘
  Shared memory, separate stacks

No duplication means far cheaper creation and switching. This is why Java thread pools became the standard for high traffic.
But sharing breeds conflict. Concurrent modifications corrupt data. One worker’s fatal error brings down the entire workspace. The cost of being lightweight: isolation is gone.
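The shared-heap, separate-stack split looks like this in practice. This is an illustrative sketch with hypothetical names (`run_shared`, `worker`): each thread keeps its own local variable, while all of them write into one shared list.

```python
import threading


def run_shared(num_threads):
    """Run num_threads workers that all write into one shared list."""
    shared_results = []      # one object on the heap, visible to every thread
    lock = threading.Lock()  # sharing is what makes this lock necessary

    def worker(worker_id):
        local_value = worker_id * 10  # local variable: private to this thread
        with lock:
            shared_results.append(local_value)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(shared_results)


if __name__ == "__main__":
    print(run_shared(4))
```

No data is copied between workers, which is the saving. The lock is the other side of the trade: once memory is shared, access to it has to be coordinated.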
Coroutines — Moving the Cost Out of the OS
Even threads hit a wall at tens of thousands of concurrent connections. Each thread still reserves its own stack, typically hundreds of kilobytes or more, and the OS scheduler’s switching overhead across all of them creates a hard ceiling.
The question: can you get massive concurrency without paying the OS-level coordination tax?
Coroutine-based systems answered by moving scheduling into the application runtime. Go’s goroutines run hundreds of thousands of lightweight workers on a small pool of OS threads. Java 21’s Virtual Threads follow the same approach.
[ Thread ]
Thread A ──▶ OS Scheduler ──▶ Thread B
The OS decides when to switch, whether the thread is ready or not.
---
[ Coroutine ]
Coroutine A ──▶ suspend
                   │
                   ▼
          Application Scheduler
                   │
                   ▼
              Coroutine B
In a purely cooperative model, execution transfers only when the coroutine yields.

This is the first trick in our operating system series. A CPU core still executes only one task at a time. But whenever a task blocks on I/O, it gives up its turn and another takes its place. To an outside observer, it appears as though hundreds of thousands of tasks are running simultaneously.
To make that possible, goroutines and virtual threads place a lightweight scheduler on top of a small pool of OS-managed threads. The operating system continues to manage only a limited number of kernel threads, while the application runtime handles the scheduling of vast numbers of lightweight coroutines.
But the cost never disappeared. Part of the coordination work once handled by the operating system has simply been pushed into the application runtime.
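The same idea can be sketched with Python's `asyncio`, used here only as a stand-in for goroutines and virtual threads, whose runtimes are more sophisticated. The names `handle_request`, `serve`, and `run` are hypothetical, and `asyncio.sleep` is an assumed placeholder for a non-blocking I/O wait.

```python
import asyncio
import threading
import time


async def handle_request(request_id, wait_seconds):
    # Awaiting hands control back to the event loop (the application-level
    # scheduler), which then resumes another coroutine.
    await asyncio.sleep(wait_seconds)  # assumed stand-in for non-blocking I/O
    return request_id, threading.get_ident()


async def serve(num_requests, wait_seconds):
    tasks = [handle_request(i, wait_seconds) for i in range(num_requests)]
    return await asyncio.gather(*tasks)


def run(num_requests, wait_seconds=0.05):
    """Return (completed_requests, os_threads_used, elapsed_seconds)."""
    start = time.perf_counter()
    results = asyncio.run(serve(num_requests, wait_seconds))
    elapsed = time.perf_counter() - start
    thread_ids = {thread_id for _, thread_id in results}
    return len(results), len(thread_ids), elapsed


if __name__ == "__main__":
    done, threads_used, elapsed = run(1000)
    print(f"{done} requests on {threads_used} OS thread(s) in {elapsed:.2f}s")
```

All of the waits overlap on a single OS thread, so the total time stays far below the sum of the individual waits. The scheduling work has not vanished; the event loop now does it instead of the kernel.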
The Spectrum
| Execution Unit | What You Gain | What You Lose | Example |
|---|---|---|---|
| Process | Full isolation | Heaviest creation and switching costs | Nginx workers |
| Thread | Lightweight switching, shared memory | No isolation, shared-resource conflicts | Java thread pools |
| Coroutine | Extreme lightness, minimal switching cost | Depends on yield points; a blocking call the runtime cannot intercept stalls the thread underneath and every coroutine queued on it | Go goroutines, Java Virtual Threads |
These three are not a technological progression. They are successive attempts to contain the same constraint. Costs never disappear. From processes to threads to coroutines, only the shape of the cost changed.
The Bottom Line
Servers slow down for one of two reasons: the CPU is busy computing, or stuck waiting. Most web servers fight the second problem.
To fill those waiting gaps, operating systems evolved progressively lighter execution units. Each made concurrency cheaper to achieve. None made coordination free. Context switching, lock contention, and cache pollution scale with every worker added.
The real challenge was never making the CPU busier. It was filling idle time with useful work without letting coordination costs consume the gains.
Making a CPU busy is easy. Making it productive is the hard part.
Next: What if, instead of adding more workers, you changed how they wait? The next article explores the shift from blocking to non-blocking, from select to epoll: the evolution of I/O models designed to cut idle time.