Database Part 5 — Redefining Accuracy, The Price of Perfection

Published: May 27, 2026

In 2024, Canva’s creator payment system was experiencing at least one incident per month. The system was counting billions of content usage records one by one in MySQL — because each count directly determined how much creators got paid. As traffic doubled every 18 months, overcounting, undercounting, and misclassification kept recurring. Every time something broke, engineers had to log into the database and manually fix corrupted data.

Canva abandoned real-time counting. Instead of tallying each record as it arrived, they switched to a structure that pulled an entire month’s data and recalculated it in minutes. Code volume dropped by half. Incidents fell to once every few months. When they relaxed per-moment accuracy, the final results became more accurate.

The cost of maintaining perfect real-time accuracy was consuming the system from within. What saved Canva was not better technology — it was redefining what “accurate” needed to mean.

This is not an unfamiliar problem in management. A CEO who orders a full inventory audit across every department every hour will paralyze the organization before ever getting an accurate number. The floor stops selling because everyone is too busy counting.

Large-scale systems do not maintain perfect consistency across every operation. The following explores where to demand absolute accuracy, where to accept approximation and delay — and why that boundary is a business decision, not a technical one.

Reference: Facebook
Reference: Canva

Different Data Carries Different Guarantees

In Part 4, the system dismantled centralized control. Read authority was delegated to replicas, data was split across shards, and the single server that once managed everything was left behind.

But distribution introduced a new problem. Amid the flood of data, requests, and concurrent reads, engineers began to realize: the bottleneck was not storage. It was the load generated by keeping everything perfectly accurate in real time.

YouTube view counts, real-time trending rankings, Instagram’s recommendation feed, ad click statistics — these are familiar examples. If every video click had to be perfectly synchronized across globally distributed databases in real time, the system would not survive the synchronization cost.

In a distributed environment, network partitions and state divergence between nodes can happen at any moment. Having every server share the same truth at all times is practically impossible. How far to maintain consistency, and where to prioritize speed and availability instead — that line must be drawn differently for each type of data, shaped by the business it serves.

The Price of Perfect Information

Imagine a director asking “How many concurrent users right now?” every second. If the entire staff had to stop working and count heads each time, the company would go bankrupt producing reports instead of products. Extracting perfect, real-time information at that frequency is expensive enough to break operational viability.

What Canva experienced was not a special case. It is a wall every growing service hits. Real-time view counts, rankings by likes, DAU aggregation, trending searches, recommendation scores: each starts as a single-line query and, as traffic scales, accumulates into system-wide latency.

Why Real-Time Aggregation Is Expensive

Fetching a single record is cheap. With an index, the database already knows where to look. As covered in Part 2, an index can locate one row among hundreds of millions with just a few disk accesses.

Aggregation works nothing like a lookup. COUNT, GROUP BY, ORDER BY: these do not find a single row. Computed live, they must visit every qualifying row and rebuild the result from scratch. An index can narrow the set of rows or spare a sort, but it cannot skip the visit itself. “Count everything” still means touching everything. As data grows, so does the scope of computation. A lookup is finding. Aggregation is recounting, every time.
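
The difference shows up in a query plan. The sketch below uses Python's built-in sqlite3 module as a stand-in for a production database; the table name `view_events` and its columns are assumptions made for illustration, and other engines word their plans differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE view_events (id INTEGER PRIMARY KEY, video_id INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO view_events (video_id) VALUES (?)",
    [(i % 100,) for i in range(10_000)],
)


def plan(sql: str) -> str:
    """Return SQLite's query plan for `sql` as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the plan text


# Point lookup by primary key: the engine jumps to one row.
lookup_plan = plan("SELECT * FROM view_events WHERE id = 4242")

# Aggregation over the table: the engine walks every row, then groups.
aggregate_plan = plan(
    "SELECT video_id, COUNT(*) FROM view_events GROUP BY video_id"
)

print("lookup:   ", lookup_plan)
print("aggregate:", aggregate_plan)
```

In SQLite the lookup is reported as a SEARCH step and the aggregate as a SCAN step. The scan grows with the table; the search barely does.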

The moment thousands of users hit the door simultaneously, the database stops being a storage layer and becomes a real-time calculator, endlessly recomputing the same numbers.

Leveling Data by Priority

Large-scale systems do not abandon accuracy entirely. They stratify it.

The moment every piece of data is held to the same real-time standard, computational load explodes alongside system scale. The system must distinguish how accurate each type of data actually needs to be.

Account balance: Not a single unit can be wrong
Payment status: Success and failure must be immediately consistent
View count: A few seconds of delay causes no harm
Recommendation feed: Roughly correct is sufficient
Ad statistics: 1% margin of error is commercially acceptable

The point is not to apply the same standard to everything.

Where absolute accuracy is required, the system pays the full price and maintains strict consistency. Where a slight margin of error is tolerable, caching, asynchronous aggregation, and approximate algorithms gradually reduce the computational burden.
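
One way to make the tiers above explicit is to write them down as a policy table that the read path consults. This is a hypothetical sketch: the `Guarantee` type, the `read_path` function, and every threshold value are assumptions chosen to mirror the list, not figures from any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guarantee:
    max_staleness_s: float  # how old a served value is allowed to be
    max_rel_error: float    # tolerated relative error (0.0 means exact)


# Illustrative thresholds only; a real system would set these per business need.
POLICY = {
    "account_balance": Guarantee(max_staleness_s=0.0, max_rel_error=0.0),
    "payment_status": Guarantee(max_staleness_s=0.0, max_rel_error=0.0),
    "view_count": Guarantee(max_staleness_s=5.0, max_rel_error=0.0),
    "recommendation_feed": Guarantee(max_staleness_s=60.0, max_rel_error=0.05),
    "ad_statistics": Guarantee(max_staleness_s=60.0, max_rel_error=0.01),
}


def read_path(kind: str) -> str:
    """Pick the cheapest read path that still honors the guarantee."""
    g = POLICY[kind]
    if g.max_staleness_s == 0.0 and g.max_rel_error == 0.0:
        return "primary"      # strict consistency, full price
    if g.max_rel_error > 0.0:
        return "approximate"  # estimates or sampled aggregates are acceptable
    return "cache"            # exact value, but allowed to lag


for kind in POLICY:
    print(f"{kind:20s} -> {read_path(kind)}")
```

The point of the table is that the expensive path becomes the exception that must be justified, rather than the default.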

Scaling a large system is not about stripping away accuracy. It is about concentrating expensive consistency only where it truly matters.

First Strategy — Cache as Compromise

No executive assistant calls every regional manager each time the CEO asks for revenue numbers. They pull the daily report from the drawer — prepared that morning. That keeps the CEO informed and everyone else alive.

Large-scale systems make the same choice. Starting with areas where slight delay or imprecision is not critical, the system stops insisting on “the perfect present state” and begins reusing previously computed results.

This is where the first strategy appears: Cache.

Why Cache Exists

The system can no longer afford the load of repeating the same computation endlessly. Cache is the first bargain the system strikes. Instead of recalculating the latest state every time, it reuses a recently computed result for a while — absorbing read requests before they reach the database, so the database no longer has to recount the present state on every call.

Facebook faced the same problem. Querying and computing a social graph of billions of users directly from the database on every request had clear limits. Building a cache layer (TAO) that held frequently accessed data in memory was not a technical showcase. It was a survival mechanism to keep the database alive under a read load that the TAO paper puts at about a billion reads per second.

The system now holds “a copy recent enough for the business to function” instead of insisting on the perfect, up-to-the-millisecond original — and breathes through massive traffic.

Reference: TAO: Facebook’s Distributed Data Store for the Social Graph (USENIX ATC ‘13)

Three Branches of Cache Strategy

Simply making a copy is not the end. When to load data into memory, and when to synchronize it after the original changes — the price shifts depending on these decisions. The system must choose what to prioritize among freshness, speed, and loss risk.

Cache Aside
```
Request ─▶ Cache ─ (MISS) ─▶ DB
                      │
                      └────▶ Cache
```
The most common pattern. The system checks cache first; if the data is missing, it fetches from the database and stores the result in cache. Simple in structure, but when the cache expires, subsequent traffic floods back to the database.
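
A minimal sketch of the pattern, assuming an in-process dictionary as the cache and a caller-supplied `load_from_db` function as the database. The class name, the TTL value, and `fake_db` are all hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class CacheAside:
    def __init__(self, load_from_db: Callable[[str], str], ttl_s: float = 30.0):
        self._load = load_from_db
        self._ttl = ttl_s
        self._entries: Dict[str, Tuple[float, str]] = {}  # key -> (expires_at, value)

    def get(self, key: str) -> str:
        entry = self._entries.get(key)
        if entry is not None and entry[0] > time.monotonic():
            return entry[1]          # HIT: the database is not touched
        value = self._load(key)      # MISS or expired: fall through to the database
        self._entries[key] = (time.monotonic() + self._ttl, value)
        return value


db_calls = []


def fake_db(key: str) -> str:
    db_calls.append(key)             # record every trip to the "database"
    return f"row-for-{key}"


cache = CacheAside(fake_db, ttl_s=30.0)
first = cache.get("video:1")         # miss, loads from fake_db
second = cache.get("video:1")        # hit, served from memory
print(first, second, db_calls)
```

Once the TTL passes, the next `get` pays the database cost again, which is exactly where the flood described above begins.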

Write Through
```
Write ───┬──▶ Cache
         │
         └──▶ DB
```
Sometimes knowing “what I’m seeing is genuinely current” matters more than read speed. Write Through addresses that by writing to both the database and cache simultaneously. The cache always reflects the latest state, but every write operation pays the cost of dual writes and added latency.
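
A minimal sketch, using two dictionaries as stand-ins for the database and the cache. Writing the database before the cache is an assumption of this sketch; the essential property is only that one write call updates both.

```python
class WriteThroughStore:
    def __init__(self):
        self.db = {}     # stand-in for the durable database
        self.cache = {}  # stand-in for the in-memory cache

    def write(self, key, value):
        self.db[key] = value     # 1. durable write
        self.cache[key] = value  # 2. cache refreshed within the same call

    def read(self, key):
        if key in self.cache:
            return self.cache[key]
        value = self.db[key]     # cold key: load once, then keep it cached
        self.cache[key] = value
        return value


wt_store = WriteThroughStore()
wt_store.write("profile:7", "v1")
wt_store.write("profile:7", "v2")
latest = wt_store.read("profile:7")
print(latest, wt_store.cache == wt_store.db)
```

Every `write` does two stores' worth of work, which is the latency cost the pattern accepts in exchange for a cache that never lags the database.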

Write Back
```
Write ──▶ Cache ── (later) ──▶ DB
```
The opposite trade: nearly eliminating write latency by accepting data loss risk. Writes land in cache first; the database receives them later in asynchronous batches. Write delay virtually disappears, but if the cache server goes down before the database sync, that data is gone permanently.
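
The loss window can be made concrete with a small simulation. Everything here is a hypothetical stand-in: two dictionaries play the cache and the database, `flush` plays the asynchronous batch, and `crash` simulates losing the cache server.

```python
class WriteBackStore:
    def __init__(self):
        self.db = {}        # stand-in for the durable database
        self.cache = {}     # stand-in for the in-memory cache
        self.dirty = set()  # keys written to cache but not yet flushed

    def write(self, key, value):
        self.cache[key] = value  # fast path: memory only
        self.dirty.add(key)

    def flush(self) -> int:
        """Write all pending keys to the database; return how many were flushed."""
        flushed = len(self.dirty)
        for key in self.dirty:
            self.db[key] = self.cache[key]
        self.dirty.clear()
        return flushed

    def crash(self):
        """Simulate losing the cache server: unflushed writes vanish."""
        self.cache.clear()
        self.dirty.clear()


wb_store = WriteBackStore()
wb_store.write("views:1", 10)
wb_store.flush()               # 10 reaches the database
wb_store.write("views:1", 11)  # 11 exists only in cache
wb_store.crash()               # cache lost before the next flush
print(wb_store.db)
```

After the crash the database still holds 10. The write of 11 was acknowledged to the caller and is now gone, which is the risk this strategy trades for speed.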

None of the three is the right answer in every case. It comes down to which price the system is prepared to pay across four axes: read efficiency, write speed, data freshness, and loss risk.

New Failures Created by Cache

When cache fails, the system finds itself in a more dangerous state than if cache had never existed. The database that had been sheltered behind cache never built the capacity to absorb explosive traffic directly.

Cache Stampede: The moment a popular cache entry expires, requests that had been absorbed by cache simultaneously flood the database. A system that was stable moments ago is suddenly overwhelmed by a burst of recomputation requests. The request collapsing technique from Part 4 compresses this stampede into a single representative query.
Thundering Herd: Threads or requests waiting on a specific event all wake up simultaneously, causing an instantaneous spike. Cache regeneration or lock release triggers a rush for resources that can destabilize the entire infrastructure.
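
One way to collapse a stampede into a single query is sketched below, assuming a thread-based server. `SingleFlight` is a hypothetical helper written for this illustration, not a standard-library class: the first caller for a key runs the loader, and concurrent callers for the same key wait for that result instead of issuing their own query.

```python
import threading
import time


class SingleFlight:
    """Collapse concurrent loads of the same key into one loader call."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = {}  # key -> shared call record

    def do(self, key, loader):
        with self._lock:
            call = self._in_flight.get(key)
            leader = call is None
            if leader:
                call = {"done": threading.Event(), "value": None, "error": None}
                self._in_flight[key] = call
        if leader:
            try:
                call["value"] = loader()   # only the leader hits the database
            except Exception as exc:
                call["error"] = exc
            finally:
                with self._lock:
                    del self._in_flight[key]
                call["done"].set()         # wake every waiting follower
        else:
            call["done"].wait()            # followers reuse the leader's result
        if call["error"] is not None:
            raise call["error"]
        return call["value"]


load_calls = []


def slow_recompute():
    load_calls.append(1)
    time.sleep(0.3)  # stands in for an expensive aggregation query
    return 42


flight = SingleFlight()
barrier = threading.Barrier(8)
results = []


def worker():
    barrier.wait()   # release all eight requests at the same moment
    results.append(flight.do("trending", slow_recompute))


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results), "responses,", len(load_calls), "database call(s)")
```

Eight requests arrive together, yet the expensive recomputation runs once. The followers pay the leader's latency, not the database's load.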

A peculiar paradox of distributed systems emerges here. Cache normally absorbs load and shields the system. But when expiration and regeneration overlap, cache becomes the most dangerous bottleneck — concentrating traffic instead of dispersing it.

Second Strategy — Approximation as an Alternative

Cache relaxed the standard for “latest data” and reduced the burden. But it still stores exact values.

Here, the system begins asking a bolder question.

“Does every value really need to be calculated to the last digit?”

When preparing refreshments for an all-hands meeting, the admin team does not survey all 5,000 employees for attendance. They estimate based on past events — roughly this many should be enough. Tolerating a small margin of error is far cheaper than conducting an exact count.

Large-scale systems make a similar choice. In certain areas, they accept a tiny margin of error and radically reduce the computation itself. The representative strategies that emerged from this approach are probabilistic data structures like Bloom Filter and HyperLogLog.

Bloom Filter

A surprising portion of large-scale traffic consists of requests for data that never existed in the first place. Lookups for nonexistent user IDs, access attempts on deleted posts, searches for invalid product codes. The problem is that the system must query the database all the way down just to confirm “not found.” The same expensive disk access and computation is spent on a request destined to fail.

Bloom Filter was designed to cut this wasted effort. At the very front of the path to the database, it makes a rapid determination: this data definitely does not exist.

Request ──▶ [ Bloom Filter ] ──┬──▶ Definitely Not Exists ──▶ Reject
                               │
                               └──▶ Possibly Exists ──▶ DB

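It is not perfect. A small chance of false positives exists: the filter may judge something present when it is not. But the reverse never happens. A key that was added to the filter is never reported as absent, so when the filter says “absent,” that verdict has zero exceptions. The guarantee holds only while every stored key is also added to the filter. A standard Bloom filter also cannot remove entries, so deleted items keep answering “possibly exists” until the filter is rebuilt.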
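
A minimal sketch of the structure, assuming a fixed-size bit array and bit positions derived from a single SHA-256 digest. The sizes, the hashing scheme, and the `user:` keys are illustrative choices, not a tuned configuration.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits: int = 8192, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive k positions from two 64-bit halves of one digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step, never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means definitely absent; True means possibly present.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


known_ids = BloomFilter()
for i in range(500):
    known_ids.add(f"user:{i}")

false_positives = sum(
    known_ids.might_contain(f"ghost:{i}") for i in range(1000)
)
print("false positives among 1000 unknown keys:", false_positives)
```

Only the keys that come back as possibly present continue on to the database. The rest are rejected from a few kilobytes of memory.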

HyperLogLog

Accurately computing metrics like DAU (Daily Active Users) presents the same challenge. At the scale of hundreds of millions of users, deduplicating and counting each individual precisely demands enormous memory and computational load.

Exact Counting - 123,456,789 Users

Memory / Computation Cost
████████████████████████████████████

        │
        └──▶ Tracks every user
        └──▶ Complete deduplication
        └──▶ High memory usage
        └──▶ High computation cost

HyperLogLog trades precision for probabilistic estimation. A common configuration uses roughly 12KB of memory, no matter how many users are counted, while keeping the typical error around 1%.

HyperLogLog - Approximately 123 Million Users

Memory / Computation Cost
████████████████

        │
        └──▶ Probabilistic estimation
        └──▶ ±1% acceptable error
        └──▶ ~12KB memory usage
        └──▶ Very fast computation
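
The estimation idea can be sketched in a few dozen lines. This is a simplified version of the original HyperLogLog estimator with only the small-range correction, written for illustration; production implementations add bias corrections and pack registers more tightly. With 16,384 registers of 6 bits each, the state would occupy about 12 KB, while this sketch spends one byte per register for readability.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p: int = 14):
        self.p = p
        self.m = 1 << p                      # 16,384 registers when p = 14
        self.registers = bytearray(self.m)   # one byte per register in this sketch

    def add(self, item: str) -> None:
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")    # 64-bit hash of the item
        idx = h >> (64 - self.p)             # first p bits choose a register
        rest = h & ((1 << (64 - self.p)) - 1)
        # Position of the leftmost 1-bit in the remaining bits (1-based).
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank       # keep only the maximum seen

    def estimate(self) -> float:
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)   # linear counting for small sets
        return raw


hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"user:{i}")
for i in range(50_000):
    hll.add(f"user:{i}")                     # repeat visits leave registers unchanged

estimate = hll.estimate()
print(f"true distinct users: 100000, estimate: {estimate:.0f}")
```

Memory stays fixed at the register array regardless of how many users pass through, and repeat visits cost nothing extra because each register only remembers a maximum.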

What matters in most businesses is not whether today’s DAU is exactly 123,456,789. It is whether the trend is rising or falling, and how much it has shifted from yesterday. That is sufficient.

What Businesses Actually Need

Yesterday   ↗
Today       ↗↗
Tomorrow    ↘

= Understanding overall trends
  matters more than perfect precision

Approximation is not a technology that discards accuracy entirely. It is a choice to radically cut computational cost only in areas where the margin has no impact on business judgment.

Third Strategy — Abandoning the Schema

When launching a new business line, following headquarters’ rigid approval forms and procedures to the letter risks missing the market entry window. In these moments, flexibility takes priority over formality.

The third strategy follows a similar logic. It relaxes the rigidity of the container that holds data, choosing to adapt to change.

The Limits of RDB

The strength of a relational database (RDB) is its strict schema and relationship-centered design. Structure is explicit and maintaining consistency is straightforward.

Relational DB

User A ─┐
User B ─┼──▶ [ id | name | age | email ]
User C ─┘

But when the pace of service change exceeds the system’s capacity to keep up, that perfect structure becomes the bottleneck. Running ALTER TABLE on a table with hundreds of millions of records to add a single field is an operational risk in itself: depending on the engine and the kind of change, the database may have to rewrite or lock the entire table while production traffic continues.

On top of that, real-world data grows increasingly unpredictable. Profile structures vary by user, feed data changes constantly, and required attributes differ across services. Forcing all of it into a single fixed table schema leads to a reversal: the system spends more resources maintaining and validating the container than storing the data itself.

NoSQL — Unstructured Databases

This is why systems moved beyond the decades-proven “row” unit. Choosing schema-free “documents” or key-value formats was not a matter of technical superiority. The system simply could no longer bear the burden of binding all data under a single set of rules.

Central, uniform control was relaxed. The order that had validated and synchronized every piece of data was set aside. Instead, each data unit was allowed its own independent structure.

Unstructured DB

User A ─────▶ { name, age, hobby }
User B ─────▶ { name, email }
User C ─────▶ { name, location, job, github }

The trade-offs are clear. JOINs weaken, duplicate data proliferates, and maintaining strong system-wide consistency becomes significantly harder.
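
Both sides of the trade can be seen in a small sketch. A plain dictionary of JSON strings stands in for a document store here, and the field values are made-up sample data.

```python
import json

store = {}  # key -> JSON text; stand-in for a document or key-value store


def put(key: str, doc: dict) -> None:
    store[key] = json.dumps(doc)


def get(key: str) -> dict:
    return json.loads(store[key])


# Each document carries its own shape; nothing enforces a shared schema.
put("user:A", {"name": "A", "age": 31, "hobby": "climbing"})
put("user:B", {"name": "B", "email": "b@example.com"})
put("user:C", {"name": "C", "location": "Seoul", "job": "engineer", "github": "c-dev"})

# Adding a field touches one document, not the whole collection.
doc_b = get("user:B")
doc_b["nickname"] = "bee"
put("user:B", doc_b)


def contact_email(key: str):
    # The reader, not the store, must now cope with missing fields.
    return get(key).get("email")


emails = {key: contact_email(key) for key in store}
print(emails)
```

The write side becomes trivially flexible. The cost moves to every reader, which has to decide what a missing `email` means because the store no longer guarantees one.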

But this is not a question of technical superiority. Scaling a large system is not about building a more perfect structure. It is a continuous process of weighing which order to maintain to the end, and which to let go — adapting to a changing environment.

The Satisficing Solution

Every strategy covered so far shares a common pattern: deliberately surrendering a piece of perfection and gaining scale and speed in return.

What is interesting is that this choice is not a phenomenon unique to distributed systems.

Herbert Simon, who received the Nobel Prize in Economics in 1978, left a sharp observation about organizational decision-making. Humans and organizations cannot collect all information, cannot compute every possible scenario, and cannot derive a perfectly optimal solution.

So real-world organizations do not pursue the optimal solution. They seek a satisficing one — a solution good enough to act on. Moving with a sufficient answer beats being paralyzed in the search for a perfect one.

Reference: Herbert Simon, “Models of Bounded Rationality” (1982)

Every technology discussed in this article stands on exactly this principle.

Cache chose to respond quickly with a slightly delayed answer, rather than guaranteeing the latest state at every moment. Bloom Filter focused on filtering out unnecessary queries at the front gate instead of verifying existence to the very end. HyperLogLog accepted an estimation approach tolerating small error margins rather than counting every user precisely. NoSQL chose a storage model that adapts flexibly to change, rather than binding all data within a single relational order.

None of these failed to achieve perfection. They calculated the computational cost and the risk of stalling that perfection would demand, and deliberately chose to set it aside. In the end, distributed systems — like human organizations — operate on sustainable compromise rather than absolute completeness.

The Boundary of Guarantees

The core of scaling is not about doing something better. It is about the ability to decide what to concede.

Some data must remain accurate to the end
Some data is sufficient as a rough approximation
Some data can afford to be reflected a few seconds late
Some data can tolerate being stored redundantly

Drawing that boundary is the essence of large-scale system design.

The Bottom Line

Scaling a system is not an exercise in technical perfectionism. It is a pragmatic process of deciding how much imprecision the business can tolerate in order to survive. Once the insistence on universal correctness is relaxed, the system gains room to scale.

Cache adjusted the standard of freshness. Bloom Filter relieved the burden of perfect existence checks. HyperLogLog compromised on exact counts. And NoSQL set aside a portion of the relational order that traditional databases had defended to the last.

Scaling a large system is not the process of strengthening perfection. It is the process of knowing which guarantees are worth their cost.

Next: The system has been scattered and loosened. Now it faces the paradox of having to complete a single transaction again. What if the payment succeeds but inventory deduction fails? What if a coupon is consumed but the order is cancelled? When the business demands a single result once more — distributed transactions.