The Day My "Index Optimization" Melted Our Database 🔥
How removing a seemingly broken SQL condition made our CPU spike to 100% and the InnoDB architecture lesson hiding underneath.
It was a regular Monday morning. I was sipping my chai ☕, scanning through our monitoring dashboards, when I noticed something that had been bugging me for weeks: a SQL condition that was obviously wrong.
Many queries in our multi-tenant SaaS application had this filter:
WHERE (tenant_id = 42 OR tenant_id IS NULL)
AND org_id = 100
AND is_active = 1
AND is_deleted = 0

That "OR tenant_id IS NULL" was sitting in every single query, across many tables. And I knew from years of database experience that OR conditions are index killers. Every DBA will tell you:
"An OR with IS NULL prevents MySQL from using indexes efficiently."
So I did what any reasonable engineer would do. I removed it.
I pushed the fix. Deployed it. And then watched our RDS CPU climb from a comfortable 60% to a screaming 100% 📈🔥.
I had just optimized our database into a meltdown 🔥.
🏗️ The Setup: A Multi-Tenant Nightmare
Let me give you some context. We run a B2B SaaS platform - think of it like a simplified ERP system. Every customer (tenant) has their own data, but it all lives in the same database. The data model looks something like this:
CREATE TABLE orders (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
tenant_id INT,
org_id INT,
order_date DATE,
amount DECIMAL(10,2),
is_active TINYINT(1) DEFAULT 1,
is_deleted TINYINT(1) DEFAULT 0,
-- ... 30 more columns
INDEX (tenant_id),
INDEX (org_id),
INDEX (is_active)
);

The orders table had ~200,000 rows. Not huge, but not trivial either. And here's the kicker: for any given tenant, a query would typically return about 50,000 rows out of those 200,000. That's a 25% selectivity ratio. Remember this number. It's going to matter a lot.
Now, some tables in our system had "shared" or "global" records: things like default payment terms, standard tax rates, common bank entries. These records had tenant_id = NULL because they belonged to everyone. But somewhere along the way, a developer had added the OR tenant_id IS NULL condition to every model's query, not just the ones with shared records.
It was sloppy. It was clearly wrong. And it was my job to clean it up.
💥 The "Fix" That Broke Everything
The change was surgical. I went through the codebase and updated the query builder:
Before (every model):
WHERE (tenant_id = 42 OR tenant_id IS NULL)
AND org_id = 100
AND is_active = 1
AND is_deleted = 0

After (transactional models only):
WHERE tenant_id = 42
AND org_id = 100
AND is_active = 1
AND is_deleted = 0

Clean. Precise. The kind of change that should make indexes happy.
I deployed it on a Tuesday morning. By Tuesday afternoon, our Slack was on fire 🔥.
"The dashboard is crawling."
"API timeouts across the board."
"RDS CPU is pegged at 100%."
I checked New Relic. Query times were actually slightly faster: 1-3 seconds versus the previous 2-5 seconds. But CPU had gone through the roof. How could faster queries use more CPU?
I reverted the change. CPU dropped back to ~60%. Queries went back to being slow but stable.
I sat there staring at my screen, feeling like I'd just witnessed a violation of the laws of physics 🤯.
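In hindsight, a quick EXPLAIN on both query shapes would likely have flagged the plan flip before deploy. A sketch (the exact plan depends on your table statistics; the expected values in the comments are illustrative):

```sql
-- With the OR condition: expect type = ALL, i.e. a full table scan.
EXPLAIN SELECT * FROM orders
WHERE (tenant_id = 42 OR tenant_id IS NULL)
  AND org_id = 100 AND is_active = 1 AND is_deleted = 0;

-- Without it: expect type = ref on the tenant_id index, but with a
-- large "rows" estimate. A huge rows count on an index plan is the
-- warning sign that random lookups are about to dominate.
EXPLAIN SELECT * FROM orders
WHERE tenant_id = 42
  AND org_id = 100 AND is_active = 1 AND is_deleted = 0;
```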
🕳️ Down the Rabbit Hole: InnoDB's Dirty Secret
Hereâs where things get interesting. To understand what happened, you need to understand something most developers never think about: how InnoDB actually stores your data.
InnoDB doesnât just dump rows into a file. It maintains two separate data structures for every table:
Structure 1: The Clustered Index (a.k.a. "The Actual Table")
In InnoDB, the primary key isn't just an index - it is the table. Rows are physically stored sorted by the primary key. When you do a full table scan, you're reading this structure sequentially, cover to cover, like reading a book.
Structure 2: Secondary Indexes (Your Regular Indexes)
Every other index, like INDEX(tenant_id) or INDEX(org_id), is a secondary index. And here's the crucial part: secondary indexes don't contain the full row data. They only contain:
The indexed column(s)
A pointer back to the primary key
That's it. No org_id, is_active, or amount. Just the indexed column and a PK reference.
This design choice has a profound consequence.
💸 The Two-Lookup Tax
When MySQL uses a secondary index to find rows, it has to do two lookups for every single match:
Look up the secondary index → finds matching PKs
Look up the clustered index → retrieves the actual row data using each PK
This second step is called a "bookmark lookup" or "clustered index lookup." And it's not a sequential read, it's a random seek. The primary keys coming out of the secondary index are scattered across the clustered index, so for each match, InnoDB has to jump to a completely different location on disk.
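A toy model of that two-lookup pattern, with both structures as plain dicts (illustrative only; real B-trees, pages, and the buffer pool are far more involved, and the row values here are made up):

```python
# Clustered index: PK -> full row, conceptually sorted by PK.
clustered = {pk: {"tenant_id": 42 if pk % 4 == 0 else 7,
                  "org_id": 100, "amount": 10.0}
             for pk in range(1, 201)}

# Secondary index on tenant_id: the indexed column -> list of PKs ONLY.
# No org_id, no amount - just pointers back to the primary key.
secondary = {}
for pk, row in clustered.items():
    secondary.setdefault(row["tenant_id"], []).append(pk)

# Lookup 1: scan the secondary index -> matching PKs.
matching_pks = secondary.get(42, [])

# Lookup 2: for EACH PK, jump back into the clustered index for the row.
rows = [clustered[pk] for pk in matching_pks]  # one "bookmark lookup" per match

print(len(matching_pks), "PKs found,", len(rows), "clustered-index lookups")
```

In a dict this second hop is cheap; on disk, each one is a separate B-tree descent to a random page, which is the whole point of the story that follows.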
Let me paint this picture clearly.
What Happened BEFORE My Change (Full Table Scan)
With "OR tenant_id IS NULL" in the query, MySQL's optimizer looked at the query and thought: "This OR condition makes the index useless. I'll just scan the whole table."
Full Table Scan:
Read row 1 → check conditions → keep/discard
Read row 2 → check conditions → keep/discard
Read row 3 → check conditions → keep/discard
... (200,000 rows, read sequentially)
I/O Pattern: Sequential (like reading a book from start to finish)
CPU cost: LOW (just comparing values)
Disk cost: HIGH but efficient (sequential reads)
Wall time: 2-5 seconds
Sequential I/O is remarkably efficient. The disk head moves in one direction, reading contiguous blocks. The CPU barely breaks a sweat; it's just checking if (tenant_id == 42) for each row. Most of the time, the CPU is waiting for the disk to deliver the next chunk.
This is an I/O-bound operation. The bottleneck is disk speed, not CPU.
What Happened AFTER My Change (Index Scan)
Without the OR condition, MySQL suddenly had a usable index on tenant_id. So the optimizer switched strategies: "Great, I can use the tenant_id index! Let me find all rows where tenant_id = 42."
Index Scan + Clustered Index Lookups:
Step 1: Scan tenant_id index → find 50,000 matching PKs
Step 2: For EACH of those 50,000 PKs:
  → Random jump to clustered index
  → Read full row
  → Check: org_id = 100? is_active = 1? is_deleted = 0?
  → Keep or discard
I/O Pattern: Random (like flipping to 50,000 random pages in a book)
CPU cost: HIGH (coordinating 50,000 random seeks)
Disk cost: 50,000 individual random reads
Wall time: 1-3 seconds
Each of those 50,000 random seeks requires the CPU to:
Navigate the B-tree index structure
Calculate the physical disk location
Issue the I/O request
Handle the I/O completion interrupt
Manage the InnoDB buffer pool
This is a CPU-bound operation. The CPU is constantly busy coordinating random I/O.
🧮 The Counter-Intuitive Math
Here's where it clicks. Let me lay out the numbers:
The index scan was faster in wall-clock time, so each query returned results quicker. But it consumed 5x more CPU doing it. Under concurrent load (dozens of these queries running simultaneously), the CPU saturated, everything started queuing, and the whole system ground to a halt.
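A back-of-envelope version of that trade-off (the per-operation unit costs below are invented for illustration, not measured values; only the row counts come from our actual workload):

```python
TOTAL_ROWS = 200_000
MATCHING_ROWS = 50_000  # ~25% selectivity

# Hypothetical relative CPU costs per operation (arbitrary units).
CPU_PER_SEQ_CHECK = 1      # compare one row during a sequential scan
CPU_PER_RANDOM_SEEK = 20   # descend the B-tree, issue the I/O,
                           # handle the completion, touch the buffer pool

full_scan_cpu = TOTAL_ROWS * CPU_PER_SEQ_CHECK        # checks every row
index_scan_cpu = MATCHING_ROWS * CPU_PER_RANDOM_SEEK  # one seek per match

print(f"index scan burns {index_scan_cpu / full_scan_cpu:.0f}x the CPU")
```

Fewer rows touched, but each touch is an order of magnitude more expensive for the CPU to coordinate. Multiply by dozens of concurrent queries and the saturation point arrives fast.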
🎯 The Selectivity Problem
This leads to one of the most important rules in database performance that many engineers miss:
Selectivity = (Rows returned) / (Total rows in table)
< 5% selectivity: Index scans are clearly better. Few lookups = low overhead.
5-15% selectivity: Grey zone. Depends on hardware, data distribution, index type.
> 15% selectivity: Full table scans often beat single-column index scans.
Our queries had 25% selectivity. We were firmly in "table scan wins" territory for single-column indexes. MySQL's optimizer actually knew this, and that's why it chose table scans when the OR condition was present. We accidentally forced it into an inferior plan by making the index "usable."
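The rule of thumb above is easy to encode. A sketch (the thresholds are this article's heuristics; a real optimizer uses a full cost model, not fixed cutoffs):

```python
def selectivity_regime(rows_returned: int, total_rows: int) -> str:
    """Classify a query by the article's rule-of-thumb thresholds."""
    s = rows_returned / total_rows
    if s < 0.05:
        return "index scan wins"
    if s <= 0.15:
        return "grey zone"
    return "full table scan often wins"

# Our workload: 50,000 of 200,000 rows -> 25% selectivity.
print(selectivity_regime(50_000, 200_000))
```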
✅ The Real Fix: Composite Indexes
So if single-column indexes are the problem, what's the solution? You don't go back to table scans; instead, you use composite indexes.
A composite index includes multiple columns in a single index structure:
CREATE INDEX idx_orders_tenant_org_active_deleted
ON orders(tenant_id, org_id, is_active, is_deleted);

Here's what this changes. Instead of a secondary index that only knows about tenant_id:
Single-column index:
Find tenant_id = 42 → 50,000 rows
Then: 50,000 random PK lookups to check other conditions
The composite index knows about all four filter columns:
Composite index on (tenant_id, org_id, is_active, is_deleted):
Navigate to tenant_id = 42
  └─ Navigate to org_id = 100
     └─ Navigate to is_active = 1
        └─ Navigate to is_deleted = 0
           └─ Only 1,000 rows match ALL conditions!
Then: Only 1,000 PK lookups (not 50,000!)
The filtering happens inside the index itself. By the time MySQL needs to do PK lookups to fetch full row data, it's already narrowed down from 50,000 to 1,000 rows. That's a 50x reduction in random I/O.
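A way to see why this works: a B-tree on (tenant_id, org_id, is_active, is_deleted) is ordered like a sorted list of tuples, so an equality match on all four columns is one contiguous range. A sketch (the data distribution is made up to mirror the article's 50,000 → 1,000 numbers):

```python
import bisect

# Composite index entries: (tenant_id, org_id, is_active, is_deleted, pk),
# kept in sorted order, exactly as a B-tree would store the key.
entries = sorted(
    (42, 100 if i % 50 == 0 else 200, 1, 0, i) for i in range(50_000)
)

# All four filter columns form a prefix of the key, so we can binary-search
# straight to the matching range instead of probing rows one by one.
lo = bisect.bisect_left(entries, (42, 100, 1, 0, -1))
hi = bisect.bisect_right(entries, (42, 100, 1, 0, float("inf")))
pks_to_fetch = [e[4] for e in entries[lo:hi]]

print(f"{len(pks_to_fetch)} PK lookups instead of {len(entries)}")
```

Only the rows surviving all four filters ever pay the clustered-index lookup tax; the rest are rejected inside the index without touching the table at all.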
🐘 Wait, But Would This Happen in PostgreSQL?
This is where things get engine-specific. Everything I've described is a characteristic of InnoDB's clustered index architecture. PostgreSQL handles things differently, and the same scenario would play out differently there.
InnoDB (MySQL): Clustered Index Architecture
The table is the primary key index (clustered)
Secondary indexes store PK pointers, not row data
Every secondary index lookup requires a second hop to the clustered index
This "double lookup" is the root cause of our CPU spike
PostgreSQL: Heap-Based Architecture
The table is stored as a heap, where rows live in insertion order, not sorted by PK
All indexes (including the primary key) are secondary indexes
Indexes point to a physical row location (tuple ID / ctid) on the heap
There's no "clustered index lookup" step because the index points directly to the row's physical location
InnoDB:
Secondary Index → PK value → Clustered Index → Row Data
(Two B-tree traversals per row)
PostgreSQL:
Index → Physical Row Location (ctid) → Row Data
(One B-tree traversal + one heap fetch per row)
So in PostgreSQL, our 50,000 index matches would each require one hop to the heap, not two hops through the clustered index. The CPU overhead would be lower.
But PostgreSQL has its own trade-off. Since the heap isn't sorted by any index, all index lookups involve random heap access. There's no concept of a "clustered scan" where the index order matches the physical data order (unless you explicitly use CLUSTER, which is a one-time rewrite whose ordering isn't maintained as new rows arrive).
In InnoDB, if your query scans the primary key in order, you get beautiful sequential reads because the data is physically sorted by PK. In PostgreSQL, even a primary key scan can involve random heap access.
🧠 Key Takeaways
"OR kills indexes" is true, but that might be saving you. If your single-column indexes would cause worse performance than a table scan, the OR condition is accidentally keeping you in the better execution plan.
Selectivity is everything. If your query returns more than 15% of a table's rows, a single-column index scan can be worse than a full table scan due to random I/O overhead.
Understand your storage engine. InnoDB's clustered index architecture means every secondary index lookup pays a "double-hop" tax. This tax is invisible when selectivity is low but devastating when selectivity is high.
Composite indexes are the real answer. They don't just make queries faster, they fundamentally change the math by filtering inside the index before touching the main table.
Monitor CPU, not just query latency. A query that's 2x faster but uses 5x more CPU is a net negative under concurrent load.
Know your engine's differences. The same schema optimization can have very different effects on InnoDB vs PostgreSQL due to their fundamentally different storage architectures.
The next time someone tells you "just add an index," ask them: "What kind of index, and what's your query's selectivity?" The answer might surprise you.







