Corruption in CREATE INDEX CONCURRENTLY
Two weeks back we discovered an ancient bug in PostgreSQL which may cause index corruption when index is built via CREATE INDEX CONCURRENTLY (aka CIC). All supported releases of PostgreSQL are affected by this bug. The bug could be a source of index corruption reports that we receive from the field once in awhile.
Now before we discuss the problem, let me make it absolutely clear that PostgreSQL is known for its gold standards as far as code quality and product stability is concerned. But just like any other software, the database engine can also have unnoticed bugs. But PostgreSQL community takes every known bug very seriously and try to fix it on priority. No wonder that the bug fix went into the scheduled 9.6.2 release (and all other stable releases) at a very short notice, but not before community discussed and deliberated pros and cons of adding a crucial bug fix very close to a planned release.
To understand origin of the bug, you must first understand how CREATE INDEX CONCURRENTLY works. So if you haven’t, read my previous blog on the topic.
Relation Cache
You would also need to know about another technical detail. PostgreSQL engine internally uses something known as “Relation Cache” or relcache to store information about database objects, such as tables and indexes. These objects belong to system catalogs which are nothing more than a bunch of tables, but created in a system schema and whose presence is well known to the database engine. These catalogs are accessed very frequently since almost everything that get executed in the engine need access to the catalogs. It would be grossly damaging if these catalogs are scanned and queried like normal tables, given how frequently they are accessed. Also, catalogs don’t change too often. So information about various database objects is cached so that it can be quickly accessed without going to the database. There are variety of caches, but the cache that stores information about tables, their attributes, index columns etc is called a relcache. Caches are maintained in backend-local private memory and they are invalidated whenever catalog information is changed.
CIC Bug: Origin, Symptoms and Effects
This particular bug affects CIC in presence of concurrent updates where updates are changing from HOT to non-HOT update precisely because of the new index being added. This of course does not happen very often since users typically try to avoid creating an index which will make HOT updates impossible. But if it does happen, then the bug may prevent CIC from inserting necessary index entries into the new index. Or incorrectly assume that the index entry added during the first phase of CIC can be used to identify the tuple (at this point, you may want to read my previous blog on the working of CIC and HOT). You would notice that HOT updates skip adding new index entries if none of the index columns are being modified. The updated versions are still reachable from the old index entries. Now obviously when a new index is added, the view of indexed columns changes if the new index is indexing new columns. CIC has special mechanism to deal with that, as explained in the previous blog. But what if that’s broken? What if we can’t reliably detect the set of indexed columns as new index is added and concurrent update are executed? PostgreSQL might think that an UPDATE is a HOT update and avoid indexing that tuple. It might assume that the existing index entry is enough to reach the updated tuple, but that’s not correct because the indexed columns have changed. That’s exactly what happened, resulting in the corruption.
If the race condition materialise, you might find that you can’t access certain rows via the index or index access may return wrong rows. For example:
CREATE TABLE testtab (a int UNIQUE, b int, c int); -- Insert data and then run UPDATE statements INSERT INTO testtab SELECT generate_series(1, 100), generate_series(1001, 1100), 10; UPDATE testtab SET b = b + 1000 WHERE a = 1; -- At the same time, build a new index concurrently CREATE INDEX CONCURRENTLY testindx ON testtab(b);
Without the new index, each UPDATE can be a HOT update and hence the database does not insert new index entries. But when the new index is built, update becomes a non-HOT update. Now say, when the first phase of CIC is executed, it indexes the old version (1, 1001, 10) of the row. It will add an index entry for (b = 1001). If the race condition occurs, and when b is updated from 1001 to 2001, the buggy code skips index entry for (b = 2001) because it can’t yet see the new index and incorrectly assumes that the update can be a HOT update. As a result, if the user later queries the table for (b = 2001), it may not find the row (unless of course database decides not to use the new index because of cost estimates). On the other hand, since the new row is still reachable from the old index entry, if the user queries the table for (b = 1001), it may get the new row instead.
You might wonder what it has to do with the relcache. Well, the information about indexed columns is stored in the relcache. When a new index is added, the relache for the underlying table is invalidated and then rebuilt. But there was a race condition which caused the relcache information to remain stale and include the columns indexed by the new index. Subsequent UPDATEs used this stale information, they incorrectly assumed certain updates to be HOT updates.
Data corruption bugs are often very subtle and may not show up immediately and that’s one reason why they are very hard to diagnose and fix.
How We Discovered the Bug
To be honest, the bug was discovered quite accidentally. We’re currently working on a new feature called WARM which has tremendous potential to reduce write amplification caused by UPDATEs to a table with large number of indexes, but only very few of those index columns are actually updated. As part of the development process, we were running stress tests on patched PostgreSQL, dropping and creating indexes concurrently while the test is updating rows in many concurrent sessions. We noticed certain data consistency issues, which were obviously first attributed to the new feature. This was very natural since WARM touches these very areas of the database engine. But after spending hours and with careful review of the new code, we concluded that this is a pre-existing bug. After some more code reading, it was clear how relcache can get into a race condition, causing stale data and problems thereafter.
Once we’d theory in place, it wasn’t too hard to create a fully reproducible case which was duly reported to the PostgreSQL community.
Remedy
PostgreSQL community deliberated to a great length on what could be the correct fix. Given project’s focus on data consistency and integrity, we ensured that a bug fix is committed in time before the scheduled point releases went out. So all latest minor releases of PostgreSQL contain the bug fix. If you’re think if one or more of your indexes that were built CONCURRENTLY could possibly be corrupt, make sure you rebuild them after updating to the corresponding latest release.
Thanks you Pavan for identifying this and even fixing this asap!
This is a pretty common scenario where we CIC our production indexes and it’s possible that there are number of updates going on in parallel. So, I am curious how exactly you identified the data consistency issue while testing? Is there any way, where I can exactly verify which of my production db indexes have this bug present?