At the core of the “10 Things I Hate About PostgreSQL” blog post, sits one particular observation that seems like a simple annoyance at first glance. MVCC is presented merely as item 4 in the list, but like XIDs, it serves as a star upon which an entire constellation of related stellar objects converge. While replication (both physical and logical) is a big part of Postgres’ future, despite a few hiccoughs along the way , MVCC is very much its past and present.
This time, we’re going straight into the beating heart of Postgres to take a good, hard look at how it manages storage. This won’t be an exhaustive treatise on MVCC; there are better places to get that information. But given what we know about Postgres and data storage in general, it’s possible to search for potential problem areas.
Who knows, we may even be pleasantly surprised.
A Perpetually Stacked Deck
Imagine a deck of cards that, once sorted just so, must remain in that configuration in perpetuity. If someone takes a card, they have to replace it exactly where it was. After a few such swaps, the card is gross and needs to be replaced, so we swap it with a new card of the same suit and face.
This is decidedly not how Postgres MVCC works. Instead of replacing the card right away, we just slap a new one on top of the deck and draw a big X on the old card so everyone knows to ignore it. Maybe eventually we’ll want to add a new card, and at that point we’ll replace a random previously “canceled” card.
To be sure, this is an extremely efficient approach in many ways. The existing card is still there, just in case we need it. Maybe the new card is a misprint, after all. Anyone can look at the old card, literally right up until we write the big X on it. What’s not to love?
This constant replacement shuffle has two rather unfortunate side-effects:
- The amount of “extra” cards in the deck is limited only by our chosen frequency for replacing crossed-out cards.
- Our desired sorting is lost the instant we deprecate the first card.
Postgres handles the first item with the
VACUUM command and the related
autovacuum background service. The default settings essentially “bake in” about 20% extra space, and consider that an equilibrium point where the amount of incoming and outgoing data is stable.
It’s this turnover that can eventually transform our previously shuffled deck into a random result with little resemblance to the original. It’s also the basis as to why some users express frustration in the arena of data storage in Postgres.
Index as Heap
But this kind of shuffling isn’t actually a common experience. Postgres users may take this for granted, but not all database engines enforce such strong heap and index role separation. Sure, Postgres does offer index-only scans, and even has
INCLUDE () syntax for covering indexes. But no matter what, the heap and index are always always separate. We can avoid visiting the heap in certain instances, but it must exist.
“10 Things I Hate About PostgreSQL” voiced this primarily as a waste of space, but the complaint goes much deeper than this. As we outlined above, the order of the heap is unreliable at best. Thus using the
CLUSTER command to sort the contents of a table by a specific index is a temporary measure unless the table is entirely static.
Assuming no sorting clauses, fetching rows from the heap is the cheapest possible action since it can be done sequentially. In the event the index is the heap, we don’t have to follow index node references in the absence of sorting parameters. Sorted results require using the clustered index, one of the secondary indexes, or sort in-memory. So all of our normal operations work the way we would expect.
If no uniqueness constraint is available, databases that use this approach enforce a kind of permanent uniqueness attribute tied to the database-level row identifier. These can’t be used for foreign keys, but are a perfectly fine substitute until a more suitable unique index is provided. In this case, such a table would act almost identically to the Postgres heap.
Are there drawbacks to this approach? It’s certainly possible to argue the availability of the heap as a “pristine” data copy for use in building other objects outside the influence of interpretive grouping and paging algorithms. With index existing outside the heap, isolated corruption can be corrected by a simple index rebuild. Then again, heap corruption happens too.
It’s possible this is merely an anachronism that has been “baked into” the engine in terms of momentum. Since the introduction of pluggable table storage in Postgres 12, there’s potential to rectify at least this oversight without having to go full Zheap.
Contrary to popular belief, Postgres TOAST storage does not mean “store compressible data in an external location”. There are actually several configurable thresholds and mechanisms for handling compression. In this case, we’ll refer to the TOAST documentation to explain why this is relevant.
Of note is the
MAIN tuple storage type, which will endeavor to compress the column value and store it inline with the rest of the tuple data in the heap. If this compression is still larger than the 8KB Postgres page size, the column contents will be stored externally. But for trivially compressible objects,
MAIN storage can prevent expensive external fetches from the TOAST table.
However, it is true that Postgres does not support what some refer to as “block level” compression. By this, we mean there is no algorithm that can collapse identical values shared among tuples into references within a page. For tables that contain a lot of repeated values such as dates, foreign key references, and so on, this could result in noticeable space savings.
In some ways, this is a natural consequence of Postgres relying on natural, uninterpreted heap storage as the base data layer. The heap is merely a series of tuple records that fit on an 8KB page, barring fill factor limitations. Once we inject cross-tuple compression effects, it becomes more of an index storage type. Then again, so long as we restrict the compression to a single page, every page read could be passed through inline transparent decompression layer before the rest of the engine interacts with it.
The main drawback here is that it would fundamentally alter the Postgres storage format in such a way that the upgrade would require a dump/restore between versions. Alternatively since the compressed files would be easily distinguishable, tables could simply be marked as
UNCOMPRESSED until later modified with some kind of
ALTER TABLE statement.
In cases like this, it may simply be a matter of there not being enough demand for such a feature. The community response to this is usually “patches welcome”.
VACUUM as Garbage Collection
Which brings us to the need for actually enforcing heap integrity itself. On the surface, the concept of
VACUUM itself is simple: mark deprecated rows as reusable space. The complex machinations of what the
VACUUM command actually does, the knobs that manipulate its behavior, and the subtleties involved, are enough to baffle even practiced experts.
Consider the necessary task of row freezing. Visible tuples that have a very old XID reference eventually get their transaction ID negated by a row header attribute that marks the row as frozen. This allows the ceaseless winding of the 32-bit Postgres XID horizon by essentially erasing transaction IDs that are no longer relevant to ongoing session visibility.
FREEZE is not the same as
VACUUM at all, though it is tangentially related due to the inherent manipulation of row visibility. Despite this, there is no separate
FREEZE command. Thus this critical maintenance task is behind a paywall of sorts, since it cannot be decoupled from the reaping of dead tuples.
So how do we configure the various vacuum knobs to ensure freeze occurs on a timely basis as well? That topic alone can be (and has been) the subject of several related blog posts, articles, and webinars. Andrew Dunstan’s Managing Freezing in PostgreSQL is a good summary, as is the Tuple Freezing & Transaction Wrap around Through Pictures webinar co-hosted by Andrew Dunstan (again) and Tom Kincaid. And we have another post by Tomas Vondra on Autovacuum Tuning Basics that goes into more depth on how to set specific configuration parameters.
The copious amount of reading material is actually part of the dual-edged sword. It’s a deep and varied subject which really must be understood to a high degree of accuracy, lest some misinterpretation lead to overwhelmed disk IO or irregular cleanup. Lackadaisical settings can lead to excessive bloat or even risk XID wraparound. Alternatively, overly ambitious settings may cause a disruptive amount of storage throughput overhead.
At the end of the day, an Oracle user (for instance) will always deem this a design flaw. Any kind of maintenance that can be deferred carries the risk of scheduling concerns and throughput considerations. Then there’s the careful balancing act necessary to prevent inherent overhead effects while also keeping structures functional. To that end, there is some argument that a data structure that does not require this kind of constant upkeep is superior for some use cases.
To be fair, Postgres does implement several optimizations to expedite maintenance. If all tuples in a page are frozen, the entire page is marked. This can greatly accelerate table freeze operations. Combined with regular invocations via autovaccum, and overhead would be nearly indistinguishable from background noise on most systems.
In many ways, this discussion is a philosophical one. The initial decision to decouple heap and external augmentations such as indexes may have started as a technical limitation, but other database engines have clearly demonstrated that this distinction isn’t necessary and may lack many implicit benefits. In fact, one enhancement we didn’t mention earlier is that an integrated index/heap layer would dramatically reduce WAL traffic.
The Postgres approach to MVCC incurs the cost of maintenance. Yet unlike rollback segments which are commit-optimistic, Postgres tuples are transaction agnostic at rest. All Postgres needs to know at any time is whether or not a transaction committed. Database engines that rely on external rollback mechanisms must physically move reverted rows back into table storage before they’re usable. This makes crash recovery extremely expensive and time-consuming by comparison.
It would be nice to see inline block compression, though. Assuming the engine interacts exclusively with the decompressed page, only the read and write operations would need to worry about the compression calls. This is probably one of the easiest modifications to accomplish within the loose topic areas we’ve covered. It’s certainly a simpler task than outright replacing the heap storage system with an index-heap hybrid.
Still, these are all reasons why any Postgres DBA must be well acquainted with its implementation of MVCC. It defines what maintenance is necessary and why. It tells us how Postgres actually interacts with the data it stores. From there, we can understand the implicit limitations and strengths. It’s rare for esoteric details like this to be absolute deal-breakers or selling-points.
Despite the fact Postgres has leveraged its strengths admirably, there’s always room for improvement. Only time will tell which direction things ultimately go.