PG Phriday: Around the World in Two Billion Transactions

Transaction IDs (XID) have been something of a thorn in Postgres’ side since the dawn of time. On one hand, they’re necessary to differentiate tuple visibility between past, present, and concurrent transactions. On the other hand, the counter that stores them is only 32 bits wide, meaning it will eventually overflow without some kind of intervention.

Mailchimp’s recent misadventures may even make it readily apparent that XIDs are a fully loaded foot-gun given the right circumstances. And it’s not just them. Other high-profile incidents include Sentry’s encounter back in 2015, and Semantics3 posted a great writeup of their experience in 2016. There have been other highly visible events like this over the years, and each is as painful to read as the last.

Reframing the Problem

The crux of the issue is the tacit assumption that two billion transactions is sufficient in any modern context. In many large-scale databases, that volume isn’t even particularly notable, let alone “enough”. Extremely high-volume OLTP systems can easily consume that many per day, or even faster with powerful enough hardware.

Robert Haas shared some analysis back in 2012, when devs were frantically tweaking the Postgres locking code, and demonstrated rather stunning performance increases. Back then, a 64-core server could deliver over 350k read-only transactions per second. Now imagine a production database that actually managed a write transaction throughput that high. At that rate, it would take less than two hours to exhaust two billion transactions, and every table in the database would need to be frozen at least once every two hours.
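The arithmetic above is easy to sketch. A minimal back-of-envelope calculation, assuming (hypothetically) that the 350k TPS benchmark figure were all write transactions consuming XIDs:

```python
# Back-of-envelope: hours until ~2 billion XIDs are consumed at a given
# sustained write-transaction rate. 2^31 is the signed 32-bit midpoint
# relevant to wraparound, not the full 2^32 range.
XID_HORIZON = 2**31  # ~2.1 billion transactions

def hours_until_wraparound(tps: float) -> float:
    """Hours of sustained write traffic before the XID horizon is reached."""
    return XID_HORIZON / tps / 3600

# At the 350k TPS from the 2012 benchmarks, were they all writes:
print(round(hours_until_wraparound(350_000), 2))  # ~1.7 hours
```

Halve the throughput and the window merely doubles, which is still far too tight for any realistic freeze schedule.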

That’s simply not practical. It may not even be possible given table access and vacuum are both constantly vying for the same IO resources. Even a fully in-memory database may simply have too many non-frozen pages spread across hundreds or thousands of tables. At high enough volume, constant anti-wraparound vacuums may be fighting a losing battle.

It’s the Monitoring, Stupid!

Maintaining eternal vigilance against potential issues is absolutely necessary in regard to transaction IDs. Countless articles on monitoring wraparound from professional consultants and hobbyists practically wallpaper the internet at this point. Even Amazon explained how to use their interface to create a warning system to prevent XID wraparound problems for users of their RDS platform.
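To make the monitoring concrete, here is a minimal sketch of the headroom check such a warning system performs. The age value comes from Postgres itself, via the well-known catalog query `SELECT datname, age(datfrozenxid) FROM pg_database ORDER BY 2 DESC;` — the function and the 50% alert threshold below are illustrative assumptions, not a standard:

```python
# Sketch of a wraparound headroom check a monitoring script might run.
XID_HORIZON = 2**31  # signed 32-bit midpoint where wraparound danger begins

def wraparound_headroom(datfrozenxid_age: int, warn_fraction: float = 0.5):
    """Return (remaining XIDs, should_alert) for a database's frozen-XID age."""
    remaining = XID_HORIZON - datfrozenxid_age
    return remaining, datfrozenxid_age > XID_HORIZON * warn_fraction

# A database whose oldest unfrozen XID is 1.5 billion transactions old
# has under 650 million XIDs of headroom left and should page someone:
print(wraparound_headroom(1_500_000_000))
```

In practice the alert threshold should be well below autovacuum’s own trigger point, so humans hear about trouble before the emergency machinery does.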

Everyone knows, yet it somehow remains a kind of dirty secret that still catches engineers off guard. It’s also extremely unintuitive in anything but a Multi-Version Concurrency Control (MVCC) context. If Postgres were a filesystem, we staunch defenders could say that XIDs are equivalent to remaining filesystem space, since both are fungible resources that should be closely monitored.

Yet perhaps that analogy isn’t quite adequate for framing the situation. It’s more likely users see the database itself as a type of filesystem. In that case, who would feel the need to prevent perfectly ordinary files from suddenly corrupting themselves simply because they were too old? Anyone using a Copy-on-Write filesystem like ZFS gets benefits similar to what MVCC provides, yet there’s no equivalent to XID wraparound there.

It would be folly to presume XID exhaustion is merely the outcome of lazy, incompetent, or incomplete monitoring alone. Remember, if the database is active enough, even responding within an hour may be too late. Having monitoring is one thing; addressing alerts effectively is quite another.

A lesson many of us learn too late is that scope is almost never static. What was perfectly sufficient within the original architecture and design of the stack may have morphed over time into something entirely different. Maybe last year there wasn’t even a tiny risk of wraparound, and then a sudden influx of business (normally a momentous occasion) floods a system unprepared to handle that volume. Scaling up in this case would just make the problem worse unless there was a Postgres expert on hand who understood the implications.

Monitoring is only the first arrow in our quiver; one arrow cannot fell a mammoth.

How We Got Here

The way Postgres implements MVCC is the sticking point.

Postgres originated, despite many decisions that were ahead of their time, at a point in history when storage was extremely expensive and relatively tiny. As a result, one of those choices was to keep page and tuple headers as compact as humanly possible and savor every last bit. Using 64 bits for XIDs would, after all, vastly increase the amount of space spent on metadata for every single row in the database.

Consider what’s in a Postgres tuple header. Were we to expand the XID from 32 to 64 bits, we’d need to widen not one but three header fields: xmin, xmax, and the cid/xvac union. That would inflate a tuple header from 23 bytes to 35, or 12GB of extra overhead per billion rows! How much is 23 versus 35 bytes?

I'll always like to eat                 (23 bytes)
I'll always like to eat my enemies!     (35 bytes)
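The overhead math works out like so; this is just the arithmetic from the paragraph above, restated (the article’s “12GB” is the decimal figure, which is roughly 11.2 GiB in binary units):

```python
# Widening xmin, xmax, and the cid/xvac field from 4 to 8 bytes each
# adds 12 bytes per tuple header.
OLD_HEADER = 23                   # bytes: current minimal tuple header
NEW_HEADER = OLD_HEADER + 3 * 4   # three fields each gaining 4 bytes

def extra_overhead_gib(rows: int) -> float:
    """Extra header space, in GiB, for `rows` tuples after the widening."""
    return rows * (NEW_HEADER - OLD_HEADER) / 2**30

print(NEW_HEADER)                                    # 35
print(round(extra_overhead_gib(1_000_000_000), 1))   # ~11.2 GiB (~12GB decimal)
```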

What’s more, existing rows couldn’t benefit from this kind of expansion. That would mean a very time-consuming and antiquated dump-and-restore procedure, something we’ve long since evolved beyond thanks to tools like pg_upgrade. Who wants to spend days or weeks transferring 50TB worth of data into a new target that removes the XID issue? Using logical replication to prevent downtime might be the only viable way forward. Luckily we’ve had that natively since Postgres 10, or through the pglogical extension since 9.4.

There can’t be any half-measures here, either; the XID is everywhere in Postgres. Replacing it would mean a hard compatibility break with the entire backend for all previous versions and associated tooling. It would fundamentally alter what Postgres is. For the better, to be sure, but what an eventful transition it would be!

A Way Out

Perhaps the solution is to horizontally scale the database using sharding or some other distribution mechanism. Yet even that only multiplies the headroom by the number of nodes in the cluster, and only assuming perfectly even volume and traffic balance. Ten nodes still yield a maximum of 20 billion transactions before the problem recurs.

Maybe it’s possible to use a multi-master solution like BDR? Swap the current active database and vacuum the “offline” system as aggressively as necessary. That works so long as the write activity from the remaining node isn’t disruptive, but at high enough volumes even this approach can’t keep up. Remember the hypothetical high-throughput database from earlier: we may have to swap nodes every few hours to remain viable. One wrong move and we’re toast!

It’s just kicking the can down the road, and that’s sorta what got us here in the first place. That’s what makes news like this so exciting!

Once data storage becomes decoupled from the engine itself, a whole new world opens up. Instead of using a shoehorn to extend the life of Old Faithful, it can be replaced outright. I imagine there are several improvements the developers have wanted to implement but couldn’t, due to restrictions borne of the old header design. Now they can add a whole new system, and users can transition to it at their own pace.

Plug the replacement in, run an alter statement to rewrite each table into the new storage format, and call it a day. Wouldn’t that be nice?

And what kinds of backends can we look forward to? Right now the list is fairly short. EnterpriseDB is currently working on zheap, which handles in-place tuple replacement with rollback segments, much as Oracle does. There’s a columnar store on the way as well. What others can we expect?

What I’m mainly looking forward to is the native storage backend replacement for the main branch. What will it be? Will there be an integrated identifier so multi-master systems can uniquely version rows across the cluster? Who knows. But there could be. And if not, well… it would at least be possible to contribute a storage backend that did.

This development is similar in impact to Postgres gaining extensions. If Postgres didn’t do something, adding it was “just a few lines of code away”. But that capability never extended to the storage system, since data durability is governed by the Write Ahead Log (WAL). Without WAL integration, there’s no crash safety, no logical or physical replication, and none of the associated features.

Getting to that end result may still be one or two versions away, but now there’s a very clear light at the end of the tunnel. This has been a long time coming, and it pays to get it right the first time.
