PG Phriday: 10 Things Postgres Could Improve – Part 1
Postgres is a database software engine with a lot to love, and certainly a non-zero amount to hate. Rather than approaching the topic from a glib and clickbaity angle, let’s see what’s really lurking in those dark recesses, and how we may potentially address them.
Though the original blog post calling out Postgres’ faults gave ten distinct bullet-points that need to be addressed, we can’t really justify ten posts as a response. Luckily, several of the critiques fall into distinct categories that we can address in turn. Thus, this series will come in four parts:
- Transaction ID Complications
- Pitfalls of Replication
- MVCC and Storage Concerns
- Engine Functionality
We will approach each of these topics in turn and attempt to give a fair assessment of how Postgres actually operates, what the drawbacks actually are, and if there are any workarounds possible until the underlying issue is fully resolved.
We hope you enjoy!
Transaction ID Complications
Postgres represents the transaction pointer as a 32-bit integer. I went over this in much more depth in my Around the World in Two Billion Transactions, article. Transaction wraparound is indeed a real issue, especially as database sizes and transaction volumes continue to increase.
From the perspective of a DBA or end-user, much of that article is not entirely reassuring, as it ultimately concludes with the message that, "Help is coming. Maybe." Maybe the new Postgres pluggable storage engines will alleviate transaction ID wraparound concerns in some capacity. Maybe adequate monitoring can stave off the worst risks.
Multiple Attack Vectors
What compounds this even further, is that older all-visible tuples should be "frozen" so they no longer artificially hold back the valuable cyclical transaction ID pointer. This means every table page must be visited at least one subsequent future time to clear that value, incurring both a read and potentially a write on every page in the database. Is this viable on a 1TB database? How about a 500TB database?
Eventually, sure. Allow autovacuum to do its job. Tune it to be as greedy as storage performance metrics will permit without adversely impacting query performance. That’s possible, but requires expertise that is in limited supply without contacting a 3rd party consulting organization like 2ndQuadrant.
And the situation gets even more confounded by the resolution paths. Any long-running transaction as identified by
pg_stat_statements could be holding the XID epoch artificially low, or preventing autovacuum from cleaning up critical pages. Additionally, enabling
hot_standby_feedback on a replica can cause even
SELECT queries to prevent tuple freezing on the Primary node.
Even more insidious is an issue few users even know to investigate: the presence of prepared transactions. If a transaction is prepared and subsequently abandoned without being cleaned up, it can also prevent the XID from advancing. These types of transactions persist across database restarts, and could lurk unseen for weeks or even months before anyone realizes there’s a problem, if they do at all. And the only way to tell this has happened is to consult the
pg_prepared_xacts system catalog, which most users don’t even know exists.
While disabled by default thanks to
max_prepared_transactions being set to 0 in new clusters for modern versions of Postgres, that wasn’t always the case. Older clusters that keep propagating old defaults through upgrades, or slightly more advanced users who enabled them explicitly may not know of the above risk. So while the risk here is rare and minimal, it’s one that should not be ignored. Readers of this blog may know, and Postgres subject matter experts of all description, but who else?
So how can we mitigate the risk here until a more permanent fix is forthcoming?
Currently one of the best approaches to partially address this particular difficulty, is to make liberal use of partitions. Assume partitioning is done by date, and older data is unlikely to experience further modifications past a certain point. In that scenario, some partitions could be frozen in perpetuity since their data pages would be free of new MVCC entries. This means all new data would exist in a kind of rolling live segment of recent partitions that are likely a much smaller proportion of available data.
Doing this is an invasive modification to the database design, and possibly the app itself. It also requires a certain level of familiarity to properly deploy, and again, likely would be best done with oversight from a consulting firm or some other available expert.
We also need to fastidiously monitor Postgres transaction ID age, along with all of the previously discussed elements. These are essential:
- The age-adjusted XID of each database in the instance.
- The amount of idle transactions in connected sessions.
- The maximum observed time of the oldest idle transaction.
- The amount of prepared transactions.
- The maximum observed time of the oldest prepared transaction.
- All of the same checks on any physical replica, assuming
While we could supply queries for each of these, there are Postgres monitoring tools such as check_postgres.pl that already implement all of them.
Finally, never disable autovacuum. Further, the default settings are incredibly conservative with regard to resource consumption. Modern disks are rarely starved of IO throughput, and should make autovacuum much more aggressive. At the very least, this means increasing
2000 or higher to allow background autovacuum workers to accrue longer run times between forced pauses.
Still, our countermeasures are comparatively few, and consist of monitoring and crude workarounds. The perceived problem itself is still present, and will likely remain so for several more years at minimum.
The fact of the matter is that users who encounter this without sufficient research will be caught completely off-guard. Even with adequate monitoring, addressing the situation isn’t always obvious or straightforward, and falls into an area rife with tripwires and buried in esoteric system catalogs. And in a worst case scenario, it’s still necessary to shut down the database so it can be frozen in an offline state.
The risk of corruption here is incredibly low due to all of the built-in safeguards, but that’s of little consolation to a company with an offline primary database. Whether by finally moving to a 64-bit XID tracking value, or through augmenting the Postgres storage engine to include something like zheap, this particular wart on the proverbial Postgres toad will continue to thwart unprepared adopters.
The bit about recommending vacuum_cost_limit be set to 2000 should perhaps read autovacuum_vacuum_cost_limit, or else recommend keeping the default of -1 there to continue to inherit the value. If someone has tweaked autovacuum_vacuum_cost_limit to set to something besides -1, and vacuum_cost_delay is still set to 0 as is default, then changing only vacuum_cost_limit will have zero impact on both auto and manual vacuum.
True. I tend to operate from the perspective that leaving the default of
autovacuum_vacuum_cost_limitat the default is the best policy, and to only tweak `vacuum_cost_limit` instead. That’s obviously not a requirement, or what may be currently deployed.