
PG Phriday: Around the World in Two Billion Transactions

April 19, 2019 / Shaun's PlanetPostgreSQL / by Shaun Thomas

Transaction IDs (XID) have been something of a thorn in Postgres’ side since the dawn of time. On one hand, they’re necessary to differentiate tuple visibility between past, present, and concurrent transactions. On the other hand, the counter that stores them is only 32 bits wide, meaning it will eventually wrap around without some kind of intervention.
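
Those counters aren’t hidden anywhere exotic, either; every row version carries them as system columns, which is easy to see firsthand. A minimal sketch, with the table and column names below as mere placeholders:

    -- xmin: the transaction that created this row version
    -- xmax: the transaction that deleted or updated it; 0 if the row is still live
    SELECT xmin, xmax, id
      FROM some_table
     LIMIT 3;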

Mailchimp’s recent misadventures make it readily apparent that XIDs are a fully loaded foot-gun given the right circumstances. And it’s not just them: Sentry documented their encounter back in 2015, and Semantics3 posted a great writeup of their experience in 2016. There have been other highly visible incidents like this over the years, and each is as painful to read as the last.

Reframing the Problem

The crux of the issue is the tacit assumption that two billion transactions is sufficient in any modern context. In many large-scale databases, that figure isn’t even particularly notable, let alone “enough”. A busy OLTP system can easily consume that many transactions in a day, or faster still with powerful enough hardware.

Robert Haas shared some analysis back in 2012, when developers were frantically tweaking the Postgres locking code, and demonstrated rather stunning performance increases. Back then, a 64-core server could already deliver over 350k read-only transactions per second. Now imagine a production database that actually managed a write throughput that high. It would take less than two hours to exhaust two billion transactions at that rate, and every table in the database would need to be frozen at least once every two hours.
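
A quick back-of-the-envelope check, using Postgres itself as the calculator, bears that out. The real ceiling depends on how much headroom autovacuum is configured to keep, but the order of magnitude is the point:

    -- Roughly 2^31 usable XIDs, consumed at 350,000 write transactions per second
    SELECT round(2147483648 / 350000.0 / 3600, 2) AS hours_until_wraparound;
    -- result: about 1.7 hours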

That’s simply not practical. It may not even be possible, given that table access and vacuum are constantly vying for the same IO resources. Even a fully in-memory database may simply have too many non-frozen pages spread across hundreds or thousands of tables. At high enough volume, constant anti-wraparound vacuums are fighting a losing battle.

It’s the Monitoring, Stupid!

Maintaining eternal vigilance over transaction IDs is absolutely necessary. Countless articles on monitoring wraparound, from professional consultants and hobbyists alike, practically wallpaper the internet at this point. Even Amazon has explained how to use their interface to create a warning system against XID wraparound for users of their RDS platform.
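
For reference, the heart of most of those monitoring recipes is little more than a pair of catalog queries along these lines; the exact thresholds and alerting glue vary, but the raw numbers come from here:

    -- How far along is each database toward the two-billion ceiling?
    SELECT datname,
           age(datfrozenxid) AS xid_age,
           round(age(datfrozenxid) * 100.0 / 2147483648, 2) AS pct_toward_wraparound
      FROM pg_database
     ORDER BY xid_age DESC;

    -- Which tables are closest to forcing an anti-wraparound vacuum?
    SELECT relname, age(relfrozenxid) AS xid_age
      FROM pg_class
     WHERE relkind IN ('r', 'm', 't')
     ORDER BY xid_age DESC
     LIMIT 10;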

Everyone knows, yet it somehow remains a kind of dirty secret that still catches engineers off guard. It’s also extremely unintuitive in anything but a Multi-Version Concurrency Control (MVCC) context. If Postgres were a filesystem, we staunch defenders could say that XIDs are equivalent to remaining filesystem space, since both are fungible resources that should be closely monitored.

Yet perhaps that analogy doesn’t quite frame the situation. It’s more likely users see the database itself as a type of filesystem. In which case, who would feel the need to prevent perfectly ordinary files from suddenly corrupting themselves simply because they were too old? Anyone using a Copy-on-Write filesystem like ZFS gets a benefit similar to what MVCC provides, yet there’s no equivalent of XID wraparound there.

It would be folly to presume XID exhaustion is merely the outcome of lazy, incompetent, or incomplete monitoring. Remember, if the database is active enough, even responding within an hour may be too late. Having monitoring is one thing; addressing alerts effectively is quite another.

A lesson many of us learn too late is that scope is rarely static. What was perfectly sufficient within the original architecture and design of the stack may have morphed over time into something entirely different. Maybe last year there wasn’t even a tiny risk of wraparound, and then a sudden influx of business, normally a momentous occasion, floods a system unprepared to handle that volume. Scaling up in that case would only make the problem worse unless there was a Postgres expert on hand who understood the implications.

Monitoring is only the first arrow in our quiver; one arrow cannot fell a mammoth.

How We Got Here

The way Postgres implements MVCC is the sticking point.

The origin of Postgres, despite many decisions that were ahead of their time, dates to a point in history when storage was extremely expensive and relatively tiny. As a result, one of those choices was to keep page and tuple headers as small as humanly possible and savor every last bit. Using 64 bits for XIDs would, after all, vastly increase the amount of metadata attached to every single row in the database.

Consider what’s in a Postgres tuple header. Were we to expand the XID from 32 to 64 bits, we’d need to expand not one but three header elements. That would inflate a tuple header from 23 bytes to 35, which works out to 12GB of extra overhead per billion rows! How much is 23 versus 35 bytes?

"I'll always like to eat" (23 bytes)
"I'll always like to eat my enemies!" (35 bytes)
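
The three XID-sized header fields in question are presumably t_xmin, t_xmax, and the combined t_field3 slot, and the pageinspect extension that ships with Postgres will happily show them on any real table. The table name below is just a placeholder:

    -- pageinspect exposes raw tuple headers, XID fields included
    CREATE EXTENSION IF NOT EXISTS pageinspect;

    SELECT lp, t_xmin, t_xmax, t_field3, t_ctid, t_hoff
      FROM heap_page_items(get_raw_page('some_table', 0));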

What’s more, existing rows couldn’t benefit from this kind of expansion without a very time-consuming and antiquated dump-and-restore procedure, something we’ve long since evolved beyond thanks to tools like pg_upgrade. Who wants to spend days or weeks transferring 50TB worth of data into a new target just to escape the XID issue? Using logical replication to avoid downtime might be the only viable way forward; luckily we’ve had that natively since Postgres 10, or through the pglogical extension since 9.4.
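
As a rough sketch of the native route on Postgres 10 or later, the publication and subscription pair looks something like the following; pglogical takes the same general shape with its own node and replication-set functions. All names and connection details here are placeholders:

    -- On the old (source) cluster:
    CREATE PUBLICATION escape_hatch FOR ALL TABLES;

    -- On the new (target) cluster, once the schema has been created there:
    CREATE SUBSCRIPTION escape_hatch
        CONNECTION 'host=old-primary dbname=prod user=replicator'
        PUBLICATION escape_hatch;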

There can’t be any half-measures here, either; the XID is everywhere in Postgres. Replacing it would mean a hard compatibility break with the entire backend for all previous versions and associated tooling. It would fundamentally alter what Postgres is. For the better, to be sure, but what an eventful transition it would be!

A Way Out

Perhaps the solution is to horizontally scale the database using sharding or some other intricate distribution mechanism. Yet even that only multiplies the headroom by the number of nodes in the cluster, and that assumes perfectly even volume and traffic balance. Ten nodes only buys a maximum of 20 billion transactions before the problem recurs.

Maybe it’s possible to use a multi-master solution like BDR? Swap the currently active database and vacuum the “offline” system as aggressively as necessary. That works so long as the write activity replicated from the remaining node isn’t disruptive, but at high enough volumes even this can’t keep up. Remember the hypothetical database from the beginning: we might have to swap nodes every few hours to remain viable. One wrong move and we’re toast!
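
For what it’s worth, “as aggressive as necessary” on the node that isn’t taking writes usually boils down to something like this sketch; the table name is a placeholder, and settings this blunt only make sense because nothing else is competing for the node:

    -- Disable cost-based throttling for this session and freeze in one pass
    SET vacuum_cost_delay = 0;
    VACUUM (FREEZE, VERBOSE) some_large_table;

Database-wide, vacuumdb --freeze with a --jobs setting accomplishes the same thing in parallel from the shell.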

It’s just kicking the can down the road, and that’s sorta what got us here in the first place. That’s what makes news like this so exciting!

Postgres now has pluggable table storage. Phew, this took longer than planned.

Brought to you in collaboration with Haribabu Kommi, Alvaro Herrera, Alexander Korotkov, David Rowley, Dimitri Golgov and others! https://t.co/YuBrw15VSJ https://t.co/kan1RqlcEC

— Andres Freund (Tech) (@AndresFreundTec) April 4, 2019

Once data storage becomes decoupled from the engine itself, a whole new world opens up. Instead of using a shoehorn to extend the life of Old Faithful, it can be replaced outright. I imagine there are several improvements the developers have been wanting to implement but couldn’t, due to restrictions born of the old header design. Now they can add a whole new system, and users can transition to it at their own pace.

Plug the replacement in, run an alter statement to rewrite each table into the new storage format, and call it a day. Wouldn’t that be nice?
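
At the time of writing the exact user-facing syntax is still settling, but the new table access method interface points toward something like the sketch below; zheap is used purely as an illustrative name, and the ALTER form is speculative until it actually lands:

    -- Create new tables directly on an alternative storage engine...
    CREATE TABLE new_stuff (id bigint, payload text) USING zheap;

    -- ...and, ideally one day, rewrite existing tables into it in place
    ALTER TABLE old_stuff SET ACCESS METHOD zheap;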

And what kinds of backends can we look forward to? Right now the list is fairly short. EnterpriseDB is currently working on zheap, which manages in-place tuple replacement similarly to Oracle’s rollback segments. There’s a columnar store on the way as well. What else might we expect?

What I’m mainly looking forward to is the native storage backend replacement for the main branch. What will it be? Will there be an integrated identifier so multi-master systems can uniquely version rows across the cluster? Who knows. But there could be. And if not, well… it would at least be possible to contribute a storage backend that did.

This development is of similar impact to when Postgres added extensions. If Postgres didn’t do something, adding it was “just a few lines of code away”. But that capability never extended to the storage system, because data durability is governed by the Write Ahead Log (WAL); without WAL integration there’s no crash safety, no logical or physical replication, and none of the features that depend on them.

Getting to that end result may still be one or two versions away, but now there’s a very clear light at the end of the tunnel. This has been a long time coming, and it pays to get it right the first time.

Tags: monitoring, PG Phriday, Pluggable Storage, Wraparound, XID