PG Phriday: 10 Things Postgres Could Improve – Part 2

Last week, we examined Postgres XID wraparound complications in greater depth to see exactly how easily they can surprise even prepared enterprises. We also pinpointed the real problem areas and how to mitigate them.

In this continuing series to plumb the depths of Postgres constructively, we're going to focus on some of the ways Postgres replication may either work counterintuitively or result in complications. And just like last time, we'll discuss how to address each situation so we can get the most out of our Postgres database cluster.

Let’s get started.

Framing Postgres Replication

Back in the Bad Old Days before 9.0, Postgres didn’t have replication at all. The only thing that existed was continuous crash recovery, accomplished by applying WAL files shipped from an upstream Primary node.

In many ways, even the most current Postgres 12 release still reflects that model. All write activity is first written to the WAL, which replicas can now stream directly from the Primary node. Some under-the-hood modifications also made it possible to read from instances that are in the middle of WAL apply, but it is still very much a continuous crash-recovery apply mechanism.

This has some implications, both fortunate and less so.

Hey, You Missed a Spot!

The default Postgres replication model is an asynchronous one, meaning that any data not received by a replica may be permanently lost. Some have suggested that synchronous replication is a solution to this, but that is only the case provided at least two replicas are present.

Why exactly is that? Try this experiment:

-- On the Primary node

CREATE TABLE foo ( id int );

-- Stop the synchronous standby. Then:

BEGIN;
INSERT INTO foo VALUES (1);
COMMIT;

The COMMIT should hang at this point. Now cancel the transaction, and select from the table. The supposedly canceled transaction is alive and well, despite having no acknowledgment from the synchronous standby. Even though the Postgres client refuses further action until the previous transaction is acknowledged, that didn’t stop it from occurring.
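
To confirm it for yourself, check the table afterwards. A minimal sketch of that verification, assuming a psql session and the foo table from the experiment above:

-- After cancelling the hung COMMIT (Ctrl+C in psql), still on the Primary:

SELECT * FROM foo;

-- Returns the row (1); the "cancelled" transaction was committed
-- locally despite never being acknowledged by the standby.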

This is actually the expected behavior since Postgres uses WAL streaming for replication. In order for the transaction to be replayed on a replica, it must first exist in the stream. Ostensibly then, the transaction COMMIT record must also appear in the WAL before being consumed by the standby, even if it is synchronous.

Additionally, Postgres does not use a quorum during any of this. There is configuration syntax to demand that multiple servers acknowledge a transaction, but even that doesn't prevent it from being committed locally. It only blocks further transactions from that particular client session.
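
For reference, that acknowledgment syntax looks something like the following. The standby names here are hypothetical, and the ANY form requires Postgres 10 or later; even with it, the commit record still lands in the local WAL first.

-- On the Primary: require acknowledgment from any two of three named standbys.

ALTER SYSTEM SET synchronous_standby_names = 'ANY 2 (replica1, replica2, replica3)';
SELECT pg_reload_conf();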

As a result, the only safe way to use synchronous replication is to deploy at least two such replicas, and then hope only one goes down at a time. Otherwise, it is generally not possible to prevent lost data; if the Primary irrecoverably crashes while the Standby is also down, any transactions committed while the Standby was unreachable also vanish into the ether.

Ebb and Flow

Surely the window for actual data loss is small in that edge case. At most one transaction from each currently connected session may have been written on the lost instance. That's no real consolation in some contexts, however, and Postgres has no real answer for this particular scenario.

The easiest way to address this deficiency is to always maintain at least one more active synchronous stream than the design strictly requires. If the design model calls for a minimum of two synchronous streams, use three.

The additional streaming node in this case should always be a backup. Postgres itself ships with the pg_receivewal utility, which also supports synchronous mode via the --synchronous argument. It's important to create a replication slot and keep a synchronous stream running to avoid any gaps in the WAL stream. Files received this way should be archived long-term, and are then available for Point-In-Time Recovery (PITR) purposes.
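
As a rough sketch, assuming a dedicated archive host and hypothetical host, user, slot, and path names, the setup might look like this:

-- On the Primary: a physical slot prevents WAL from being recycled
-- before the archiver has received it.

SELECT pg_create_physical_replication_slot('wal_archive');

-- Then, from the archive host's shell:
--   pg_receivewal -h primary.example.com -U streaming_user \
--       --slot=wal_archive --synchronous -D /archive/wal

For the --synchronous stream to actually count toward commit acknowledgment, the connection's application_name must also appear in synchronous_standby_names.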

Given the coordination necessary here, we designed Barman to perform these tasks in addition to managing backups. A sufficiently configured Barman server will not only keep pg_receivewal running, but will also recreate the replication slot after a failover to a different node, and catalog the files themselves for easy retrieval during a recovery. Barman also uses Postgres streaming replication, so it is one of the only Postgres backup tools that could possibly reach RPO zero: the goal that no data is ever lost during a system outage.

In that case, we wouldn't need two full synchronous Postgres replicas in a cluster. We could fulfill the same write guarantee with one regular database instance and a Barman server, which we may need anyway if we want to follow best practices for disaster recovery.

Drinking From the Fire Hose

Since Postgres replication is WAL-based, everything written to the WAL is also written to the replication stream. This includes any ad-hoc maintenance activity such as VACUUM FULL, CLUSTER, or REINDEX, among others. These produce a burst of activity that must be replayed across every other physical replica.

It also means any data corruption on the Primary node always reaches those same replicas. But this is no surprise. Postgres physical replicas are essentially backups that are in continuous WAL recovery and also happen to remain online for read access. Except for hint bits, streaming replicas are byte-for-byte exact copies of the primary node. And if the wal_log_hints setting is enabled, or the database was created with checksums enabled, even those minutiae join the fire-hose.
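
One way to watch such a burst land on the replicas is to check replay lag from the Primary. A minimal sketch, using the Postgres 10+ column names in pg_stat_replication:

-- On the Primary: bytes of WAL each standby has yet to replay.

SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
  FROM pg_stat_replication;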

Taking a Sip Instead

Indeed, logical replication can address both of those physical replication shortcomings. The data change itself is replicated rather than all of the related page modifications. This Provider->Subscriber model decouples the binary dependence between the two nodes, and even makes it possible to mix otherwise incompatible data backends. Logical replication is also the key to zero-downtime major-version upgrades. What's not to love?

There is, unfortunately, one rather conspicuous caveat to those statements. Postgres logical replication requires the use of replication slots to carefully track the LSN replay position of the subscribing system. However, despite being introduced back in Postgres 9.4, slots are still not copied to physical replicas, nor retained in a backup. This means that if a Standby is promoted to replace the Primary, the last replicated position of every subscribed logical replica is lost.

The subscriber itself retains the last valid origin information in the pg_replication_origin_status view. This means we could recreate the slot on the new Primary to enable the subscriber to continue streaming changes.
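
A sketch of that recovery, assuming a hypothetical subscription name and that the promoted Primary still has the required WAL:

-- On the subscriber: the last LSN confirmed from the old origin.

SELECT external_id, remote_lsn
  FROM pg_replication_origin_status;

-- On the newly promoted Primary: recreate the missing slot. The slot
-- name and plugin depend on the subscription; built-in logical
-- replication uses the pgoutput plugin.

SELECT pg_create_logical_replication_slot('my_subscription', 'pgoutput');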

Maybe.

Slots work by preventing WAL recycling. If enough transaction volume pushes the LSN beyond the position required by the subscriber before it can reconnect, replication can never resume because it’s not possible to create slots for past WAL positions. This makes sense, because those past WAL files were likely recycled once they were no longer required.

This also reveals another somewhat uncomfortable fact about replication slots: they can’t be rewound even if the required WAL files have not been recycled. Get a connection string wrong and accidentally fast-forward a slot on the wrong server with something like wal2json? Too bad.
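
The slot catalog itself makes that one-way nature plain; pg_replication_slot_advance() (Postgres 11+) can move a slot forward, but nothing moves one backwards. A quick look, as a sketch:

-- On the node holding the slots: these positions only ever move forward.

SELECT slot_name, plugin, restart_lsn, confirmed_flush_lsn
  FROM pg_replication_slots;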

While logical replication provides Postgres with some powerful capabilities such as online upgrades, it’s not yet a real replacement for physical replication.

Plugging Holes in the Dam

So what is?

Physical replicas can revert to manually processing WAL files rather than streaming them from some origin system. Logical replicas do not have this option.

Why not? Postgres logical WAL replication works by using the origin Postgres instance to decode the WAL information via the local system catalog. This converts table and schema identities into their fully qualified equivalent. Rather than 3249.98791, we see a table named foo.bar. This is necessary because the Postgres catalog follows the same transaction rules as anything else. Hence if we rename foo.bar to foo.bun, the table name will be accurate based on when it is decoded.
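
The effect is easy to see with the contrib test_decoding plugin, assuming wal_level is set to logical. A throwaway slot and the foo table from earlier serve as a sketch:

-- Create a throwaway decoding slot, make a change, then peek at the
-- decoded stream without consuming it.

SELECT pg_create_logical_replication_slot('decode_demo', 'test_decoding');

INSERT INTO foo VALUES (2);

SELECT data FROM pg_logical_slot_peek_changes('decode_demo', NULL, NULL);
-- ... table public.foo: INSERT: id[integer]:2 ...

SELECT pg_drop_replication_slot('decode_demo');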

Obviously a subscriber system can’t do this; it has no idea which exact transaction may have modified a table’s name or column structure. Since decoding can only occur on an online origin node, merely having the related WAL files available to the subscriber system is insufficient.

In this case, we can do a couple of things to save ourselves. While replication slots are ostensibly a replacement for wal_keep_segments, this isn't true in practice, because slots are not reflected in backups or on Standby nodes.

Instead, we can follow the technique I outlined in Terrific Throughput Tracking and actually calculate our WAL production rate. If our failover process usually takes one or two minutes, we want to account for up to 10 times that much.

For example, if we normally produce 320MB of WAL traffic in two minutes (20 WAL files), we would want to set wal_keep_segments to 200. Thus our automated systems would have up to 20 minutes to recreate the replication slot before it would become necessary to rebuild any tables dependent on logical replication.
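
A minimal sketch of that measurement and setting, assuming Postgres 10 through 12 (Postgres 13 replaces wal_keep_segments with wal_keep_size) and a hypothetical baseline LSN:

-- Record the current WAL position, then compare again after the
-- sample interval (two minutes here).

SELECT pg_current_wal_lsn();   -- e.g. 0/5D000000

-- ...two minutes later...

SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '0/5D000000') AS bytes_in_two_minutes;

-- Roughly 20 x 16MB segments? Keep ten times that many around.

ALTER SYSTEM SET wal_keep_segments = 200;
SELECT pg_reload_conf();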

Sealing the Last Leak

What Postgres really needs to improve its logical replication situation is a minimum of two things:

  1. A mechanism for relaying replication slot information to Standby nodes.
  2. Some way of adding extra WAL content such that object names exist in decoded form within the WAL file itself, or as a supplementary transaction-pinned decode mapping.

So long as logical replication must depend on an online, living and breathing node, using information that literally cannot be backed up, it will remain essentially a novelty.

2ndQuadrant has addressed item 1 in our pglogical 3 extension. A pglogical 3 background worker can be configured to regularly copy slot positions from the upstream Primary for specified replication slots. Unfortunately this functionality is only available to commercial customers, and doesn’t solve the underlying problem for users of any community version of Postgres.

The second issue is something that has yet to be approached. WAL must be decoded by the node that produced it, yet slots only move forward. Combined with the fact that a logical subscriber can't decode the files for itself, the only remaining option is to rebuild the logical subscriber entirely. Depending on database sizes, that can be extremely resource and time intensive.

This is a weakness in Postgres that hearkens back to the ancient days before pg_upgrade made it possible to upgrade a Postgres database without performing a full dump/restore cycle. The thankless, resource-intensive, and time-consuming task of rebuilding a logical replica is currently taken for granted. Our users are not likely to be very forgiving of that drawback, especially if forced to repeat the process on a multi-TB system.

We can prepare for all of this, and circumvent much of it to a certain extent, but Postgres can do better.
