PostgreSQL 13: Don’t let slots kill your primary
Álvaro Herrera

July 30, 2020 / in 2ndQuadrant, Alvaro's PlanetPostgreSQL, PostgreSQL 13 / by Álvaro Herrera

One of the interesting features in PostgreSQL since version 9.4 is the ability to control removal of WAL files using replication slots. The dark side is that replication slots can cause disks to fill up with old WAL, killing the main production server. In this article I explain PostgreSQL replication slots, and how a new feature in PostgreSQL 13 helps prevent this problem.

WAL Production

As you know, WAL is produced for database changes in a primary server: inserts, updates, et cetera. A more active database produces more WAL; in a very active server there can be many gigabytes of WAL produced every minute. WAL is written to files with names in an increasing numerical sequence, and the files are always the same size (16 MB is the default and typical size). Once the data in a file is no longer needed, that file can be recycled, which means renaming it to a higher-numbered position in the sequence so that it can be filled with new data later.

(There are special situations, such as a surge in activity, that lead to the creation of additional files; when the surge later dies down, those extra files are removed instead of recycled.)
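
If you are curious where the server currently is in that sequence, a quick check (just an illustration, using the standard functions pg_current_wal_lsn() and pg_walfile_name()) is:

SELECT pg_current_wal_lsn() AS current_lsn,
       pg_walfile_name(pg_current_wal_lsn()) AS current_wal_file;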

Because all database write activity produces WAL, it is critical for disk space to be available. When the disk storing WAL is full, the server will be unable to process new transactions and can become stuck, or worse: it may fall over completely. So this is a situation to be avoided by all means possible.

Replication Slots

Replication in PostgreSQL works by processing WAL files. In order for this to work, all WAL files have to remain available until they are processed. Therefore a mechanism is needed to tell the main WAL management not to recycle or remove files that are still needed.

Enter replication slots. Slots are a mechanism that indicates that this backup we’re taking will require that WAL file, and can you please not delete it yet; or this replica still hasn’t processed that WAL file, so can it please be left alone for a little while.

By themselves, replication slots occupy very little disk space. They just store a tiny bit of metadata, including a pointer to a position in WAL. But the WAL data that a slot protects is another matter: in a highly active server, it can be measured in gigabytes or worse.
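
For illustration, slots are created with a couple of built-in functions; the slot names below are made up for the example:

-- a physical slot, as used by a streaming standby or a backup tool
SELECT pg_create_physical_replication_slot('standby1');

-- a logical slot, using the built-in pgoutput decoding plugin
SELECT pg_create_logical_replication_slot('reporting', 'pgoutput');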

WAL Consumption

Feeding data to a physical replica means copying the WAL data from its primary server. Similarly, a logical replica needs to read WAL data (and transmit an interpreted version to the replica). The WAL position being read is what the slot keeps track of. Once the replica has secured the WAL data somehow, the slot can be advanced; this tells WAL management in the primary that the WAL file is then available for removal. This happens continuously while the replica is active, so the WAL retained in the primary server stays at roughly the same amount of disk space, or maybe just a little more. Even twice as much or ten times as much might be acceptable, depending on conditions.
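
A rough way to see how much WAL each slot is currently holding back (a sketch based on the restart_lsn column, which is the oldest position the slot still needs) is:

SELECT slot_name, active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
  FROM pg_catalog.pg_replication_slots;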

The problem is that if a replica dies completely and doesn’t recover for a long period of time; or the replica is destroyed and the DBA forgets to remove the replication slot; or the slot is a forgotten leftover of some experiment; or even the replica is being fed over a slow network link, then the reserved WAL will grow without bounds. And that becomes a ticking bomb.

Limiting Slot Size

In order to fight this problem, Kyotaro Horiguchi had been working since February 2017 on a PostgreSQL patch to limit the size of WAL reserved by a slot. After a very long review and rework process, I integrated it into PostgreSQL 13, improving management of high-availability PostgreSQL farms.

The main principle is that it is better to kill a replica (by somehow making its slot invalid; more on that below) than to kill the primary server that feeds that replica and take all production down with it.

The way it works is pretty straightforward: set max_slot_wal_keep_size (documentation) in postgresql.conf to the maximum amount of WAL disk space that replication slots are allowed to reserve. If a slot reaches that point and a checkpoint occurs, that slot will be marked invalid, and some WAL files may be deleted. If the slot was in active use by a walsender process, that process will be signalled so that it terminates. If the walsender starts again, it'll find that the necessary WAL files aren't there anymore. The replica using that slot will have to be recloned.
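
For example, to cap the WAL reserved by slots at 10 GB (the number is only an illustration; choose a value based on your disk size and how much replica lag you can tolerate), you can either edit postgresql.conf or use ALTER SYSTEM; a reload is enough, no restart is needed:

ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
SELECT pg_reload_conf();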

If max_slot_wal_keep_size is -1, which is the default value, then there is no limit. I don't recommend leaving it that way, because it leads to failures when runaway slots fill the disk.

Monitoring Slot Health

Also included are some monitoring features. Two columns in pg_replication_slots are relevant. The most critical one is wal_status. If that column is reserved, then the slot is pointing to data within max_wal_size; if it is extended, then it exceeded max_wal_size but is still protected by either wal_keep_size or max_slot_wal_keep_size (including when max_slot_wal_keep_size is -1, i.e. unlimited). Either state is good and normal. However, when a slot goes over the limit, it first becomes unreserved, which means it is in imminent danger, but can still recover if it advances quickly enough. Finally, the status becomes lost when WAL files have been removed and no recovery is possible.

The other column is safe_wal_size: it shows the number of bytes of WAL that can be written before this slot is in danger of having WAL files removed. We suggest keeping a close eye on this column in your monitoring system and firing alerts when it gets low. Zero or negative means your replica will be dead as soon as a checkpoint occurs:

SELECT slot_name, active, wal_status, safe_wal_size
  FROM pg_catalog.pg_replication_slots;
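
As a starting point for alerting, a query along these lines (the 5 GB threshold is just an example) flags slots that are near or past the limit:

SELECT slot_name, wal_status, safe_wal_size
  FROM pg_catalog.pg_replication_slots
 WHERE wal_status IN ('unreserved', 'lost')
    OR safe_wal_size < pg_size_bytes('5GB');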

We believe that this new feature makes maintenance of replicas easier and more robust; hopefully we won't see any more disasters where production goes down because of these problems.

(A note: safe_wal_size was introduced in 13beta3, so be sure to consult up-to-date documentation, or you’ll see min_safe_lsn instead. Ignore that.)

Thanks

Special thanks to Kyotaro Horiguchi for working on solving this problem. Several reviewers dug deep into this work; among them I'd like to especially thank Masahiko Sawada, Fujii Masao, Jehan-Guillaume de Rorthais, and Amit Kapila (in no particular order).
