Streaming replication slots in PostgreSQL 9.4
Streaming replication slots are a pending feature in PostgreSQL 9.4, part of the logical changeset extraction work.
What are they for, what do you need to know, what changes?
What are replication slots?
Streaming replication slots are a new facility introduced in PostgreSQL 9.4. They are a persistent record of the state of a replica that is kept on the master server even when the replica is offline and disconnected.
They aren’t used for physical replication by default, so you’ll only be dealing with them if you enable their use. This post explains what they do, why they’re useful for physical replication, and why they’re necessary for logical replication.
Why slots?
As part of the ongoing work Andres has been doing on log-streaming logical replication, changeset extraction, and bidirectional replication, there’s been the need to better control WAL retention.
As streaming replication currently works, the master doesn’t make any effort to retain extra WAL for a standby that’s fallen behind. If the standby gets too far behind, the master will delete WAL segments the standby still needs to replay, and the standby cannot recover. You get an error like:
ERROR: requested WAL segment 00000001000000010000002D has already been removed
To prevent this, you are currently expected to configure continuous archiving and provide a restore_command to give the replica access to the archive, or be willing to re-create a standby that falls behind.
A less reliable alternative is to set wal_keep_segments to a “high enough” value to ensure that the replica never falls too far behind – but there’s no real way to guess what is high enough, and the higher the value is set, the greater the risk that the master will run out of space in pg_xlog. So generally archiving is preferred.
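As a rough sketch of that pre-slot setup (the paths, host name, and values here are illustrative assumptions, not recommendations), the master archives WAL and keeps a guessed-at margin in pg_xlog, while the standby falls back to the archive when streaming lags:

    # postgresql.conf on the master (illustrative values)
    wal_level = hot_standby
    archive_mode = on
    archive_command = 'cp %p /mnt/wal_archive/%f'   # or a tool such as WAL-E
    wal_keep_segments = 256                         # a guess at "high enough"

    # recovery.conf on the standby
    standby_mode = 'on'
    primary_conninfo = 'host=master.example.com user=replicator'
    restore_command = 'cp /mnt/wal_archive/%f %p'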
The master could retain WAL for a replica while it’s connected, but as soon as the replica disconnects for any reason the master forgets all about it. As transient disconnects are common, this means there’s not much point in trying to retain WAL just for connected replicas.
We’ve gotten away with this so far because the option of falling back to continuous archiving has been available for when the master discards WAL the replicas still need. When combined with tools like WAL-E, a shared network file system, or a shared host to scp archives to/from, it’s done the job pretty well. It’s a bit confusing for new users, but it works well.
With log streaming logical replication we can’t fall back on continuous archiving anymore, because the master must read WAL and decode it to send the results to the replicas; it can’t just send raw WAL records. Replicas cannot just read raw WAL from the master and extract what they require themselves because only the master has a record of transaction ID statuses, the local oids of objects, etc.
We could just teach the master to run restore_command (something people seem to assume happens anyway) – but that could get very inefficient when many replicas want the same WAL archive. We’d need a caching system and other complexity. It also doesn’t get rid of the need for the admin to manage retention for archived WAL.
Unlike with physical replication, the option of just re-seeding a replica that has fallen too far behind is not acceptable for logical replication. There may be other databases on the replica host that contain their own original data. In the bidirectional replication case, the replica database may also contain new data not yet replicated to other nodes. So we can’t just write the replica off and make a new pg_basebackup from the master or re-rsync it like we do with physical replication.
The introduction of slots is intended to deal with this issue by giving the master persistent knowledge of the state of replicas, even when they are disconnected. While it was added for logical replication, the concept is useful in physical replication as well, as it lets the master retain as much WAL as is required by all its standbys and no more. There’s no guesswork like there is with wal_keep_segments. There’s no need to set up a separate WAL archiving system and manage retention in it. So long as you have enough space in pg_xlog for the extra WAL, it just works. If a replica disconnects the master can keep WAL for it until it reconnects and replays the WAL.
The downside of this approach is that it adds another way for unbounded WAL growth to fill pg_xlog – when a replica cannot receive or replay WAL fast enough or remains offline for a long time, the master will keep on retaining WAL until pg_xlog fills up and it starts reporting errors to clients. This can already happen on a master with long checkpoint intervals under heavy load, but it’s much easier to trigger with slots, as a replica that goes offline indefinitely will cause pg_xlog to fill up.
While monitoring and adjustment of pg_xlog capacity is already necessary for a busy PostgreSQL server, it is much more so for a server that is using slots to manage WAL retention for replication.
Operational impact
By default, physical replication will not use slots. Nothing changes. You must use WAL archiving on top of streaming to ensure reliable replication.
You can enable the use of replication slots for physical replication with the primary_slot_name parameter in recovery.conf. The physical replication slot must already have been created on the master with the pg_create_physical_replication_slot(...) function; see the docs.
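A minimal sketch, assuming a slot named standby1_slot (the name is arbitrary): on the master you create the slot once,

    -- run on the master; the slot persists until explicitly dropped
    SELECT * FROM pg_create_physical_replication_slot('standby1_slot');

and in the standby’s recovery.conf, alongside the usual primary_conninfo:

    primary_slot_name = 'standby1_slot'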
When using slots to manage physical replication, WAL archiving is no longer required for replication (though you may wish to retain it for PITR anyway). You don’t need to deal with the difficulties of managing WAL retention amongst multiple replicas that may or may not be online, with replay positions that are not known to the master when offline. You don’t need to manage the space in the archive store.
Instead, you need to keep an eye on space in pg_xlog. WAL retention is managed automatically for you, but you must ensure there’s enough space available.
If a replica is retired from service or unrecoverably lost due to failure of the host, you must manually remove the slot from the master because the master won’t delete any xlog until you do. If you have a system where the master produces xlog rapidly, this may require a lot of pg_xlog space or fairly quick action. A monitoring system like Icinga or Zabbix is strongly recommended.
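For example, if the standby using the hypothetical standby1_slot above is decommissioned, you would drop its slot on the master so the retained WAL can be recycled:

    -- run on the master once the replica is gone for good
    SELECT pg_drop_replication_slot('standby1_slot');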
A replica that is unable to keep up with the master must be fixed or retired to prevent the master from running out of xlog space.
With slots you no longer need to monitor the replicas to determine their state, as accurate and up to date information is always available in the pg_replication_slots view. So all you need to do is monitor the master’s pg_xlog space and the status of all replication slots.
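As a sketch of that monitoring on 9.4 (function and column names as documented for 9.4; treat this as a starting point rather than a canned check), you can estimate how much WAL each slot is forcing the master to retain:

    -- bytes of WAL between the current insert position and each slot's restart point
    SELECT slot_name,
           active,
           pg_xlog_location_diff(pg_current_xlog_insert_location(), restart_lsn) AS retained_bytes
    FROM pg_replication_slots;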
Any wal_keep_segments parameter will be respected as a minimum for WAL retention.
So: using slots for physical replication is a trade-off. You don’t need archiving anymore, but you have to monitor the state of the system more closely to avoid disrupting the master.
Future work
To reduce the need for a large pg_xlog space allocation on what could be expensive high-performance storage, in future slots may be extended to support moving WAL to a separate location once it’s no longer required for the master’s own checkpointing.
Why not have a “wal_keep_max_segments” GUC as well?
That doesn’t sound like it would be too difficult (maybe it is?); it could be off by default, and it would still give users the ability to limit the maximum size of pg_xlog.
I would think there are enough cases where monitoring replication lag to the slaves, with the possibility of having to rebuild a slave, is far preferable to monitoring pg_xlog with a chance of losing the master.
Something like that should be possible in a later version. It’s not practical to do everything all at once.
I use this type of replication and it works very nicely.
I have trouble with backing up archived pg_xlog, because pg_xlog contains logs from 3 days before the timestamp.
Just wondering, why can’t decoding (for logical replication) be done from archived WAL? It’s possible for GoldenGate to read from online redo logs (the same as WAL in Postgres) as well as from archived logs (archived WAL).
If that worked, it would be a win-win: pg_xlog doesn’t hit the roof, and there’s no threat of a standby missing out on changes (no matter how long it’s offline).
Thanks
The only reason it can’t is that it’s not implemented. There’s no fundamental reason we can’t fetch WAL with something much like restore_command, then decode it and discard it again.
Some logic would be needed to cache the WAL segments, since often multiple decoding sessions need the same segments shortly after each other. And we couldn’t use restore_command because it’s not available in a running master, only from recovery.conf. But those are implementation issues.
If this is important for you please get in touch about development consulting – this is the sort of thing we can make happen.
Hello,
Thanks for a very informative post on replications slot!
Just had a couple of questions.
1: Replication slots are used to ensure that the standby server has the WAL it needs for replication. Does that mean that when you create a new replication slot, you’re basically creating another “pg_xlog” directory on the master server for the standby? And existing WAL records are copied to this new directory?
2: Each WAL segment is 16 MB and the default setting for wal_keep_segments is 32 (16 * 32 = 512 MB). If I am still using this default setting, does this mean that when I create a replication slot I should ensure that I have an additional 512 MB of space available for the WAL records?
In your post you said “If a replica is retired from service or unrecoverably lost due to failure of the host, you must manually remove the slot from the master because the master won’t delete any xlog until you do. If you have a system where the master produces xlog rapidly, this may require a lot of pg_xlog space or fairly quick action.”
This section of the article just made me curious about whether a replication slot creates a new pg_xlog directory, and whether the master won’t remove the files within it if the replica is lost.
Thanks!
1. Definitely not. It just controls when the server deletes them from the current pg_xlog directory.
2. No, replication slots are not additive with wal_keep_segments. A WAL segment is removed if it’s no longer needed. wal_keep_segments makes segments stay marked as “needed” for longer. So do replication slots. They’re independent ways of preserving WAL segment lifetime.
Thanks a lot for your answer!!!
I didn’t realize that wal_keep_segments specifies the minimum number of past WAL segments to retain in pg_xlog. So that means that Postgres will always keep generating new WAL segments as needed, but will always keep at least the minimum number of past segments specified by wal_keep_segments.
With regards to replication slots, they are used to tell the master not to remove old + new WAL records from pg_xlog until the standby server has received them? This is why the space can be consumed rapidly?
Sorry for the additional questions 🙁
Right, wal_keep_segments is a minimum.
Physical replication slots serve the same purpose but you don’t have to guess how much WAL must be retained, it’s managed automatically. But that also means there’s no limit on what’s retained and the disk can fill if the standby stops receiving.
At some point I want to modify replication slots so you can set a per-slot maximum WAL retention limit. When that limit is hit, like when you reach the wal_keep_segments limit, the standby will break unless there’s a working WAL archive and restore_command to fall back to.
@craig.ringer,
Thanks for your great explanation and comments.
What is the parameter to set a per-slot maximum WAL retention limit?
I didn’t find any documentation for this parameter in the PostgreSQL docs.
There isn’t one, though I believe one is being added for PostgreSQL 12. Check the release notes.
Hi All,
It’s a very informative post on replication slots! But I was thinking: whenever a WAL segment gets deleted from the master node for any reason and is no longer available to the slave node in wal_level = ‘logical’ replication, can’t we simply stop the other 2 slaves (via some provision to stop BDR on the slaves for a while), take a fresh backup of the db using a simple pg_dump command, restore it into a new_DB, and then start BDR on the new_DB instead of the old BDR that was running on the old DB?
So my ask here is: can you please help us with the command to stop BDR on a particular DB? As we know how to set up new BDR, we will be able to restore everything again from scratch.
This post is now quite old.
BDR uses logical replication and doesn’t directly copy WAL segments around, so you can’t really reason about it the same way.
You cannot simply dump and restore like that. There are complexities around consistency of snapshots and much more.
It’s not clear to me exactly what you want to achieve to be honest.