Failover slots for PostgreSQL

Logical decoding and logical replication are getting more attention in the PostgreSQL world. This means we need them working well alongside production HA systems – and it turns out there’s a problem there. Replication slots are not themselves synced to physical replicas, so you can’t continue to use a slot after a master failure results in promotion of a standby.

The failover slots patch changes that, syncing slot creation and updates to physical replica servers such as those maintained with WAL archives or streaming replication. That lets logical decoding clients seamlessly follow a failover promotion and continue replay without losing consistency.

Logical decoding and slots

Introduced in 9.4, logical decoding lets a client stream changes row-by-row in consistent transaction commit order to some receiver – which can be another PostgreSQL instance or an application of your choice, such as a message queue or a search engine.

To stream the row data the client connects to a replication slot on the server. The slot makes sure that the server retains the WAL needed for decoding and (for logical slots) also prevents the removal of old versions of system catalog rows that might be needed to understand that WAL.
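
For example, here is a minimal sketch using the test_decoding output plugin that ships with PostgreSQL; the slot name and plugin choice are just illustrative:

    -- Create a logical replication slot that uses the bundled
    -- test_decoding example output plugin.
    SELECT * FROM pg_create_logical_replication_slot('my_slot', 'test_decoding');

    -- Peek at pending changes without consuming them...
    SELECT * FROM pg_logical_slot_peek_changes('my_slot', NULL, NULL);

    -- ...or consume them, which advances the slot and lets the server
    -- discard the WAL and old catalog row versions it no longer needs.
    SELECT * FROM pg_logical_slot_get_changes('my_slot', NULL, NULL);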

The failover problem

Most production PostgreSQL deployments rely on streaming replication and/or WAL-archive-based replication (“physical replication”) as part of their high availability and failover capabilities. Unlike most server state, replication slots are not replicated from a master server to its replicas. When the master fails and a replica is promoted, any replication slots that existed on the master are missing. Logical replication clients cannot continue because they have no slot to connect to.

Easy, right? Just create a slot with the same name on the replica.

It’s not that simple. A newly created logical replication slot can only stream changes from the point in the database’s history at which it is created; it has no view of anything earlier. So if the client wasn’t fully up to date replaying from the slot on the old master, or if creating a replacement slot isn’t the very first thing done on the newly promoted master, the client will miss changes. That’s exactly what slots are meant to prevent, and it can be a critical problem for some applications.

The two key reasons a slot can’t be created “back in time” are WAL retention and – for logical slots – vacuuming of the system catalogs. (We also can’t get a snapshot of the past to export, but that’s only a problem when setting up a new client.)

WAL retention is the simpler one: the standby might have already discarded the WAL segments for changes that a logical decoding client on the master hadn’t yet replayed. After a failover there is no way to ever replay those changes to that client.

The other issue is the catalogs. Logical decoding needs the historical definitions of tables, types and so on to interpret the data in WAL, and to get them it does a sort of MVCC-based time travel using deleted catalog rows. If VACUUM on the master removes those deleted rows and marks the space free for re-use, the standby replays that change and we can no longer make sense of what’s in WAL.
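
You can see exactly what a slot pins by querying pg_replication_slots: restart_lsn is the oldest WAL the server must keep for that slot, and catalog_xmin (for logical slots) is the oldest set of catalog row versions VACUUM must still preserve. For example:

    -- What is each slot forcing the server to retain?
    SELECT slot_name,
           slot_type,
           restart_lsn,    -- oldest WAL that must be kept for this slot
           catalog_xmin,   -- oldest catalog row versions VACUUM must preserve
           active
    FROM pg_replication_slots;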

Failover slots

Failover slots address these issues by synchronizing slot creation, deletion and position updates to replica servers. This is done through the WAL stream like everything else. If a slot is created as a failover slot using the new failover boolean option to pg_create_logical_replication_slot, then its creation is logged in WAL. So are the subsequent position updates made when the client tells the server it can free no-longer-needed resources.
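
With the patch applied, creating a failover slot might look like the sketch below. The failover parameter is the patch’s proposed option described above, not something available in core PostgreSQL, and the slot name is illustrative:

    -- Sketch only: the failover option is the patch's proposal, not core PostgreSQL.
    SELECT * FROM pg_create_logical_replication_slot(
        'my_failover_slot',   -- slot name
        'test_decoding',      -- output plugin
        failover := true      -- WAL-log the slot so physical replicas maintain a copy
    );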

If the master fails and a standby is promoted, logical decoding clients can just reconnect to the standby and carry on as if nothing had happened. If a DNS update or IP swap is done along with the promotion the logical decoding client might not even notice it’s anything more than a master restart.

What about physical slots?

PostgreSQL’s block-level (“physical”) replication also has replication slots. They can be used to pin WAL so it is retained for a particular standby, providing a finer-grained mechanism than wal_keep_segments at the cost of retention being unbounded.

A physical failover slot can be created, just like a logical failover slot.
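
For reference, creating and using a plain physical slot works as below on stock PostgreSQL; the final call is a sketch of the patch’s proposed failover variant and is not available in core:

    -- On the master: create a physical replication slot (standard PostgreSQL).
    SELECT * FROM pg_create_physical_replication_slot('standby1_slot');

    -- On the standby, point streaming replication at it in recovery.conf:
    --   primary_slot_name = 'standby1_slot'

    -- Sketch of the patch's proposed failover variant (not core PostgreSQL):
    SELECT * FROM pg_create_physical_replication_slot('standby1_fo_slot', failover := true);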

Does this mean pglogical can be used to fail over to a logical replica?

Failover slots do not aid logical replication solutions in supporting failover to a logical replica. They exist to allow logical replication to follow a physical failover.

Supporting failover to a logical replica is a completely unrelated matter. There are a number of limitations in PostgreSQL core that are relevant to it, like the currently missing support for logical decoding of sequence position advances. Failover slots will neither help nor hinder there. What they do is provide a way to integrate logical replication into HA solutions now, into existing mature and established infrastructure patterns.

15 replies
  1. Kalyan R says:

    Hello,

    We are using Tableau Server, which uses PostgreSQL as the database to store its repository contents. I would like to know whether VSS integration is available for the PostgreSQL database. Every night we want to take a VM snapshot, and in case of failure we want to restore the VM and have PostgreSQL work (it surely will if VSS-aware writes are possible).

    Kindly advise.

    Thanks
    Kalyan

  2. misha says:

    Next step: to be able to get the initial copy of the heap, via a standby, using the master’s slot. We need the ability to set an exported snapshot on the standby.

    • craig.ringer says:

      Interesting idea. I’m not sure how practical it is though, given that exported snapshots rely on the exporting xact being directly visible to the importing xact. Trying to copy a snapshot export file to a standby and import it fails with ‘the source transaction nnnn is not running anymore’. We’d probably need a way to make sure the snapshot was still valid on the master at import time, then send hot standby feedback to preserve it.
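
      (For reference, this is how snapshot export and import work on a single server; the snapshot identifier below is illustrative. It’s the cross-server import described above that fails:)

        -- Session 1: export a snapshot inside an open transaction.
        BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
        SELECT pg_export_snapshot();   -- returns an identifier, e.g. '000003A1-1'

        -- Session 2, on the same server: adopt that snapshot. This only works
        -- while session 1's transaction is still open, which is why importing
        -- the same identifier on a standby fails as described above.
        BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
        SET TRANSACTION SNAPSHOT '000003A1-1';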

      Instead, what I’d like to see eventually is the ability to do logical decoding from a standby – and to have slots fail over from that standby to one or more cascading physical standbys.

      That’s going to require some work to allow logical decoding on a standby at all, first. The standby needs to add the catalog xmin of its logical slots to its hot standby feedback messages so that the primary knows to retain what they need. The primary then either needs a logical slot that is advanced to reflect that catalog xmin, or we need to optionally track a catalog xmin on physical slots when those slots are used by a physical standby whose clients are doing logical decoding.

      Then failover slots will need to be revised not to use WAL as the transport for slot updates. Slot information will need to be sent over the streaming replication protocol as a side-channel with a new message type instead, so that standbys can update slots and still have those slots be replicated to their own cascading standbys for failover. This is complicated by the need to make sure to tell all standbys about slot drops before fully forgetting about the slot on the upstream standby.

      By allowing logical decoding on a standby, the impact on the master is reduced to just the required catalog bloat.

      Ideally I’d have liked to go straight there with failover slots, skipping writing to WAL. I don’t think it’s practical in a 9.6 timeframe though and frankly 9.6 is already two releases later than it should ideally have been.

      The only issue I see with revising them to use a non-WAL transport later is that failover slot updates would no longer be subject to point-in-time recovery and no longer available to archive-based standbys. Only streaming replication. To get around that I’d need to write additional pseudo-archives with slot information and I’m not sure that’ll be worth doing. It’s not clear how useful having PITR of a slot is anyway – if the client has already replayed past the most recent LSN on the server at the point of timeline separation the slot isn’t consistent and serves no purpose.

      (BTW, a logical client can detect this situation by connecting over the walsender protocol, doing an IDENTIFY SYSTEM to get the current server timeline, a TIMELINE HISTORY n to get the timeline history file, and parsing the timeline history file to find out the LSN at which history diverged. Should probably make this easier.)
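
      (Roughly, over a replication-protocol connection – e.g. psql "dbname=mydb replication=database" – the check looks like the sketch below; the timeline ID is illustrative:)

        -- Issued over a replication-protocol connection.
        IDENTIFY_SYSTEM;        -- reports the server's current timeline ID
        TIMELINE_HISTORY 2;     -- returns the history file for that timeline
        -- Parse the returned history file to find the LSN at which the new
        -- timeline diverged, and compare it with the client's last replayed LSN.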

    • craig.ringer says:

      Oh, also, I think we should be able to use a logical decoding plugin to stream the initial state of a DB to client applications table-by-table.

      Oleksandr is doing some work on this by “faking” insert callbacks to the plugin, but I don’t think this is the way forward.

      Instead, while a slot has an exported snapshot, I want to be able to attach to that snapshot from other walsender or SQL connections and do a COPY-like operation that pipelines the data through the decoding plugin using a separate callback. The decoding plugin can then be aware that it’s not replaying inserts, it’s doing an initial state copy.

      (I’d also like to be able to ask a slot to take a new snapshot at the next convenient consistent point, while still replaying, so new COPYs can be done on an existing slot to do things like resynchronize/verify tables).

  3. harry says:

    Hi all,
    I am using replication slots to record change sets in a log table for incremental refresh of materialized views, the way Oracle does. In addition to this I am using a custom background worker to get the changes from the replication slot and write them into the log table.

    But I am in doubt: will failover slots help when the master fails and the standby becomes the new master, so that I can continue to get the change sets from the replication slot? Or will the data from around the crash be missed?

    Correct me if I am wrong!

    • craig.ringer says:

      Failover slots didn’t end up getting into PostgreSQL, so it’s kind of a moot point.

      For 10.0 I’m working on a new approach where we instead allow creation of separate slots on the replica and we maintain catalog_xmin using hot standby feedback. This requires support for logical decoding timeline following, which is submitted to Pg v10, and a few other patches.

      If it all gets into v10, your bgworker will be able to maintain slots on the replica and switch to using them on failover.

      For 9.6 and older there isn’t really any good solution to this. You can patch 9.6 to add logical decoding timeline following and some hacks to expose the low-level slot creation/drop routines (see the example in src/test in the logical decoding patch submitted to the mailing list), but it’s ugly and fragile.

  4. Sudalai says:

    Just a thought!!

    Why can’t we write all the info needed to decode a change in the WAL itself?
    We could decode a change at any time if the WAL had all the info.
    No need to prevent VACUUM from cleaning the dead rows, and no need to care about WAL recycling.
    Why are we complicating things?

    • craig.ringer says:

      > Why can’t we write all the info needed to decode a change in the WAL itself?

      Because it’d massively bloat WAL. A simple row INSERT would have to write out:

      * table name
      * column names
      * column data type definitions

      at minimum. Probably plenty more.

      It’s also wholly impractical to do, since plugins would then have to use a completely different API to interact with tuples; they couldn’t rely on the relcache and syscache that are used absolutely pervasively within PostgreSQL.

      Really, it won’t work.

      You might wonder why we don’t invoke the decoding plugin at WAL write time, instead of later, then write its _results_ into WAL. The trouble there is that sometimes we want to send different data to different peers, modify what we send based on options sent by the client, and so on. It’d also tend to multiply what we write in WAL quite a bit. Perhaps most importantly, it’s really hard to make safe, since we’d need a way to ensure that the decoding messages are written atomically with the original WAL record, so that all of them succeed or none of them do.

      So really, that’s a no-go.

      If it were simple we would’ve done it that way 😉

  5. Sudalai says:

    Yes, it increases WAL size.
    Schema name, table name, column name or ordinal, and the affected rows are enough to decode a change. I hope we can find some compact way to write them.
    We could add a new WAL level for this.
    Yeah, it is not simple; it requires a lot of changes in Postgres.

    Think about the advantages below:

    1) If the WAL has all the info in some structured format (for example, like the MySQL binlog row event structure), there is no need to create a logical replication slot and run an output plugin (change customization) in the Postgres server. We can do change customization anywhere if we have the WAL. It frees the Postgres server from the logical event decoding overhead.

    2) No need to restrict VACUUM.

    3) The WAL would be informative and structured, so we could write tools like binlog_connector in any programming language to decode changes. It would be easy to write many replication solutions.

    > If it were simple we would’ve done it that way 😉
    I agree 🙂 But it would simplify many things, I hope :).

    Thanks,
    -Sudalai

    • craig.ringer says:

      I don’t have experience with the MySQL binlog, so I can’t comment on that. I didn’t design the logical decoding facility either, that’s work primarily done by Andres Freund. At this point it’s what we have to work with, and knowing Andres I’m confident he carefully considered the trade-offs and alternate approaches before settling on the logical decoding model.

      It’s kind of a pointless discussion, since you’re speaking of a whole different way of doing logical replication that doesn’t exist, has no prototype implementation or solid design, and which nobody has the time/plans/funding/etc to implement. It would also have to offer truly compelling advantages to have any hope of acceptance into core PostgreSQL. We have quite a number of replication options already, so adding another would face a high barrier to entry.

      There is essentially zero chance that I or the rest of the team working on BDR, pglogical, etc would pick up and work on this. I’ve had no enquiries about anything like it and so far most customers are both interested in and fairly happy with logical decoding. Unless that changes I don’t anticipate any work in this direction.

      It’s important to understand that the vacuum pinning involved in logical decoding only affects the system catalogs. So in practice its effects are negligible unless you have massive temp table churn, and hopefully PostgreSQL will eventually get temp tables that don’t bloat the catalogs. The WAL retention requirement is more of an issue, but there are things we could do as optimisations there too.

      For example, I can see value in eagerly writing logical decoding output on the upstream server, in advance of its consumption by replication slots. This would be somewhat similar to what you’re thinking of, since the transformations could be done early and the required catalog_xmin (xid for vacuum) and restart_lsn (WAL) could be advanced immediately.

      The pre-prepared data would probably get fed back through another callback in the logical decoding plugin when a client wanted to receive it, so the plugin could filter it based on replication origins, etc. This would let you implement something more like what you describe on top of logical decoding without duplicating lots of functionality. The cached stream would still have to be per-output-plugin, and plugins could decide whether to write a single shared stream or one per replication slot.

      There are a few other complexities, but nothing insurmountable, so if you have a workload where you really need highly efficient logical decoding to many clients and/or with minimal catalog bloat and WAL retention, get in touch with [email protected] to discuss whether we can help with the required PostgreSQL core development work.

      Another optimisation would be to teach the xlogreader to restore xlog from WAL archives for replay. That way we could still purge WAL from pg_xlog if archive_mode is active and archive_command is returning success. This really depends on the recovery.conf refactoring work to make restore_command etc a GUC. Handily, Abhijit Menon-Sen is working on just that for PostgreSQL 10. Then it’d just require some changes to xlogreader. Same as above, reach out if this would be useful/important to you.

      So … I don’t think we have to rewrite the world here. There are a number of improvements that can be made based on the existing facilities that would achieve similar results more flexibly and without massive duplication.

  6. sanyam says:

    Is this patch available now in PostgreSQL?
    Which solution is better for handling the failover problem:

    1 - Using failover slots
    2 - Allowing creation of logical slots on a standby

    Is there any other approach to solve this issue?

    • craig.ringer says:

      The failover slots patch was not committed to PostgreSQL.

      2ndQuadrant Postgres (for support customers) has it, but we weren’t able to get it into core PostgreSQL despite a lot of effort.

      Right now there is no support for logical failover in community PostgreSQL, though pglogical3 will add some support when used on PostgreSQL 10.
