Back to the Future Pt. 1: Introduction to pg_rewind
Since PostgreSQL 9.5, pg_rewind has been able to make a former master follow a promoted standby even though, in the meantime, it has proceeded along its own timeline. Consider, for instance, the case of a switchover that didn’t work properly.
Have you ever experienced a "split brain" during a switchover operation? You know, when the goal is to switch the roles of the master and the standby, but instead you end up with two independent masters, each with its own timeline? For PostgreSQL DBAs in HA contexts, this is where pg_rewind comes in handy!
Until PostgreSQL 9.5, there was only one solution to this problem: re-synchronise the PGDATA of the downgraded master with a new base backup and add it back to the HA cluster as a new standby node. Generally, this is not a problem, unless the database is relatively big. For instance, with hundreds of GBs it is not easy to perform this kind of operation while keeping downtime as low as possible.
Restoring a database to a previous point in time can create some complexities that can be handled in different ways. For an in-depth explanation of the evolution and components of fault tolerance in PostgreSQL, I suggest you check out Gulcin’s series Evolution of Fault Tolerance in PostgreSQL, including the Time Travel post that mentions the use of pg_rewind.
How pg_rewind works
pg_rewind scans the "old" master’s PGDATA folder, identifies the data blocks changed during the switch to the new timeline, and then copies only those blocks from the promoted standby. As a "collateral" effect, the configuration files are also copied from the promoted standby (so the DBA has to be careful to adapt them to the new role of the node in the HA cluster). In any case, this avoids having to re-sync PGDATA completely.
To do this, it is necessary to have all the WALs produced by the old master in the final moments before the switchover. Changes are identified by comparing the data blocks present in the PGDATA with the changes logged in the WALs. Once the changed blocks are identified, the WALs are replayed, mimicking a sort of "rewind" of the timelines.
Moreover:
- the instances have to be initialised with the "-k" (or --data-checksums) parameter
- the parameter wal_log_hints has to be enabled
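If you are not sure whether an existing instance already satisfies these requirements, both settings can be verified with a quick check (a minimal sketch, assuming a local superuser connection):
# Check whether data checksums and/or wal_log_hints are enabled
psql -c "SHOW data_checksums"
psql -c "SHOW wal_log_hints"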
In PostgreSQL 9.5, the necessary WALs are the ones starting from the last checkpoint, since pg_rewind cannot go back before that point in the timeline.
To better understand how it works, consider this simple example with a master:
# Set PATH variable
export PATH=/usr/pgsql-9.5/bin:${PATH}
# This is the directory where we will be working on
# Feel free to change it and the rest of the script
# will adapt itself
WORKDIR=/var/lib/pgsql/9.5
# Environment variables for PGDATA and archive directories
MASTER_PGDATA=${WORKDIR}/master
STANDBY1_PGDATA=${WORKDIR}/standby1
ARCHIVE_DIR=${WORKDIR}/archive
# Initialise the cluster
initdb --data-checksums -D ${MASTER_PGDATA}
# Basic configuration of PostgreSQL
cat >> ${MASTER_PGDATA}/postgresql.conf <<EOF
archive_command = 'cp %p ${ARCHIVE_DIR}/%f'
archive_mode = on
wal_level = hot_standby
max_wal_senders = 10
min_wal_size = '32MB'
max_wal_size = '32MB'
hot_standby = on
wal_log_hints = on
EOF
cat >> ${MASTER_PGDATA}/pg_hba.conf <<EOF
# Trust local access for replication
# BE CAREFUL WHEN DOING THIS IN PRODUCTION
local replication replication trust
EOF
# Create the archive directory
mkdir -p ${ARCHIVE_DIR}
# Start the master
pg_ctl -D ${MASTER_PGDATA} -l ${WORKDIR}/master.log start
# Create the replication user
psql -c "CREATE USER replication WITH replication"
(note the small amount of WAL kept in the master due to the low max_wal_size), and then a standby:
# Create the first standby
pg_basebackup -D ${STANDBY1_PGDATA} -R -c fast -U replication -x
cat >> ${STANDBY1_PGDATA}/postgresql.conf <<EOF
port = 5433
EOF
# Start the first standby
pg_ctl -D ${STANDBY1_PGDATA} -l ${WORKDIR}/standby1.log start
Let’s insert some data on the master. You will be able to see it from the (hot) standby too.
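For instance (a minimal sketch, assuming the default postgres database and the ports used above; the table name t is just an example):
# Insert some data on the master (listening on port 5432)
psql -c "CREATE TABLE t(id integer)"
psql -c "INSERT INTO t SELECT generate_series(1, 1000)"
# The same rows are visible from the hot standby (port 5433)
psql -p 5433 -c "SELECT count(*) FROM t"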
Now promote the standby, leaving the master as it is:
pg_ctl -D ${STANDBY1_PGDATA} promote
Now if you update the master, no changes will be visible from the standby. Moreover, in the archive/ directory it is possible to see the file 00000002.history, which shows that there has been a change of timeline during the promotion.
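For example (a small sketch, reusing the hypothetical table created above):
# Update the old master: this change will not reach the promoted standby
psql -c "INSERT INTO t SELECT generate_series(1001, 2000)"
# The promotion created a new timeline, as recorded in the archived history file
ls ${ARCHIVE_DIR}/00000002.history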
Now let’s "rewind" the master, and make it follow the promoted standby:
~$ pg_ctl -D ${MASTER_PGDATA} stop
waiting for server to shut down.... done
server stopped
~$ pg_rewind --target-pgdata=${MASTER_PGDATA} \
--source-server="port=5433 user=postgres dbname=postgres"
Note here that for the connection to the source server (the promoted standby) we used the postgres user, since a superuser is needed by pg_rewind to inspect the data blocks.
If the max_wal_size parameter is not large enough to retain the needed WALs in the pg_xlog/ directory of the old master (as I deliberately arranged above), an error similar to the following one can be obtained:
The servers diverged at WAL position 0/3015938 on timeline 1.
could not open file "/var/lib/pgsql/9.5/master/pg_xlog/000000010000000000000002": No such file or directory
could not find previous WAL record at 0/3015938
Failure, exiting
Now, there are two possible ways to solve this:
- manually check the missing WALs in the archive, starting from the one listed in the error message, then copy them to the pg_xlog/ directory of the master (see the sketch below)
- add a proper restore_command in the recovery.conf and place it in the PGDATA of the master, so pg_rewind will automatically find the missing WALs
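For the first option, something like this can be enough (a minimal sketch, assuming the missing segment is still available in the archive directory used above):
# Copy the missing WAL segment(s) from the archive back into
# the pg_xlog/ directory of the old master
cp ${ARCHIVE_DIR}/000000010000000000000002 ${MASTER_PGDATA}/pg_xlog/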
The second option is probably the most suitable method. Think, for instance, of having the WAL archive managed by Barman: the restore_command could be based on the get-wal feature of Barman, as explained in this interesting article by Gabriele. Doing so, Barman would act as a WAL hub, providing all the necessary WAL files to pg_rewind.
Once the missing WALs are retrieved, you can run the pg_rewind command again, and the following message should confirm that everything worked properly:
~$ pg_rewind --target-pgdata=${MASTER_PGDATA} \
--source-server="port=5433 user=postgres dbname=postgres"
servers diverged at WAL position 0/3015938 on timeline 1
rewinding from last common checkpoint at 0/3000140 on timeline 1
Done!
Keep in mind that just a few blocks were copied (those changed during the split brain), even if your database size is hundreds of GBs! Remember that the configuration files are also copied, and that the recovery.conf already present in the "downgraded" master may have been overwritten. So, remember to:
- change the port where the instance listens in the postgresql.conf (set it as 5432 in our example)
- change the primary_conninfo in the recovery.conf so that the downgraded master connects to the promoted standby (the new master)
Once this has been done, start the downgraded master: it will follow the promoted standby and become its standby in turn.
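Putting it together for our example, the adjustments could look like this (a rough sketch, assuming the paths and ports used above; adapt the connection string to your environment):
# The postgresql.conf was copied from the standby, so restore the original port
sed -i "s/^port = 5433/port = 5432/" ${MASTER_PGDATA}/postgresql.conf
# Make the rewound master follow the promoted standby
cat > ${MASTER_PGDATA}/recovery.conf <<EOF
standby_mode = 'on'
primary_conninfo = 'port=5433 user=replication'
EOF
# Start the downgraded master: it is now a standby of the promoted node
pg_ctl -D ${MASTER_PGDATA} -l ${WORKDIR}/master.log start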
Do you have a more complex HA cluster? Don’t worry! Part 2 will explain this more in depth and talk about pg_rewind in PostgreSQL 9.5!
Note, pg_rewind only works in the case where the old master was stopped properly; there is no way to rewind if it crashed, was SIGKILL’ed, or similar.
Hi Lesovsky,
Thanks for your observation. I haven’t talked about this indeed. Actually, I just considered here the case of a *switchover* (*not* failover), i.e. a change of roles between master and standby, where the master is stopped properly before the promotion of the standby, and after having performed a checkpoint.
Note that there would definitely be ways to improve that. It just deserves some attention, but really there are use cases where people now just restart the server(s) and then shut them down immediately, which is really annoying.
Note as well that if you use wal_log_hints = on in postgresql.conf, there is no need for page-level checksums in initdb. It can just be one or the other, to be sure that a FPW is WAL-logged after a checkpoint when hint bits are updated on a page.
> Note as well that if you use wal_log_hints = on in postgresql.conf, there is no need of page-level checksums in initdb
Thanks Michael, didn’t know this!
> add a proper restore_command in the recovery.conf and place in the PGDATA of the master, so pg_rewind will automatically find the missing WALs
Does it really work? It seems there is nothing about it in the pg_rewind source code.
Hi Vladimir,
Yeah, this is misleading: after having tried this I realised that it cannot work this way (but I wrongly didn’t change the written part, too late now! 🙁 ).
Anyway, the get-wal feature of Barman is really interesting, and can still be used to retrieve the needed WALs from the archive (since it works independently). But it would be really nice to add a patch to pg_rewind that allows retrieving the needed WALs by specifying something similar to a ‘restore_command’.
However, thanks for having highlighted this.
> and there is no way to rewind if it’s crashed, SIGKILL’ed or similar.
This is not true. There is one workaround:
1) remove files from “pg_xlog/archive_status”
2) start your database in single-user mode (postgres --single -D -c archive_mode=on -c archive_command=false).
3) execute “checkpoint”.
4) press Ctrl+D to leave single-user mode.
Voila, now you have a clean shutdown and pg_rewind should work 🙂
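In shell terms, the workaround would look roughly like this (an untested sketch, assuming the MASTER_PGDATA path from the example above and the default postgres database):
# 1) Remove the pending archive status files
rm ${MASTER_PGDATA}/pg_xlog/archive_status/*
# 2-4) Start in single-user mode, force a checkpoint, then exit (EOF acts like Ctrl+D)
postgres --single -D ${MASTER_PGDATA} -c archive_mode=on -c archive_command=false postgres <<EOF
CHECKPOINT
EOF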
Hi Alexander,
Thanks for this procedure. I have never tried it before; does it really work for any kind of crash of the master? Anyway, as already highlighted to Lesovsky, I was considering here the case of a *switchover*, where the master is stopped properly, and not of a *failover*.
A switchover doesn’t need pg_rewind at all if you stop the master before promoting the standby. This has worked since 9.3.
Yes, but what if you stop the master after the promotion of the standby, and in the meantime new operations are executed? Or if something goes wrong while the old master is following up the new one as a standby, and you again end up with a second master proceeding along an independent timeline?
I know, I’m considering here really trivial corner cases, not possible if the switchover procedure is executed properly (and not really, really badly). But it is interesting to know how to use pg_rewind if something goes wrong.
I’ll discuss this better in the next parts! 🙂
Yes, I’m talking about using repmgr so I do not have to do all the steps that led me into this split-brain situation.
Can you explain to me how to “unregister” the old master, which is now in a FAILED state, and make it point to and follow the new master?
Which version of repmgr are you talking about? Anyway, repmgr supports pg_rewind with the command ‘repmgr standby switchover’. It is not designed to use pg_rewind after an autofailover. At this point you should unregister the old master, clone it through repmgr, then register it as a standby.
And more exactly how do you unregister the old master?
I found a way to somehow do that, but I don’t know if it is the recommended way or just a hack.
Here’s how i do it now:
>> service postgresql-9.6 stop
>> delete data directory on old master
>> repmgr -D /var/lib/pgsql/9.6/data/ -d repmgr -p 5432 -U repmgr -R postgres --verbose --force standby clone new_master_name
>> repmgr -f /etc/repmgr/9.6/repmgr.conf standby register --force
>> service postgresql-9.6 start
And after doing that the old master “transforms” into a standby and points to the new master.
Using pg_rewind should never be necessary if you practice STONITH: the old master is demoted or it is terminated, there is no in-between. If you’re running on virtualised infrastructure such as EC2, then use the provided APIs to stop/terminate the failed master. If you’re running on your own hardware, use the LOM features provided by your hardware to perform a hard shutdown. Anything less is going to cause data loss, since rewinding is DISCARDING PREVIOUSLY COMMITTED DATA. High availability isn’t perfect availability; accept the fundamental limitations of the CAP theorem and make your choice: consistency OR availability.