Starting from Barman 1.6.1, PostgreSQL standby servers can rely on an “infinite” basin of WAL files and finally pre-fetch batches of WAL files in parallel from Barman, speeding up the restoration process as well as making the disaster recovery solution more resilient as a whole.
The master, the backup and the standby
Before we start, let’s define our playground. We have our PostgreSQL primary server, called angus, a server with Barman, called barman, and a third server with a reliable PostgreSQL standby, called chris (for various reasons I had to rule out a few other names, cliff among them). angus is a high workload server and is continuously backed up on barman, while chris is a hot standby server with streaming replication from angus enabled. This is a very simple, robust and cheap business continuity cluster that you can easily create with pure open source PostgreSQL, yet capable of reaching over 99.99% uptime in a year (according to our experience with several customers at 2ndQuadrant).
What we are going to do is instruct chris (the standby) to fetch WAL files from barman whenever streaming replication with angus is not working, as a fallback method, making the entire system more resilient and robust. The most typical examples of these problems are:
- temporary network failure between chris and angus
- prolonged downtime for chris, which causes the standby to go out of sync with angus
For further information, please refer to the Getting WAL files from Barman with ‘get-wal’ blog article that I wrote some time ago.
Technically, we will be configuring the standby server chris to remotely fetch WAL files from barman as part of the restore_command option in the recovery.conf file. Since the release of Barman 1.6.1 we can take advantage of parallel pre-fetching of WAL files, which exploits network bandwidth and reduces recovery time of the standby.
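To make the idea concrete, here is a minimal Python sketch of what a restore_command wrapper does at its core: it opens an Ssh connection to the Barman host and asks Barman for a single WAL file through get-wal. The helper name build_get_wal_command is hypothetical and this is not the code of the actual script, only an illustration that uses the host and server names of this article.

```python
import shlex


def build_get_wal_command(barman_host, server_name, wal_name, user="barman"):
    """Return the argv that fetches one WAL file from Barman over Ssh.

    Hypothetical helper, for illustration only.
    """
    # Command to run on the Barman host. The arguments are quoted because
    # the remote side hands the string to a shell.
    remote = "barman get-wal {} {}".format(
        shlex.quote(server_name), shlex.quote(wal_name)
    )
    return ["ssh", "{}@{}".format(user, barman_host), remote]
```

The resulting list can be passed to subprocess, with the standard output of the remote command redirected to the destination path that PostgreSQL provides as %p.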
This scenario has been tested on Linux systems only, and requires:
- Barman >= 1.6.1 on the barman server
- Python with the argparse module installed (available as a package for most Linux distributions) on chris
- Public Ssh key of the postgres user on chris in the ~/.ssh/authorized_keys file of the barman user on barman (procedure known as exchange of Ssh public key)
As the postgres user on chris, download the script from our Github repository into your favourite directory (e.g. directly in /var/lib/pgsql/bin) with:
cd ~postgres/bin
wget http://raw.githubusercontent.com/2ndquadrant-it/barman/master/scripts/barman-wal-restore
chmod +700 barman-wal-restore
Then verify it is working:
You will get this output message:
usage: barman-wal-restore [-h] [-V] [-U USER] [-s SECONDS] [-p JOBS] [-z] [-j]
                          BARMAN_HOST SERVER_NAME WAL_NAME WAL_DEST

This script will be used as a 'restore_command' based on the get-wal feature
of Barman. A ssh connection will be opened to the Barman host.

positional arguments:
  BARMAN_HOST           The host of the Barman server.
  SERVER_NAME           The server name configured in Barman from which WALs
                        are taken.
  WAL_NAME              this parameter has to be the value of the '%f' keyword
                        (according to 'restore_command').
  WAL_DEST              this parameter has to be the value of the '%p' keyword
                        (according to 'restore_command').

optional arguments:
  -h, --help            show this help message and exit
  -V, --version         show program's version number and exit
  -U USER, --user USER  The user used for the ssh connection to the Barman
                        server. Defaults to 'barman'.
  -s SECONDS, --sleep SECONDS
                        sleep for SECONDS after a failure of get-wal request.
                        Defaults to 0 (nowait).
  -p JOBS, --parallel JOBS
                        Specifies the number of files to peek and transfer in
                        parallel. Defaults to 0 (disabled).
  -z, --gzip            Transfer the WAL files compressed with gzip
  -j, --bzip2           Transfer the WAL files compressed with bzip2
If you get this output, the script has been installed correctly. Otherwise, you are most likely missing the argparse module in your system.
Configuration and setup
Locate the recovery.conf file on chris and properly set the restore_command option:
restore_command = '/var/lib/pgsql/bin/barman-wal-restore -p 8 -s 10 barman angus %f %p'
The above example will connect to barman as the barman user via Ssh and execute the get-wal command on the angus PostgreSQL server backed up in Barman. The script will pre-fetch up to 8 WAL files at a time and, by default, store them in a temporary folder (currently fixed: /var/tmp/barman-wal-restore).
In case of error, it will sleep for 10 seconds. Using the help page you can learn more about the available options and tune them to best fit your environment.
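The behaviour described above (look in the temporary folder first, fetch otherwise, sleep on failure) can be sketched in Python as follows. This is only an illustration of the logic, not the script's actual code: restore_wal and the fetch callable are hypothetical names, and fetch stands for whatever downloads the file from Barman.

```python
import os
import shutil
import time

SPOOL_DIR = "/var/tmp/barman-wal-restore"  # temporary folder named in this article


def restore_wal(wal_name, wal_dest, fetch, spool_dir=SPOOL_DIR, sleep_seconds=0):
    """Restore one WAL file, preferring a pre-fetched copy in the spool.

    `fetch` is a hypothetical callable(wal_name, wal_dest) -> bool that
    downloads the file from Barman. Returns True on success.
    """
    spooled = os.path.join(spool_dir, wal_name)
    if os.path.exists(spooled):
        # A previous invocation already pre-fetched this file: just move it.
        shutil.move(spooled, wal_dest)
        return True
    if fetch(wal_name, wal_dest):
        return True
    # Fetch failed: pause before reporting the failure to the caller.
    time.sleep(sleep_seconds)
    return False
```

With this structure, only the first request of a batch pays the cost of a round trip to Barman; the following ones are served from the local folder.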
All you have to do now is restart the standby server on chris and check from the PostgreSQL log that WALs are being fetched from Barman and restored:
Jul 15 15:57:21 chris postgres: [23-1] LOG: restored log file "00000001000019EA0000008A" from archive
You can also peek in the /var/tmp/barman-wal-restore directory and verify that the script has been executed.
Even Barman logs contain traces of this activity.
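If you prefer to check the PostgreSQL log programmatically, a few lines of Python are enough. The sketch below assumes log lines shaped like the one shown above; the function name restored_wals is hypothetical.

```python
import re

# Matches the message PostgreSQL logs after a successful restore_command.
RESTORED = re.compile(r'restored log file "([0-9A-F]{24})" from archive')


def restored_wals(log_lines):
    """Return the WAL file names that PostgreSQL reports as restored."""
    names = []
    for line in log_lines:
        match = RESTORED.search(line)
        if match:
            names.append(match.group(1))
    return names
```

Feeding it the lines of the standby's log file gives the ordered list of WAL files restored from the archive.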
This very simple Python script, which we have written and made available under GNU GPL 3, makes the PostgreSQL cluster more resilient, thanks to its tight cooperation with Barman.
It not only provides a stable fallback method for WAL fetching, but also protects PostgreSQL standby servers from the infamous 255 error returned by Ssh in the case of network problems, which is different from SIGTERM and is therefore treated as an exception by PostgreSQL, causing the recovery process to abort (see the “Archive Recovery Settings” section in the PostgreSQL documentation).
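The protection against the 255 error boils down to never letting that exit status reach PostgreSQL. A minimal sketch of the idea in Python follows; the names normalize_exit_code and run_restore are hypothetical, and the replacement value 2 is an assumption (any small non-zero status would serve the same purpose), not necessarily what the script returns.

```python
import subprocess

SSH_ERROR = 255    # exit status ssh uses when the connection itself fails
SAFE_FAILURE = 2   # assumption: an ordinary, small non-zero status


def normalize_exit_code(returncode):
    """Replace ssh's 255 with an ordinary failure status."""
    if returncode == SSH_ERROR:
        return SAFE_FAILURE
    return returncode


def run_restore(argv):
    """Run the given command and return its normalized exit status."""
    return normalize_exit_code(subprocess.call(argv))
```

A wrapper used as restore_command would end with sys.exit(run_restore(argv)), so that a network problem is reported to PostgreSQL as a plain "file not available" failure.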
Stay tuned with us and with Barman’s development as we continue to improve disaster recovery solutions for PostgreSQL. We would like to thank our friends at Subito.it, Navionics and Jobrapido for helping us with the development of this important feature, as well as many other 2ndQuadrant customers whom we cannot mention due to non-disclosure agreements but who still continue to support our work.
Side note: hopefully I won’t have to change the way I name servers due to AC/DC continuously changing their line-up. 😉