Dataloss at GitLab
GitLab, thanks for using PostgreSQL 9.6 and its replication and backup facilities.
We’re sorry that you lost your database:
Thank you for posting this publicly to allow us to comment on this for your postmortem analysis.
I’m very happy that you monitor Replication Lag, that is good. Replication lag of 4GB is at times normal, so shouldn’t have caused major concern. I’ve recently fixed a bug in replication that caused replication to hang in some cases for up to a minute; we released a public fix to that and it will be included in the next maintenance release of PostgreSQL 9.6. It’s not certain that the bug was hit and, if it was, whether that was enough to cause the slow down noted. The openness of your response means we should do the same also, so I’m mentioning this issue here for that reason.
Restarting replication was probably unnecessary but if you shutdown the standby node cleanly then replication should disconnect gracefully. If replication is disconnected gracefully we would expect it to release the WALSender used, so if you reconnect then there should now be a free WALSender connection available. There should be no need to alter the number of max_wal_senders parameter in normal usage. So it appears that the standby node was violently shut down preventing it from normal clean shutdown of the replication connection.
The lack of a clean shutdown meant that the walsender connection wasn’t released quickly enough and the new connection tried to use an additional connection, which then hit the limit of max_wal_senders, causing it to need to be increased. Not sure what max_wal_senders was before but it seems it shouldn’t have needed to be as high as 32 – certainly 2-4 would be normal. max_connections = 8000 is a massively high figure, so reducing it to 2000 makes sense anyway.
pg_basebackup first issues a checkpoint to ensure it gets the latest data for the backup, which by default is a paced checkpoint to avoid hitting the disks too hard. As a result it can take a few minutes, typically about 4 minutes with default settings to start up. During that time the pg_basebackup process is quiet while it waits for the checkpoint on the master to complete.
In the sequence of actions I see no new PostgreSQL bugs or usage problems with the software. Some issues were encountered but we have publicly available tools to ensure things work correctly.
Removing a data directory by manual actions is dangerous, as this proved. Managing replication using scripts is normal, so a test in the script to check that the replication is being set up on a standby rather than on the master would have trapped that error. We recommend the use of the repmgr tool for this purpose. “https://www.2ndquadrant.com/”/resources/repmgr/ repmgr 3.3 was released 6 Jan 2017 and is maintained regularly.
Recovering from backup is obviously critical, which is why we provide the barman tool for managing backups. Barman works with S3 and is fully tested to ensure it works when things go bad. “https://www.2ndquadrant.com/”/resources/barman/ Barman 2.1 was released 5 Jan 2017 and is maintained regularly.
I’ve been spreading the meme that you should test your backups by practicing recovery from them for some years and we teach that on our courses and at conferences. I recommend the use of repmgr and barman for managing replication and backup respectively.
We’re sorry for any frustration you may have experienced and we’re open to suggestions for how to improve PostgreSQL.
2ndQuadrant is the initial author of core PostgreSQL’s backup technologies and has extended the backup and replication technologies continuously since version 8.0. 2ndQuadrant provides 24/7 Support for PostgreSQL in the event of emergencies, as well as training worldwide.
Simon Riggs, CTO 2ndQuadrant & PostgreSQL Committer
great response. Thanks for the wonderful PostgreSQL, and your work on the backup and replication parts.
I was wondering if you could address how both 9.2 and 9.6 binaries were being used? Could this issue have been caught somehow?
> pg_basebackup first issues a checkpoint to ensure it gets the latest data for the backup, which by default is a paced checkpoint to avoid hitting the disks too hard. As a result it can take a few minutes, typically about 4 minutes with default settings to start up. During that time the pg_basebackup process is quiet while it waits for the checkpoint on the master to complete.
Seems like this could be something for PG to improve? If pg_basebackup output that it was issuing a checkpoint (and that it could take a while) then it wouldn’t appear to be doing nothing/stuck.
We need to be clear that the few minute wait (10 minutes was mentioned) was NOT responsible for the delay or apparent delay in replication that led to the operator deciding the re-establish replication. pg_basebackup wasn’t running at first.
The operator’s annoyance at having to “wait for 10 minutes” was because the operator was in a hurry to get home. It was that hurry that led to the poor decision to re-establish replication and then in their haste to attempt it manually and regrettably get it wrong.
I’m not sure there was any action needed at all. The replication delay was caused by the huge write spike on the database that came about because of the denial of service attempts from hackers. It would likely have reduced back to lower levels quite quickly.
For pg_basebackup, I’m not sure there is anything to improve there… most people don’t issue backups manually, and if they do I wouldn’t expect them to watch and wait. And if they did watch, I would expect them to think that a database utility would be doing something on the database – and if they checked on the database they would see it was actually active.
It seems possible we could start the backup at an earlier checkpoint and skip the few minute wait. I’ll look at the technical design to see if that is easily possible, though I am clear it was in no way the cause of the incident.
This is a usability issue. I’ve encountered it, too. It appears to do nothing for a long period of time, as if it’s not working at all. I think (it’s been a few months) that it might be the first thing it does, and hence there’s no output, either. Compare this to the progress display during the backup, which gives you a reasonable idea of how long it’ll take – isn’t it possible to do something similar?
You say that “most people don’t issue backups manually”, but I’m sure most of them do when learning to use the feature (which already has a significant learning curve for people who are not PG/database experts to start with), for small-scale operations (most likely not to have alternate backup mechanisms), and possibly in high-stress situations when the system is degraded in some way. If they give up at any of those points because it seems like it’s not working, that’s a problem, and in the end may lead to data loss.
In a way it’s similar to the way you start a backup with the intention of setting up streaming replication, and then find you ran out of WAL before you got replication working (in part because you can’t put the restore file into the right location beforehand because it’ll be overwritten by the base backup). I think that was fixed by making it possible to create a physical replication slot concurrently to the base backup, or at least via the same program.
If you don’t know how a command behaves, running it for the first time on a production system late at night when you’re in a hurry isn’t the best time to start. Testing things ahead of time is normal. The tool in question was behaving exactly correctly and always behaves this way, so nothing strange or unusual occurred. I find it hard to blame the tool in this case.
> max_connections = 8000 is a massively high figure, so reducing it to 2000 makes sense anyway.
If max_connections needs to be > 100, something is generally very wrong with the system design.
Can you add a note that that Barman and Repmgr are free? Some idiots think you are trying to sell them as products and it stops useful discussion on reddit etc. P.S. Thanks a ton for the helpful information quite useful.
Sure. Yes, both Barman and Repmgr are FOSS; there are no “Enterprise Editions” of them
The question of tool licenses was my first thought when reading through the article.
@Simon Riggs: thanks for the swift update on the situation.
@Simon Riggs: Nice and clear article.
When I read the FAQ, S3 seems not to be supported:
`I do not want to manage backup space and want my disaster recovery solution to be scalable and elastic in terms of disk resources. Do you support storage on Amazon EC2 facilities?
Currently, no. However barman has been designed to one day integrate WAL shipping and backup archiving with Amazon S3 storage. We are looking for sponsors willing to fund the development of this feature and add it to barman‘s open-source version.`
I cannot find anything about it in the docs. Does your statement about S3 point to the post-backup script feature, or is there something else?
Yes, its straightforward to save to S3 with a script
From the incident report :
“/var/opt/gitlab/postgresql/data is wiped to ensure a clean replication”
On what grounds did he team-member-1 do this?
Also why didn’t they have wal-archiving PITR in place? This along with the backup 6 hours ago might save them.
It would really help if you provide a decent a Barman HOWTO for newbies and will keep it up-to-date. Absolutely impossible to get streaming replication up and running from the first go.
PostgreSQL feature set is very rich and for a small team who are doing DevOps on their own it’s easy to lost their way to the reliable solution.
Mentioning relevant PRs in the Changelog will also help