Postgres-BDR: It is also about fast, safe upgrades
In the past year, I have had many conversations with clients and prospects about Postgres-BDR. For those of you who don’t know, Postgres-BDR has been in production in many mission-critical deployments for quite some time, and more clients come on board each month.
Most recently, I have been meeting with many companies in the telecommunications and finance sectors to discuss BDR. In such industries, as in many others, downtime is measured in money. The systems need to be available 24×7. Such systems need to be “Always-On”.
Postgres is an incredibly stable, feature-rich database with a permissive license. For these reasons, and many others, it is widely adopted.
Despite the wide adoption of Postgres, there are availability challenges that are dealt with via home-grown solutions, third-party products, or technologies that are part of the open source Postgres ecosystem. Many 2ndQuadrant customers use repmgr (www.repmgr.org) to handle switchover and failover between streaming replicas.
Just about any automated master-standby failover solution incurs a one-to-two-minute outage when a primary fails: typically the tooling waits around sixty seconds to confirm the primary is in fact down via a series of pings, and only then promotes a slave to become the primary. In a master-standby configuration, conservatively confirming that the primary is down is very important to avoid false positives in failover scenarios. Since a slave is going to be promoted, you want to be sure the existing master is really down before you spend the time it takes to promote a slave to a master and switch all the application traffic to the newly promoted master.
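To make the cost of that conservatism concrete, here is a minimal Python sketch of the kind of failure-detection loop such tooling performs before promoting a standby. The host name, check interval, and attempt count are illustrative assumptions, not the configuration or behavior of any particular tool:

```python
# Illustrative sketch only: the kind of conservative failure-detection loop
# that repmgr-style tooling runs before promoting a standby. The host name,
# interval, and attempt count below are hypothetical, not any tool's defaults.
import socket
import time

PRIMARY_HOST = "pg-primary.example.com"  # hypothetical
PRIMARY_PORT = 5432
CHECK_INTERVAL = 10   # seconds between checks
CHECK_ATTEMPTS = 6    # 6 x 10s = roughly a minute before declaring failure

def primary_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Cheap TCP reachability probe; real tooling queries Postgres itself."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def confirm_primary_down() -> bool:
    """Only declare the primary dead after repeated failed checks, to avoid
    a false positive that would promote a standby unnecessarily."""
    for _ in range(CHECK_ATTEMPTS):
        if primary_reachable(PRIMARY_HOST, PRIMARY_PORT):
            return False
        time.sleep(CHECK_INTERVAL)
    return True

if confirm_primary_down():
    # Promotion itself (e.g. "repmgr standby promote") and redirecting
    # application traffic add further time on top of the detection minute.
    print("primary confirmed down; promoting standby...")
```

Six failed checks ten seconds apart already account for a minute of the outage before promotion and traffic redirection even begin.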
Since Postgres-BDR is a master-master solution, the consequences of failing over are not nearly as high, and a DBA can be much more aggressive in deciding whether a primary has failed. In addition, the time it takes to promote a slave to a master in a master-standby solution is eliminated. A switchover can be completed in under a second, and a failover can be completed as quickly as you are comfortable with (under a second in some cases). You may need to deal with conflict resolution, but this is far simpler than dealing with a split brain or the cost of a two-minute outage. Yes, I have skimmed over the details, but we do in fact have customers using BDR to increase availability in this manner. That is up to one minute and 59 seconds of downtime eliminated.
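Because every BDR node accepts writes, “switching over” can be as simple as repointing connections. The sketch below is a hedged illustration of that idea, with hypothetical DSNs; in practice this logic usually lives in a connection pooler or proxy rather than in application code:

```python
# Illustrative sketch only: with two write-capable BDR nodes, "failing over"
# is little more than repointing connections. The DSNs are hypothetical, and
# in practice this logic usually lives in a connection pooler or proxy.
import psycopg2

NODE_DSNS = {
    "node_a": "host=bdr-a.example.com dbname=appdb user=app",  # hypothetical
    "node_b": "host=bdr-b.example.com dbname=appdb user=app",  # hypothetical
}
active_node = "node_a"

def get_connection():
    """Connect to the active node; on failure, flip to the peer.
    Both nodes accept writes, so there is no promotion step to wait for."""
    global active_node
    try:
        return psycopg2.connect(NODE_DSNS[active_node], connect_timeout=2)
    except psycopg2.OperationalError:
        active_node = "node_b" if active_node == "node_a" else "node_a"
        return psycopg2.connect(NODE_DSNS[active_node], connect_timeout=2)
```

There is no promotion step to wait for; the peer node is already writable, which is why the switch can complete in well under a second.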
What about the downtime required for upgrades? This is equally important.
Upgrades include application upgrades, Postgres upgrades, BDR upgrades, and possibly operating system and hardware upgrades. When considering a high availability solution, the downtime for these tasks is often removed from the availability equation as “planned outages”. However, the system is just as unavailable as it would be in a failover scenario, and you are likely costing the business you support money. Yes, perhaps at a more convenient time for the user base, but with user bases being global these days there is never a good time for an outage. In addition, upgrades can involve critical bug fixes that can’t wait for a semi-convenient maintenance window.
Some Postgres upgrade scenarios can take over 30 minutes of downtime to complete and are difficult to test. Even a minor Postgres version upgrade requires a server restart. All of these tasks require downtime. The truth is you may never encounter a failover, but you will definitely need to upgrade, so outages related to upgrades are very important to consider and plan for. In the case of a BDR deployment, you simply do not need to have such outages.
An often-overlooked fact is that BDR nodes can run different versions of Postgres, different versions of BDR, and in some cases different versions of the application across different nodes. The BDR protocol, along with logical replication, handles data synchronization between these versions for you.
You can simply introduce a new BDR node into the cluster with a more recent version of Postgres, wait for the data to synchronize across the nodes, and when you are ready, switch the application traffic to the upgraded node. No downtime!
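As a rough illustration of that rolling-upgrade flow, the sketch below waits for a newly joined node to catch up before traffic is moved to it. The DSNs and the downstream application_name are hypothetical, the lag query assumes Postgres 10+ column names in pg_stat_replication, and a real rollout would also consult BDR’s own replication status views:

```python
# Illustrative sketch only: wait for a newly joined node (running a newer
# Postgres) to catch up, then move traffic to it. DSNs and the downstream
# application_name are hypothetical; the lag query assumes Postgres 10+
# column names in pg_stat_replication, and a real rollout would also check
# BDR's own replication status views.
import time
import psycopg2

OLD_NODE_DSN = "host=bdr-old.example.com dbname=appdb user=app"  # hypothetical
NEW_NODE_DSN = "host=bdr-new.example.com dbname=appdb user=app"  # hypothetical

def replication_lag_bytes(upstream_dsn: str, downstream_appname: str) -> int:
    """Bytes the downstream still has to replay, measured on the upstream."""
    with psycopg2.connect(upstream_dsn) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT pg_wal_lsn_diff(sent_lsn, replay_lsn) "
            "FROM pg_stat_replication WHERE application_name = %s",
            (downstream_appname,),
        )
        row = cur.fetchone()
        if row is None or row[0] is None:
            return -1  # downstream not connected or not replaying yet
        return int(row[0])

# Poll until the new node has fully replayed, then flip application traffic.
while replication_lag_bytes(OLD_NODE_DSN, "bdr_new_node") != 0:  # name assumed
    time.sleep(5)
print("new node caught up; point the application at", NEW_NODE_DSN)
```

Once the lag reaches zero, you flip the application’s connection string (or pooler configuration) to the new node; the old node keeps receiving the changes made on the new one, which is what makes switching back trivial.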
OK, if that is not enough to convince you to consider Postgres-BDR for upgrades and for increasing your availability, how about this: when you upgrade Postgres, there is rarely a way back once users start writing to the new database. Performance turns ugly, errors start happening, and you now have a fire drill that must be fought until the fire goes out. It can go on for days or weeks, and in the worst-case scenario it results in a potentially long outage while you figure out how to get your users back to the old software (if you in fact can).
In the case of upgrading by introducing a BDR node, if things go bad you simply switch the traffic back to the old node. No fires, no pain, no data loss. Yes, you can test, test, test to ensure you won’t encounter issues with an upgrade. However, how much testing is enough? Most of us have seen an upgrade that, once put into production, resulted in a problem we didn’t consider.
Still not convinced? Consider this: if you have multiple BDR nodes in the cluster with different versions, you can have some of your users run on the new software for a while before migrating all of them. You can roll the upgrade out across your user base in a DevOps best-practice manner.
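One hedged way to do that from the application side is deterministic routing of a small fraction of users to the upgraded node, as in the sketch below; the DSNs and the 10% canary fraction are hypothetical:

```python
# Illustrative sketch only: deterministically route a small fraction of users
# to the upgraded ("green") node while everyone else stays on the existing
# ("blue") node. The DSNs and the 10% canary fraction are hypothetical.
import zlib

BLUE_DSN = "host=bdr-blue.example.com dbname=appdb user=app"    # hypothetical
GREEN_DSN = "host=bdr-green.example.com dbname=appdb user=app"  # hypothetical
CANARY_PERCENT = 10  # raise gradually; set to 0 to send everyone back to blue

def dsn_for_user(user_id: str) -> str:
    """Pin each user to one node so their sessions land consistently."""
    bucket = zlib.crc32(user_id.encode()) % 100
    return GREEN_DSN if bucket < CANARY_PERCENT else BLUE_DSN

print(dsn_for_user("alice"))  # example: decide where user "alice" connects
```

Raising CANARY_PERCENT gradually rolls the whole user base forward, and setting it back to zero sends everyone back to the old node.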
The term Blue-Green upgrade, where you can roll back in the event of a failure, can now apply to Postgres and not just the application tier.
There are many great use cases for Postgres-BDR, such as geographic data sharding, getting the data close to the user, and write availability. However, for fast, reliable online database upgrades and Blue-Green deployments, BDR is also a great (and often overlooked) solution.


