Postgres-BDR: It is also about fast, safe upgrades
In the past year, I have had conversations with many clients and prospects about Postgres-BDR. Postgres-BDR has been in production in many mission-critical deployments for quite some time, and we have more clients coming on board each month.
Recently, I have been meeting with companies in industries such as telecommunications, where downtime is measured in money. The systems need to be available 24×7. In other words, the systems need to be “Always-On”.
Postgres is an incredibly stable, feature-rich database with a permissive license. For these reasons, and many others, it is widely adopted.
Despite the wide adoption of Postgres, there are availability challenges. These challenges are usually dealt with using third-party products or technologies that are part of the open source Postgres ecosystem. Many 2ndQuadrant customers use repmgr (www.repmgr.org) to handle switchover and failover between streaming replicas.
Just about any automated master-standby failover solution incurs a one-to-two-minute outage when a master fails: typically it waits about sixty seconds to confirm the primary is down via a series of pings, and only then promotes a standby to become the new master. In a basic master-standby configuration, conservatively confirming that the primary is down is very important to avoid false-positive failover scenarios. Since a standby is going to be promoted, you want to be certain the existing master really is down. If you fail over when the master is not in fact down, you will have serious problems that I will describe in a separate blog article.
Since Postgres-BDR is a master-master solution, the consequences of failing over are not nearly as high, so a DBA can be much more aggressive in deciding whether a primary has failed. In addition, the time it takes to promote a standby to a master is eliminated. A switchover can be completed in under a second, and a failover can be completed as quickly as you are comfortable with (under a second in some cases). You may need to deal with conflict resolution, but that is far simpler than dealing with a split brain or the cost of a two-minute outage. I have skimmed over the details, but we do in fact have customers using BDR to increase availability in this manner. That is up to 1 minute and 59 seconds of downtime eliminated.
What about the downtime required for upgrades? This is equally important.
Upgrades include application upgrades, Postgres upgrades, BDR upgrades, and upgrades of the operating system and the hardware. When considering a high availability solution, the downtime for these tasks is often removed from the availability equation as “planned outages”. However, the system is just as unavailable as it would be in a failover scenario, and you are likely costing the business you support money. Perhaps the outage falls at a more convenient time for the user base, but with user bases being global these days there is never a good time for an outage. In addition, upgrades can involve critical bug fixes that can’t wait for a semi-convenient maintenance window.
Some Postgres upgrade scenarios can take over 30 minutes of downtime to complete and are difficult to test. Even a minor Postgres version upgrade requires a server restart. All of these tasks require downtime. The truth is you may never encounter a failover; however, you will definitely need to upgrade. So outages related to upgrades are very important to consider and plan for. In the case of a BDR deployment, you simply do not have to take such outages.
An often-overlooked fact is that BDR nodes can run different versions of Postgres, different versions of BDR, and in some cases different versions of the application across different nodes. The BDR protocol, along with logical replication, handles data synchronization between these versions for you.
You can simply introduce a new BDR node into the cluster with a more recent version of Postgres, wait for the data to be synchronized across the nodes, and, when you are ready, switch the application traffic to the upgraded node. No downtime!
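As a rough illustration of that pattern, here is a minimal SQL sketch for joining a freshly installed node to an existing BDR group. The function names follow the BDR3 documentation (bdr.create_node, bdr.join_node_group), but exact names and signatures vary between BDR versions, and the node names, group name, and DSNs shown are placeholders I have made up for the example, so treat this as a sketch rather than a copy-paste recipe.

```sql
-- On the new node (running the newer Postgres version), after installing BDR:
CREATE EXTENSION IF NOT EXISTS bdr;

-- Register this instance as a BDR node.
-- 'node_d' and the DSN are placeholder values for this sketch.
SELECT bdr.create_node(
    node_name := 'node_d',
    local_dsn := 'host=node-d.example.com port=5432 dbname=appdb'
);

-- Join the existing node group by pointing at any current member of the cluster.
SELECT bdr.join_node_group(
    join_target_dsn := 'host=node-a.example.com port=5432 dbname=appdb',
    node_group_name := 'appdb_group'
);

-- Once the join and initial synchronization have completed (monitor the BDR
-- catalog views appropriate to your version), repoint application traffic at
-- node_d. If anything misbehaves, traffic can be switched back to the old node.
```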
If that is not enough to convince you to consider Postgres-BDR for upgrades and for increasing your availability, how about this: when you upgrade Postgres, there is rarely a way back once users start writing to the new database. If performance goes ugly or errors start happening, you now have a fire drill that must be fought until the fire goes out. It can go on for days or weeks and, in the worst case, result in a long outage while you figure out how to get your users back onto the old software (if you in fact can).
In the case of upgrading by introducing a BDR node, if things go bad you simply switch the traffic back to the old node. No fires, no pain, no data loss. Yes, you can test, test, test to ensure you won’t encounter issues with an upgrade. However, how much testing is enough? Most of us have seen an upgrade, once put into production, result in a problem we didn’t consider.
Still not convinced? Consider this: if you have multiple BDR nodes in the cluster with different versions, you can have some of your users run on the new software for a while before migrating all of them. You can perform a rolling upgrade across your user base in a DevOps best-practice manner.
The term Blue-Green deployment, where you can roll back in the event of a failure, can now apply to Postgres and not just the application tier.
There are many great use cases for Postgres-BDR, such as geographic data sharding, getting the data close to the user, and write availability. However, for fast, reliable online database upgrades and Blue-Green deployments, BDR is also a great (and often overlooked) solution.
How many nodes can you have in a BDR setup? Is Postgres-BDR basically like Galera replication, but for PostgreSQL? Can you make DDL schema changes one node at a time?
I’ll answer this in terms of BDR3, since that’s what I’m mainly working with.
The node count is a bit of a “how long is a piece of string” question. It really depends on the workload, the network topology, the roles you require the nodes to have, etc. The practical ceiling varies; it’s currently constrained mainly by the mesh topology (all-to-all replication) and by the level of Raft chatter between nodes. I probably wouldn’t want to go above 50 active nodes right now, but there isn’t really a hard limit; the overheads just become a higher and higher percentage of total load.
It’s not the same as Galera at all, though it’s broadly in the same family of replication products. It’s more similar to Galera than it is to Mongo, CockroachDB, or RAC, but that’s about as far as it goes. I’m not equipped to give a detailed point-by-point comparison but I know there are large differences in inter-node communication and coordination, conflict detection and prevention model and features, and much more. BDR3 moves fast and is quite configurable, so it can meet a variety of different needs. It’s just not practical to give pat one-size-fits-all answers that are accurate for all likely workloads and user needs.
Regarding schema changes, many kinds of schema change may be made one node at a time. Some may not. Columns may be added (with some limits) or dropped node by node, indexes created and dropped, tables created and dropped (with some limits), and a variety of other operations performed. Some other types of DDL would be possible to deploy node by node, but aren’t presently supported for running that way. The main things you cannot do node by node are big in-place changes like altering a table to change the data type of a column, which are usually the same sorts of things you would never want to do in a single step anyway for locking and I/O reasons, or operations that don’t produce consistent and predictable results when the same statement is executed on multiple nodes, since DDL is transported using statement replication. The main time you’ll notice any DDL handling limitations in BDR is when you’re using schema change generators, “migrations” tools, etc. that aren’t aware of how to make changes in low-lock, table-rewrite-free ways. For example, if you try to
ALTER TABLE foo ADD COLUMN bar float8 DEFAULT random();
that’ll throw an error, as it cannot be sensibly handled by statement replication. You can instead add the column without a default, add the default so it affects only new rows, then UPDATE the table in batches to set the new value for existing rows (see the sketch at the end of this answer). The documentation covers things like this, and we run a strong consulting and support team to help customers out.
A word of advice: as BDR3 does not violate the basic laws of physics or require a wormhole generator for its operation, it is subject to the same fundamental rules as all data replication products. You’re bound by PACELC (a better formulation of CAP for databases). Some vendors’ sales material will spin their replication solutions as always-consistent, always-available, disaster-tolerant and partition-tolerant systems that somehow preserve both write availability and immediate global consistency even when the network is down, or restore it without conflicts and inconsistencies when the network comes back. I’d like to know where they got their tachyon communication beam hardware from, and whether they sell Tricorders there too, because they’re either breaking the rules of physics or there’s some very important small print. Probably something like “All these features are available, but may not all be used at the same time”.
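To make the ADD COLUMN workaround described above concrete, here is a minimal SQL sketch. The table and column come from the example in the answer; the `id` key column and the batching scheme are assumptions for illustration, and the batch size and backfill strategy will depend on your table.

```sql
-- Step 1: add the column with no default, so there is no table rewrite and no
-- volatile expression embedded in the replicated DDL statement.
ALTER TABLE foo ADD COLUMN bar float8;

-- Step 2: attach the default so newly inserted rows get a value. The default is
-- evaluated on the node where the row is inserted, and the resulting value is
-- replicated as row data, so all nodes stay consistent.
ALTER TABLE foo ALTER COLUMN bar SET DEFAULT random();

-- Step 3: backfill existing rows in small batches to keep locking and
-- replication traffic manageable. 'id' is an assumed primary key column.
UPDATE foo
SET    bar = random()
WHERE  id BETWEEN 1 AND 10000
  AND  bar IS NULL;
-- ...repeat for subsequent id ranges until no rows remain with bar IS NULL.
```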