When are we going to contribute BDR to PostgreSQL?
My colleagues and I are often asked “when will you (2ndQuadrant) contribute BDR to PostgreSQL?”
This makes an interesting assumption: that we have not already done so. We have. Much of BDR has already been contributed to core PostgreSQL, with more to come. All of BDR is under the PostgreSQL license and could be merged into PostgreSQL at any time if the community wished it (but it doesn’t and shouldn’t; see below).
It’s now quite trivial to implement an extension that adds simple multi-master on top of the facilities built into PostgreSQL 9.5 if you want to. See the work Konstantin Knizhnik and colleagues have been doing, for example. Many, if not most, of the facilities that make this possible were added for and as part of the BDR project.
There seems to be a widespread misunderstanding about how large pieces of work get contributed to PostgreSQL.
PostgreSQL is a complex piece of software, and it’s developed quite conservatively, focusing on correctness and maintainability over pace of development. This means that big “code drops”, where multiple features are added at once in changes that affect large parts of many different files across the server, are not welcome. Merging BDR in one go would be just such a code drop. As a result, even though BDR is released under the PostgreSQL license and could simply be merged into PostgreSQL tomorrow, that isn’t going to happen, and it shouldn’t.
That doesn’t mean we’re sitting idle, though.
Instead we’ve progressively worked with the rest of the community (including our competitors in the services and consulting space) to get the changes to the core server that BDR needs merged into PostgreSQL over the 9.3, 9.4 and 9.5 development series. The considerable majority of those features are now in place in 9.5, with more being worked on for 9.6. BDR on 9.4 requires quite a big patch to the core PostgreSQL server, whereas on 9.5 it would require very little as a result of all the work that’s been merged.
Now that most of the core server patches are merged, we’re working on getting some of the upper layers of the technology into PostgreSQL. Petr and I submitted the pglogical_output plugin to the November commitfest for 9.6, and the team is working to get the downstream/client ready for general availability and submission. The reason we’re doing this is to get it into core and into the hands of users who don’t want to rely on a patched server and a big extension, which is what BDR undeniably is.
A partial list of patches created for or as part of BDR and merged to core includes:
- Background workers (9.3)
- Event Triggers (9.3 and 9.4)
- Replication slots (9.4)
- Logical decoding (9.4)
- Enhanced DROP event triggers (9.4)
- REPLICA IDENTITY (9.4)
- Commit Timestamps (9.5)
- Replication Origins (9.5)
- DDL Deparse hooks (9.5)
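As a rough sketch of how several of these surface at the SQL level on 9.5 (the slot, origin and table names are invented for illustration; it assumes wal_level = logical, the contrib test_decoding plugin, and track_commit_timestamp = on):

```sql
-- Replication slots + logical decoding (9.4): create a slot that decodes
-- WAL using the contrib test_decoding output plugin.
SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding');

-- Replication origins (9.5): register an origin so replayed remote
-- transactions can be tracked and filtered to avoid replication loops.
SELECT pg_replication_origin_create('demo_origin');

-- Make a change, then read it back from the slot as decoded text.
CREATE TABLE demo (id integer PRIMARY KEY, note text);
INSERT INTO demo VALUES (1, 'hello');
SELECT data FROM pg_logical_slot_get_changes('demo_slot', NULL, NULL);

-- Commit timestamps (9.5): look up when the transaction that wrote a row
-- committed; timestamp-based conflict resolution builds on this.
SELECT pg_xact_commit_timestamp(xmin) FROM demo WHERE id = 1;
```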
Meanwhile we continue to work on getting further core enhancements into place to improve logical replication and multi-master capabilities, with work on sequence access methods, WAL messaging, logical decoding enhancements, and more. At the same time we’re contributing extensions that use that work, now that the foundations are strong enough to support them.
In other words: we have contributed BDR, and continue to contribute it. Not only have we contributed the code, but we’ve also contributed countless hours extracting pieces for submission, then revising, editing and reworking them to meet community standards.
The latest part of that effort is pglogical, which I’ll discuss in more detail in a follow-up post.
This would be very important. One reason many of us use MySQL is that it supports multi-master replication. If PG supported it I’d have no reason to give up my favorite database over the lack of one critical feature.
Given my involvement in the project, obviously I think logical replication and multi-master are useful and important. That said, I actually spend a lot of time trying to persuade people not to use it and stick to active/standby setups. I’ll address that in a follow-up post, as I think it requires more depth than a comment can address.
Hi
“I actually spend a lot of time trying to persuade people not to use it and stick to active/standby setups.”
Please tell me or point me to where I can study more about this.
Regards
Multi-master has costs. Sometimes it’s a tightly-coupled multi-master setup where the nodes are strongly interdependent, needing low latency and fast interconnects to work well. Or, as in the case of BDR, it’s loosely coupled, where the nodes are tolerant of disconnection and lag, but the trade-off is that there’s much less locking and transaction co-ordination between nodes.
BDR has limitations normal PostgreSQL does not. Do you know exactly what will happen if you insert a record on two nodes at roughly the same time, or while the nodes are suffering from network interruption? Do you know how your application will respond if you UPDATE the same record from two different nodes while a third node DELETEs it? If you don’t, then you shouldn’t be using asynchronous multi-master replication like BDR.
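To make the first of those questions concrete, here’s a hypothetical two-node sketch (the table and values are invented, and the resolution described is just one common policy rather than a statement of any particular system’s defaults):

```sql
-- On node A:
INSERT INTO accounts (id, owner) VALUES (42, 'alice');  -- commits locally

-- On node B, at about the same time, before A's change has replicated:
INSERT INTO accounts (id, owner) VALUES (42, 'bob');    -- also commits locally

-- Both transactions succeed from the application's point of view. When the
-- nodes later exchange changes, each hits a primary-key conflict on id = 42
-- and has to resolve it somehow, e.g. by keeping the row whose transaction
-- has the later commit timestamp. One of the two "successful" inserts is
-- quietly discarded, and the application only finds out if it goes looking.
```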
Most people’s needs are perfectly well met by active/standby failover with physical replication. They don’t need multi-master, and don’t understand the costs that come with it, especially geographically distributed loosely coupled multi-master.
One of the biggest advantages I see with multi-master setups is that no coordination is needed when a master fails. From there, you can simply monitor write lag between nodes to decide whether a node should be used, even after a node fails and comes back online. Some questions I have that just don’t seem to be handled today:
1. What happens if, after a master node fails and you start writing to the slave, the master node comes back up? How do you keep it offline until it has synced back up? A node coming online should start as a read-only slave or, better yet, be inaccessible for data access until it has synced up with any peer that is writable.
2. How do you configure your master/slave setup to handle swapping roles at runtime, without having to take a node offline and reconfigure it?
3. Why can’t the synchronization configuration be adjusted via the postgres connection itself, rather than needing a trigger file to halt sync? Requiring filesystem access to the systems in addition to postgres access seems like an unnecessary additional requirement that lowers the overall security of the databases. Likewise, access to the wal_archive should be via Postgres, not via a filesystem. It should all be seamless, with the wal_archive defined locally for a node, and other nodes using its archive as necessary.
The first two questions are much easier to tackle when you have a multi-master setup, even if application servers never write to more than one node at a time. In working with failover logic on ADC devices, I’ve found that the best way to configure systems in a redundant design is to keep the configurations symmetric, with only simple flags that can be changed and polled at runtime used to trigger failovers and adjust roles. As much as possible, everything should be handled auto-magically from there.
Additionally, outside of the fact that a node is part of a cluster and what criteria are needed to join a cluster, configuration should be kept to a minimum. No IP addresses or hostnames should be part of the configs, except in the security configuration, which, like the pg_hba file, allows subnets and permissions to be defined in a broader scope. Instead, a discovery mechanism should be in place, much like Hazelcast allows a multicast join to discover local nodes. Additionally, if only one write node is allowed at a time for client access, each read-only node should be able to redirect a client to the proper write node if the client attempts a write, so that applications don’t need to know up front who the master is either.
I’ll address your last points first. Automatic discovery, minimal configuration, and so on would be great, though they pose their own problems with distributed clusters and with security. Someone has to design and implement them, though. Right now, delivering the core functionality is enough of a challenge. It’s also a lot harder to effectively QA automagic systems in all the corner cases, requiring complex network simulator setups and the like. At this point I see that as the job of a management layer that can be added on top of the core server functionality by external tools.
I agree with you that the ability of read nodes to redirect writes to the master would be strongly desirable. I’ve wanted to add this to PostgreSQL for some time, but now that I know enough to do so I no longer have the time, due to other development priorities. I’m not aware of anyone working on it. All I can really say there is that … well, patches are welcome.
I don’t really understand your first question as currently phrased. If you’re asking about having nodes in an asynchronous MM setup reject writes until they’re reasonably up-to-date, I agree that’d be good to have.
For (2), you use a pooler in front. You do have to reconfigure the old master node, but tools like repmgr manage that for you, and with things like pg_rewind that’s going to get easier in future. It certainly could be better and more seamless, but like most things, that needs people to implement it.
For (3), the trigger file is partly historical. I’d like to have a pg_replica_promote() function or similar that could be called at the SQL level from a hot standby, and I’m surprised there isn’t one. It’d be simple to write one with pl/perl, pl/python or whatever to invoke pg_ctl promote though.
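Something along these lines would do it; this is only a rough sketch, assuming the untrusted plpythonu language is installed and pg_ctl is on the server OS user’s PATH. You’d create the function on the master (so it replicates to the standby over physical replication) and call it on the standby as a superuser:

```sql
-- Hypothetical SQL-callable promote for a hot standby, shelling out to
-- "pg_ctl promote" via untrusted PL/Python.
CREATE OR REPLACE FUNCTION standby_promote() RETURNS void AS $$
    import subprocess
    # pg_ctl needs this instance's data directory; data_directory is a
    # superuser-only setting, hence the superuser requirement above.
    row = plpy.execute("SELECT current_setting('data_directory') AS dd")[0]
    rc = subprocess.call(["pg_ctl", "promote", "-D", row["dd"]])
    if rc != 0:
        plpy.error("pg_ctl promote failed with exit code %d" % rc)
$$ LANGUAGE plpythonu;

-- On the standby you want to promote:
-- SELECT standby_promote();
```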
Part of the reason you can’t promote from SQL, though, is that you probably *need* to have file-system access to ensure that the old master is down hard. You don’t want it getting writes from apps that haven’t got the message. Even if we had a function to stop the master, there’s obviously no way to *start* a server without access outside the PostgreSQL protocol itself. So you need a management agent of some sort, be it an ssh connection to the shell, something like repmgrd, etc.
As for the WAL archive: with replication slots you don’t need a WAL archive anymore; streaming replication is sufficient by itself. However, if you are using a WAL archive, it makes no sense at all for the master to serve access to the archive to replicas, because the archive must be accessible to replicas even when the master is down. You’ll be interested to know that PgBarman can now serve WAL archive files and be used as a restore_command, which is pretty similar to what you want. The Barman developers are looking at adding support for Barman to act as a streaming replication server too, so that it appears to PostgreSQL replicas to be a master.
Finally: yes, multi-master appears to simplify management. It does so only by hiding other complexities. You’re either breaking consistency guarantees (async MM replication) or accepting some serious performance impacts (sync MM clustering). In the latter case you’ve got to deal with complex performance behaviour, and you’ve got to make decisions about when to take a node offline and redirect clients, etc. It’s not actually less complex than active/standby if you’re expecting a fully consistent system; in most cases it’s probably more complex.
Automagic, auto-discovered, auto-scaled auto-everything MM is handy if you don’t need ACID behaviour. If you do, it either gets complex or unpredictable. Fast.
The TL;DR is really: Yeah, there’s lots we’d love to do. Get involved and help make it happen.
Seems like a nag. Given the amount of QA, tooling and minor fixes still to go on BDR, they could help …
Indeed. At this point it’d be useful to review and test pglogical_output, which is a (revised, simplified) BDR output plugin. Or help with getting sequence access methods into core.
There are tons of open core issues that could be solved independently to make progress here, too. I and others have been quite up front about them. I just wrote a list for anyone interested, and the archiver should see it soon: http://www.postgresql.org/search/?m=1&q=CAMsr%2BYEfTEp35bQ5ia__nDwc2MwzYUQcuQ__vT_b%3DUNOHyooMA%40mail.gmail.com
For example, it’d be really useful for someone to work on making replication slots replicate over streaming replication and be preserved by pg_basebackup to allow physical failover between logical replication nodes. Or sequence support in logical decoding. Lots of possibilities.
BDR’s out there and the code is available, so it’s not like we’re keeping it hidden and locked away. We’ve worked hard to merge as much as possible while working within the PostgreSQL development process. People who want it to happen faster need to step up and help.
As a long time operator of Postgres, and someone who likely falls into that “competitor” category, I think 2Q should be commended for the approach they have taken with BDR. Given the size and complexity of the project, they have been very up front and open about their plans and expectations for the project, and how they foresee contributing those changes back to core Postgres. This is the model that operators have always tried to follow but it has been much more difficult to get the large development shops to follow along; I hope that other companies will follow the example of the BDR project in the future. Again, my kudos to Simon and the gang for showing real leadership in this area.
Thanks Robert.
It often feels like no matter what approach we take someone will be upset about it, so I really appreciate that.