The Physics of Multi-Master

If you try to update the same data at the same time in multiple locations, your application has a significant problem, period.

That’s what I call the physics of multi-master.

How that problem manifests itself is really based upon your choice of technology. Choosing Postgres, Oracle or ProblemoDB won’t change the problem, just gives you choices for handling the problem.

If you choose single master, then you get an error because one of the nodes you try to update is read-only and so can’t be updated at all.

If you have multiple masters, then you get to choose between an early abort because of serialization errors, or a later difficulty when conflict resolution kicks in. Eager serialization causes massive delays on every transaction, even ones that have no conflict problems. A better, more performant way is to resolve any conflicts later, taking an optimistic approach that gives you no problems if you have no conflicts. That is why BDR supports post-commit conflict resolution semantics.

Or if you use a shared cache system like Oracle RAC then you get significant performance degradation as the data blocks ping around the cluster. The serialization is enforced by block-level locks.

There isn’t any way around this. You either get an ERROR, or you get some other hassle.

So BDR and multi-master isn’t supposed to be a magic bullet, its an option you can take advantage of for carefully designed applications. Details on BDR

Now some developers reading this will go “Man, I’m never touching that.” One thing to remember is that single master means one node, in one place in the world. Nice response times if you’re sitting right next to it, but what happens when you’re the other side of the planet? Will all the people using your application wait while you access your nice simple application?

Physics imposes limitations and we need database solutions to work around them.

4 replies
  1. Jakub Wartak
    Jakub Wartak says:

    “Or if you use a shared cache system like Oracle RAC then you get significant performance degradation as the data blocks ping around the cluster. The serialization is enforced by block-level locks.”

    I was actually always very curious why Postgres community doesn’t develop just like that for OLTP – shared everything with SAN (i know the answer that it won’t happen, talked with one of your guys – Petr – about it).

    I’m mostly Oracle guy, working with RAC systems and really the performance of that solution WHEN DONE PROPERLY isn’t that bad as you saying – actually it’s freaking AWESOME. RAC is master piece of engineering. It is simply not true in properly tuned system/app where you avoid those “block ping-pongs”/hotspots between nodes.

    Historically it was bad when RAC was called Oracle Parallel Servers somewhere starting in 90s and Oracle did not have Cache Fusion at that time. Now it has plenty of stuff that solve all those problems since 2004 at least: hash (sub)partitioned tables/indexes, REVERSE indexes, DB services to “partition the load” through listeners, cached no-ordeded sequences, Cache Fusion(via dedicated interconnects ethernet or IB ones, max two way/three way block transfers even in clusters of >=4 nodes), Dynamic Block Remastering, Past Images, avoiding disk pings (IO) in case of remastering dirty blocks, caching of previous versions of block for read consistency, handling custom small block (e.g. 2kB) tablespaces, tricks with minimize_rows_per_block/PCTFREE… and best of all that is nearly completely transparent to applications and rest of Oracle technologies without the need to worry about what can happen with write-write conflicts between nodes because part of RAC – GES – Global Enqueue Services – solves global cluster locking for you… It was more than 10 years of journey…

    Reply
    • Matt
      Matt says:

      A shared nothing approach is the only thing that does not wake a DBA up at night. Shared resources made sense when fast and reliable data storage was expensive, but now local storage cost & capacity advances have made all of that overhead, complexity budget, and shared failure modes a liability. A bad SAN firmware update can take down an entire SAN. I would not put anything on a SAN without file system or application level checksums.

      The only proper implementation is redundant paths from end point to end point, with nothing shared in-between. For example (with a different technology), if you want a reliable TCP connection, consider dual homing everything with MultiPath TCP.

      Reply
  2. Jim Nasby
    Jim Nasby says:

    “So BDR and multi-master isn’t supposed to be a magic bullet, its an option you can take advantage of for carefully designed applications.”

    We need to make “there’s no such thing as a magic bullet” t-shirts.

    An interesting look into running into the laws of physics is the Google Spanner whitepaper (http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/spanner-osdi2012.pdf). Long-story short, they installed multiple GPS receivers and actual atomic clocks in a bunch of their data centers to try and build a global distributed multi-master system… that apparently maxes out at several thousand TPS. I know 2nd Quadrant and others have pushed Postgres well past 10x that level.

    To be fair, Google is running their largest money maker (AdSense) on Spanner, so clearly it works well enough for what they need. What I find most interesting about it is TrueTime, a time data type that accounts for actual clock drift. I wrote up about it at http://bluetreble.com/2015/10/time-travel/.

    Reply

Leave a Reply

Want to join the discussion?
Feel free to contribute!

Leave a Reply

Your email address will not be published. Required fields are marked *