Automated rapid switchover with a BDR database cluster in Kubernetes

Discover how BDR and Kubernetes allow you to achieve very high yearly uptime for a database solution thanks to the fast failover capability. Watch the demo!

In my previous article we went through the deployment of a BDR database in a Kubernetes cluster using our Cloud Native BDR Operator, focusing in particular on declarative configuration and multi-master capability. We demonstrated how to create a 3-node BDR group in a few dozen seconds, and showcased DDL replication as well as writes on multiple nodes.
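As a quick reminder of what declarative configuration looks like, here is a minimal sketch of a custom resource describing a 3-node BDR group. The apiVersion, kind and field names below are hypothetical placeholders, not the operator's actual schema; please refer to the previous article and to the operator documentation for the real definitions.

    # Hypothetical manifest: apiVersion, kind and field names are illustrative only
    apiVersion: bdr.example.com/v1alpha1
    kind: BDRGroup
    metadata:
      name: bdr-demo
    spec:
      instances: 3              # desired number of BDR masters
      architecture: WriteAnywhere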

This post is about high availability, understood as one of the fundamental components of a system’s business continuity. High availability is about restoring a service after a failure in the shortest possible time, and is usually measured in terms of recovery time objective (RTO). Thanks to its self-healing capabilities, Kubernetes is designed for, and well suited to, keeping a given service highly available.

High availability with PostgreSQL

Normally in the database sector, high availability is associated with the concept of read-only or passive replicas. For example, with a single-primary database system like PostgreSQL, we rely on standby servers, which are usually kept synchronised through physical streaming replication or Write-Ahead Log (WAL) shipping. In case of failure, one of the standby servers is selected and promoted to primary. Technically, this operation requires the standby to exit recovery mode and start serving write operations, so it might not be immediate. Consider, for example, a PostgreSQL cluster under a very high workload, where standby servers naturally lag behind when replaying the REDO log data contained in WAL files.
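For reference, promotion in PostgreSQL is an explicit operation on the standby. A minimal sketch, assuming PostgreSQL 12 or later and a data directory of /var/lib/postgresql/data (hypothetical path):

    # Promote a standby to primary: it must exit recovery before it can accept writes
    pg_ctl promote -D /var/lib/postgresql/data

    # Or, from a superuser connection on the standby (PostgreSQL 12 and later):
    # SELECT pg_promote();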

It is also important that cluster manager systems (such as repmgr or Patroni) ensure that the old primary server is really down and that applications are correctly handled during this process, which is commonly known as failover. Failover requires proper monitoring and can be fully automated (including failure detection), automated with a manual trigger, or entirely manual.
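As an example of a manually triggered switchover, both tools expose a single command for it; the Patroni cluster name below is a hypothetical placeholder:

    # repmgr: promote this standby and make the remaining standbys follow it
    repmgr standby switchover --siblings-follow

    # Patroni: controlled switchover in a cluster named "pg-cluster"
    patronictl switchover pg-cluster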

Even though implementation details may vary, failover procedures in single-primary database management systems like PostgreSQL require a transition from recovery mode to primary (promotion) that is not always immediate or deterministic.

High availability with BDR and Kubernetes

A technology like BDR, on the other hand, allows us to implement multi-master architectures. Consider for example the WriteAnywhere architecture, available with our operator for Kubernetes, in which a BDR group has 3 or more masters and exposes two services to your Cloud Native applications: one that routes to any available node, and one that routes to the currently selected “lead master”.

As mentioned earlier, Kubernetes provides an entire framework specifically designed for high availability. We have programmed our operator to integrate with the Kubernetes API and to react properly to voluntary and involuntary disruptions that involve a BDR group (self-healing).

For example, if the Pod of any BDR node fails, Kubernetes removes the corresponding endpoint from all services so that applications will not use it. Moreover, if the lead master fails, Kubernetes immediately transfers the lead master role to the next available BDR node.
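You can observe this behaviour directly through the Kubernetes API. A sketch, assuming the BDR group is called bdr-demo and the operator exposes services named bdr-demo-any and bdr-demo-lead-master (hypothetical names):

    # The endpoint of a failed Pod disappears from the services used by applications
    kubectl get endpoints bdr-demo-any bdr-demo-lead-master

    # Watch the Pods of the BDR group while a disruption is in progress
    kubectl get pods -l app=bdr-demo --watch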

The added value of BDR is that, as you have probably noticed, there is no promotion involved: all nodes are always active and accepting write operations. The switch is practically instantaneous, a matter of milliseconds when a specific Pod is disrupted. This is why with BDR we use the term fast failover, or even rapid switchover, to emphasise that the operation is primarily a switch without promotion.

It is worth noting that our CI/CD pipeline runs several end-to-end tests for the operator, one of which systematically measures fast failover performance: if, after a Pod is killed, failover does not complete in less than a second, the test fails.

Of course, depending on how you have configured your liveness and readiness probes, as well as the timeouts in your Kubernetes cluster, other disruptions such as worker node failures might have slightly higher recovery times.
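These thresholds are standard Kubernetes settings on the container specification. A minimal sketch, with illustrative values rather than our operator’s defaults:

    # Illustrative probe settings on a PostgreSQL container; tune them to your RTO target
    livenessProbe:
      exec:
        command: ["pg_isready", "-U", "postgres"]
      periodSeconds: 10
      failureThreshold: 3      # roughly 30 seconds before the kubelet restarts the container
    readinessProbe:
      exec:
        command: ["pg_isready", "-U", "postgres"]
      periodSeconds: 5
      failureThreshold: 1      # removed from the service endpoints after a single failure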

Demonstration video

In the following “Hard to kill” video I go through a demonstration of the self-healing capabilities of a BDR database in Kubernetes, measuring the high availability of the cluster from the application’s point of view.

I will use “kind” (Kubernetes IN Docker) on my laptop.
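If you want to follow along, such a local cluster can be created with a single command (the cluster name is arbitrary):

    # Create a local Kubernetes cluster running inside Docker containers
    kind create cluster --name bdr-demo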

The failure we have selected is a common one in the database world: a problem with a persistent volume, which becomes unusable.

Our BDR operator allows us to annotate a PVC as unusable, and we will use this technique to simulate this kind of failure in our test. Specifically, we will simulate the problem on the pod where the lead master is running, and then delete that pod.
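A sketch of these two steps, assuming the lead master runs in a Pod called bdr-demo-1 with a PVC of the same name; the annotation key is a hypothetical placeholder for the one our operator actually recognises:

    # Mark the lead master's volume as unusable (annotation key is illustrative only)
    kubectl annotate pvc bdr-demo-1 example.com/unusable=true

    # Delete the Pod so that Kubernetes and the operator can react to the failure
    kubectl delete pod bdr-demo-1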

Kubernetes and our BDR operator should detect the issue and react by routing the lead-master service to the next BDR node, parting the deleted node from the cluster, and creating a new pod to restore the desired state of 3 masters. In short, they perform a fast failover or, if you prefer, a rapid switchover.

The HTTP load generator should be transparently redirected to the new lead-master.

As a final step, we will execute a query in the database that reports the largest gap between consecutive records in the table, based on their timestamps. This is a pessimistic estimate of the downtime experienced by the frontend application.
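A sketch of such a query, assuming the load generator inserts one row per write into a table called tx_log with an inserted_at timestamp column (both names are hypothetical):

    -- Largest gap between consecutive records: a pessimistic estimate of the downtime
    SELECT max(inserted_at - previous_at) AS max_gap
    FROM (
      SELECT inserted_at,
             lag(inserted_at) OVER (ORDER BY inserted_at) AS previous_at
      FROM tx_log
    ) AS gaps;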

We will also make sure that the self-healing process completes with the restoration of the desired configuration of 3 masters, showing how Kubernetes promptly detects the change of status and reacts by correctly updating the service used by the applications.

Conclusions

Cloud Native BDR is available to 2ndQuadrant customers through our 24/7 Production support service. Enquire about Cloud Native BDR Quickstart now to get access to binaries, training and consulting.
