Postgres-XL: Dealing with read-only transactions
One of the problems with Postgres-XL 9.2 is that it assigns a global transaction identifier (GXID) to every transaction started in the cluster. Since it's hard for the system to know up front whether a transaction is read-only (remember, even SELECT queries can perform write activity), Postgres-XL would always assign a GXID and send it to the datanodes. This hurts system performance because every read-only transaction must now go to the GTM, start a new transaction, grab a GXID, and also finish the transaction on the GTM when it's committed or aborted. For short transactions, such as a select-only pgbench workload (pgbench -S), this adds a lot of overhead and latency. The overhead is even more noticeable when the actual work done by the transaction is very fast, say because the data is fully cached in shared buffers.
This is now set to change in the upcoming Postgres-XL 9.5 release. We have implemented a lazy GXID assignment technique similar to what PostgreSQL does. This required us to enhance the concept of a global session. A transaction which involves more than one node runs in a global session and gets a unique global session identifier. If a node, while executing a query, decides to assign a GXID to the transaction, it sends the global session identifier to the GTM. If the global transaction identified by that session has already been assigned a GXID, the GTM sends back the same GXID; otherwise a new GXID is assigned to the transaction. The node then sends the assigned GXID back to the coordinator so that it can decide whether to run a single-phase or two-phase commit for the transaction.
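To make the idea concrete, here is a minimal C sketch of the GTM-side bookkeeping. The names (GlobalSession, gtm_get_gxid, next_gxid) are hypothetical, not the actual Postgres-XL data structures; the point is only that every node working in the same global session receives the same GXID, assigned on first request:

```c
#include <stdint.h>
#include <stdio.h>

#define INVALID_GXID 0

/* Illustrative stand-in for the GTM's per-session state. */
typedef struct GlobalSession
{
    const char *session_id;   /* unique global session identifier */
    uint32_t    gxid;         /* INVALID_GXID until first assignment */
} GlobalSession;

/*
 * Return the GXID for a global session, assigning a fresh one from the
 * GTM's counter on the first request.  A second node executing in the
 * same global session gets the GXID already handed out, so the whole
 * distributed transaction shares a single GXID.
 */
static uint32_t
gtm_get_gxid(GlobalSession *session, uint32_t *next_gxid)
{
    if (session->gxid == INVALID_GXID)
        session->gxid = (*next_gxid)++;
    return session->gxid;
}

int
main(void)
{
    uint32_t      next_gxid = 1000;
    GlobalSession s = { "coord1.sess7", INVALID_GXID };

    /* Two datanodes of the same global session ask for a GXID... */
    printf("node A gets %u\n", gtm_get_gxid(&s, &next_gxid));
    printf("node B gets %u\n", gtm_get_gxid(&s, &next_gxid));   /* same value */
    return 0;
}
```

And, presumably, a transaction that never triggers an assignment on any node finishes without ever consuming a GXID, which is exactly the saving for read-only work.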
While lazy GXID assignment resulted in a significant reduction in the GXID consumption rate, it's still not enough, because there is yet another problem we must solve. As you're aware, Postgres-XL uses Multi-Version Concurrency Control (MVCC) to provide various transaction guarantees. But to reclaim dead row versions, every node must check row versions against a value called RecentGlobalXmin which, in simple terms, is the cutoff below which no running transaction in the system can still see a dead row version, so such versions can be safely removed. To correctly compute this value, the GTM must be aware of the snapshot used by every running transaction in the cluster, even read-only transactions. That implies that every transaction must still report its start and end to the GTM, along with the oldest transaction it's still interested in (known as the snapshot xmin). If we wanted any significant improvement for short, read-only transaction workloads, we had to address this problem too.
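As a rough illustration of why this cutoff matters (using a made-up helper, not the actual PostgreSQL visibility routines, and ignoring XID wraparound), the reclaim decision looks like this:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define INVALID_XID 0

/*
 * Simplified reclaim check, ignoring XID wraparound: a dead row version
 * whose deleting transaction (xmax) is older than RecentGlobalXmin can
 * no longer be seen by any running snapshot, so it may be removed.  If
 * the GTM computed RecentGlobalXmin without knowing about some
 * read-only transaction's snapshot, this test could wrongly reclaim a
 * row version that transaction still needs.
 */
static bool
dead_row_reclaimable(uint32_t xmax, uint32_t recent_global_xmin)
{
    return xmax != INVALID_XID && xmax < recent_global_xmin;
}

int
main(void)
{
    /* With RecentGlobalXmin = 990: the deletion at XID 985 is invisible
     * to everyone and reclaimable; the one at 995 may still be seen. */
    printf("%d\n", dead_row_reclaimable(985, 990));   /* 1: reclaimable */
    printf("%d\n", dead_row_reclaimable(995, 990));   /* 0: keep it */
    return 0;
}
```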
So what we now have is a new cluster-monitor process that every datanode and coordinator runs as a postmaster child process. This process is responsible for periodically reporting node-local state to the GTM, which then computes the global state and sends it back to all the nodes. To address the RecentGlobalXmin problem described above, the cluster-monitor sends a RecentGlobalXmin computed from node-local information. The GTM then computes a cluster-wide RecentGlobalXmin which every node must use. The entire algorithm is much more complex than it sounds, because we must account for node failures and disconnections and ensure that the RecentGlobalXmin computation does not stall unnecessarily, since that would prevent any dead row versions from being removed from the system.
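A highly simplified sketch of the GTM-side aggregation follows, again with illustrative names, ignoring XID wraparound, and leaving out the failure-detection machinery that makes the real algorithm hard: the cluster-wide value is the minimum over what the active nodes report, and a failed or disconnected node is excluded so it cannot stall the computation indefinitely:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct NodeReport
{
    bool     active;       /* cleared when the node fails or disconnects */
    uint32_t local_xmin;   /* value reported by the node's cluster-monitor */
} NodeReport;

/*
 * Cluster-wide RecentGlobalXmin: the minimum of the values reported by
 * all active nodes (ignoring XID wraparound for simplicity).  Failed or
 * disconnected nodes are skipped; otherwise a dead node's stale report
 * would hold the value back forever and no dead row version could ever
 * be reclaimed anywhere in the cluster.
 */
static uint32_t
compute_cluster_xmin(const NodeReport *nodes, int nnodes, uint32_t next_gxid)
{
    uint32_t result = next_gxid;   /* upper bound when nothing is running */

    for (int i = 0; i < nnodes; i++)
        if (nodes[i].active && nodes[i].local_xmin < result)
            result = nodes[i].local_xmin;
    return result;
}

int
main(void)
{
    NodeReport nodes[] = {
        { true,  1005 },   /* datanode 1 */
        { true,   998 },   /* datanode 2: holds the oldest snapshot */
        { false,  500 },   /* failed node: must not hold the value back */
    };

    printf("cluster RecentGlobalXmin = %u\n",
           compute_cluster_xmin(nodes, 3, 1010));   /* prints 998 */
    return 0;
}
```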
To summarise, the old way of executing read-only transactions suffered from the following problems, which the new Postgres-XL 9.5 will solve:
- Coordinator to GTM roundtrip to start a transaction and get a GXID
- Coordinator to GTM roundtrip for finishing the transaction
- Extra state and load on the GTM
- Rapid consumption of GXIDs, leading to wraparound-related problems
These features together give us an edge over other MPP databases as far as short, read-only transactions are concerned. Stay tuned for my next blog on OLTP performance benchmark results using these new enhancements.
Great work Pavan and team.
Something that would be very useful in environments that have a handful of small, heavily updated tables (i.e., queue tables) is the ability to block a backend from accessing those tables for the duration of the current transaction. That would allow publishing that information along with the snapshot, allowing vacuuming to be more granular than RecentGlobalXmin.
This case probably isn’t compelling enough to add to core Postgres, but maybe it would be more useful in XL.
Great work Pavan. I was struggling to decipher the details from the commit logs, but now it's clear. BTW, we had some issues with the DBT3 benchmark where the GTM gets shut down randomly ever since I picked up some commits which were done for GTM optimisations. I am doing another round of testing to confirm this. But I'm curious to know whether the current changes make the GTM heavier in memory?