Thoughts on Uber’s List of Postgres Limitations
An Uber technical blog of July 2016 described the perception of “many Postgres limitations”. Regrettably, a number of important technical points are either not correct or not wholly correct because they overlook many optimizations in PostgreSQL that were added specifically to address the cases discussed. In most cases, those limitations were actually true in the distant past of 5-10 years ago, so that leaves us with the impression of comparing MySQL as it is now with PostgreSQL as it was a decade ago. This is no doubt because the post was actually written some time/years? ago and only recently published.
This document looks in detail at those points to ensure we have detailed information available for a wider audience, so nobody is confused by PostgreSQL’s capabilities.
Having said that, I very much welcome the raising of those points and also wish to show that the PostgreSQL project and 2ndQuadrant are responsive to feedback. To do this, detailed follow-ups are noted for immediate action.
These points were noted in the blog
- Poor replica MVCC support
- Inefficient architecture for writes
- Inefficient data replication
- Difficulty upgrading to newer releases
Poor replica MVCC support
“If a streaming replica has an open transaction, updates to the database are blocked if they affect rows held open by the transaction. In this situation, Postgres pauses the WAL application thread until the transaction has ended.”
This is true, though misses the point that a parameter exists to control that behaviour, so that when
hot_standby_feedback = on the described behaviour does not occur in normal circumstances. This is supported from PostgreSQL 9.1 (2011) and above. If you’re not using it, please consider doing so.
Later, this comment leads to the conclusion
“Postgres replicas … can’t implement MVCC” which is wholly incorrect and a major misunderstanding. PostgreSQL replicas certainly allow access to data with full MVCC semantics.
Inefficient architecture for writes
“If old transactions need to reference a row for the purposes of MVCC MySQL copies the old row into a special area called the rollback segment.”
“This design also makes vacuuming and compaction more efficient. All of the rows that are eligible to be vacuumed are available directly in the rollback segment. By comparison, the Postgres autovacuum process has to do full table scans to identify deleted rows.”
Moving old rows to a rollback segment adds time to the write path for UPDATEs, but that point isn’t mentioned. PostgreSQL is more efficient architecture for writes in relation to MVCC because it doesn’t need to do as many push-ups.
Later, if the workload requires that we access old rows from the rollback segment that is also more expensive. That is not always needed, yet it is very common for longer running queries to need to access older data. However, if all transactions are roughly the same short duration access to the rollback segment is seldom needed, which just happens to make benchmark results appear good while real-world applications suffer.
By contrast, PostgreSQL has multiple optimizations that improve vacuuming and compaction. First, an optimization called HOT improves vacuuming in heavily updated parts of a table (since 2007), while the visibility map ensures that VACUUM can avoid full table scans (since 2008).
Whether rollback segments help or hinder an application depend on the specific use case and it’s much more complex than this first appears.
Next, we discuss indexes…
“With Postgres, the primary index and secondary indexes all point directly to the on-disk tuple offsets.”
This point is correct; PostgreSQL indexes currently use a direct pointer between the index entry and the heap tuple version. InnoDB secondary indexes are “indirect indexes” in that they do not refer to the heap tuple version directly, they contain the value of the Primary Key (PK) of the tuple.
Comparing direct and indirect indexes we see:
- direct indexes have links that go index → heap
- indirect indexes have links that go index → PK index → heap
Indirect indexes store the PK values of the rows they index, so if the PK columns are wide or contain multiple columns the index will use significantly more disk space than a direct index, making them even less efficient for both read and write (as stated in MySQL docs). Also indirect indexes have index search time >=2 times worse than direct indexes, which slows down both reads (SELECTs) and searched writes (UPDATEs and DELETEs).
Performance that is >=100% slower is understated as just a “slight disadvantage” [of MySQL].
“When a tuple location changes, all indexes must be updated.”
This is misleading, since it ignores the important Heap Only Tuple (HOT) optimization that was introduced in PostgreSQL 8.3 in 2007. The HOT optimization means that in the common case, a new row version does not require any new index entries, a point which effectively nullifies the various conclusions that are drawn from it regarding both inefficiency of writes and inefficiency of the replication protocol.
“However, these indexes still must be updated with the creation of a new row tuple in the database for the row record. For tables with a large number of secondary indexes, these superfluous steps can cause enormous inefficiencies.”
As a result of ignoring the HOT optimization this description appears to discuss the common case, rather than the uncommon case. It is currently true that for direct indexes if any one of the indexed columns change then new index pointers are required for all indexes. It seems possible for PostgreSQL to optimize this further and I’ve come up with various designs and will be looking to implement this best fairly soon.
Although they have a higher read overhead, indirect indexes have the useful property that if a table has multiple secondary indexes then an update of one secondary index does not affect the other secondary indexes if their column values remain unchanged. This makes indirect indexes useful only for the case where an application needs indexes that would be infrequently used for read, yet with a high update rate that does not touch those columns.
Thus, it is possible to construct cases in which PostgreSQL consistently beats InnoDB, or vice versa. In the “common case” PostgreSQL beats InnoDB on reads and is roughly equal on writes for btree access. What we should note is that PostgreSQL has the widest selection of index types of any database system and this is an area of strength, not weakness.
The current architecture of PostgreSQL is that all index types are “direct”, whereas in InnoDB primary indexes are “direct” and secondary indexes “indirect”. There is no inherent architectural limitation that prevents PostgreSQL from also using indirect indexes, though it is true that has not been added yet.
We’ve done a short feasibility study and it appears straightforward to implement indirect indexes for PostgreSQL, as an option at create index time. We will pursue this if the HOT optimizations discussed above aren’t as useful or possible, giving us a second approach for further optimization. Additional index optimizations have also been suggested.
Inefficient data replication
“However, the verbosity of the Postgres replication protocol can still cause an overwhelming amount of data for a database that uses a lot of indexes.”
Again, these comments discuss MySQL replication which can be characterized as Logical Replication. PostgreSQL provides both physical and logical replication. All of the benefits discussed for MySQL replication are shared by PostgreSQL’s logical replication. There are also benefits for physical replication in many cases, which is why PostgreSQL provides both logical and physical replication as options.
PostgreSQL physical replication protocol itself is not verbose – this comment is roughly the same as the “inefficient writes” discussion: if PostgreSQL optimizes away index updates then they do not generate any entries in the transaction log (WAL), so there is no inefficiency. Also, the comment doesn’t actually say what we mean by “overwhelming”. What this discussion doesn’t consider is the performance of replication apply. Physical replication is faster than logical replication because including the index pointers in the replication stream allows us to insert them directly into the index, rather than needing to search the index for the right point for insertion. Including the index pointers actually increases not decreases performance, even though the replication bandwidth requirement is higher.
PostgreSQL Logical Replication is available via 2ndQuadrant’s pglogical and will be available in PostgreSQL 10.0 in core.
(MySQL) “Statement-based replication is usually the most compact but can require replicas to apply expensive statements to update small amounts of data. On the other hand, row-based replication, akin to the Postgres WAL replication, is more verbose but results in more predictable and efficient updates on the replicas.”
Yes, statement-based replication is more efficient in terms of bandwidth, but even less efficient in terms of the performance of applying changes to receiving servers. Most importantly, it leads to various problems and in various cases replication may not work as expected, involving developers in diagnosing operational problems. PostgreSQL probably won’t adopt statement-based replication.
Difficulty upgrading to newer releases
“The basic design of the on-disk representation in 9.2 hasn’t changed significantly since at least the Postgres 8.3 release (now nearly 10 years old).”
This is described as if it were a bad thing, but actually it’s a good thing and is what allows major version upgrades to occur quickly without unloading and reloading data.
“We started out with Postgres 9.1 and successfully completed the upgrade process to move to Postgres 9.2. However, the process took so many hours that we couldn’t afford to do the process again. By the time Postgres 9.3 came out, Uber’s growth increased our dataset substantially, so the upgrade would have been even lengthier.”
-k option provides an easy and effective upgrade mechanism. Pg_upgrade does require some downtime, which is why 2ndQuadrant has been actively writing logical replication for some years, focusing on zero-downtime upgrade.
Although the logical replication upgrade is only currently available from 9.4 to 9.5, 9.4 to 9.6 and 9.5 to 9.6, there is more good news coming. 2ndQuadrant is working on highly efficient upgrades from earlier major releases, starting with 9.1 → 9.5/9.6. When PostgreSQL 9.1 is desupported later in 2016 this will allow people using 9.1 to upgrade to the latest versions. This is available as a private service, so if you need zero-downtime upgrade from 9.1 upwards please get in touch.
In 2017, upgrades from 9.2 and 9.3 will also be supported, allowing everybody to upgrade efficiently with zero-downtime prior to the de-supporting of those versions.
> Pg_upgrade does require some downtime, which is why 2ndQuadrant has been actively writing logical replication for some years, focusing on zero-downtime upgrade.
If I have, for example, 5 standby servers for my production load, is it still possibile to logical upgrade all of it without downtime?
That’s what we’re aiming towards, yes. We’re hoping all of this will be in-core for Postgres 10.x
to say that “pg_upgrade” require some downtime is a bit of understatement.. specially if you DB cluster is sprawled all over the world..
we successfully used pglogical (instead of slony), and it was much easier to use, to migrate our hot standby streaming cluster from 9.4.4 -> 9.5.2. This is for our 90gig db across 4 continents, 6 direct slaves and 3 slave of slaves with a clocked downtime for read on slaves of 3 seconds each done during the none peak hours for the specific slave, and 7 seconds on the master for writes.
on a different note, to migrating from PostgreSQL -> mysql boggles the mind but maybe as you hinted it was old evaluation and just got published now for some reason…
Good writeup Simon!
While Uber surprisingly forgot to mention the HOT optimisation, it’s not clear at all that HOT is the common case.
AFAIU as soon as an update touches an indexed column, the HOT optimisation cannot be used. And as it seems that Uber indexes a lot of columns (they talk about having a table with a dozen indexes), they might rarely benefit from HOT.
Quoting src/backend/access/heap/README.HOT: HOT solves this problem for a restricted but useful special case: where a tuple is repeatedly updated in ways that do not change its indexed columns. (Here, “indexed column” means any column referenced at all in an index definition, including for example columns that are tested in a partial-index predicate but are not stored in the index.)
It is the common case in general. But not for every table and all workloads, I agree, but then the indexes are added by you.
Knowing about HOT allows you to choose how and when you add indexes, balancing the need to minimize the write workload at the same time as optimising reads.
The number of indexes is irrelevant, this is about whether you choose to index column(s) that change with sufficient frequency that you could call that the common case.
For example, rather than indexing the exact position of each car every 1ms, it is likely sufficient to index the position of the car to within 1 mile every 1 minute. UPDATEing every 1ms would create a write workload 60,000 times higher, so that should be avoided. And UPDATEing the position to within 1 mile will likely mean most updates would not change the indexed values.
Interesting point Simon,
Do you mean there are benefits of decomposing the position into 2 fields: km + m, so that km is not updated that frequently? Do we then index only km, so that m column is filtered as additional step?
Unfortunately, hot_standby_feedback still does not work well if the replica is constantly running overlapping long-running queries that access heavily updated tables. The replication will pause so much that the lag will grow indefinitely, and/or the master will become incredibly bloated. Physical replication is good for OLTP read-replicas, proper standbys and some carefully designed batch job use cases. It is not good for using a standby as a busy multi-purpose concurrently accessed analytics database.
Hi. I don’t think it is entirely correct to claim “>=2 times”. In fact I think in practice clustered pk indirection is just as fast as direct. Here is why.
It is very common for primary keys to be something short and simple, just 8 byte integer. Arguably, most common.
So on well tuned system all or most of inner nodes of pk btree will easily fit into buffer cache. I.e. for 16 kbyte leaf nodes, and total data size of 1Tb, if we assume 32 bytes per inner btree node entry (possibly optimistic; and also ignoring up to 50% of space wastage in btree nodes) we need only 2 gbytes of ram to fit all inner nodes.
If we assume that inner nodes of pk btree are in ram, then fetching record by it’s primary key (very common operation, regardless of how secondary indexes work) is just one seek (fetching leaf btree node with record). About same as with direct indexes.
On the other hand, if record is in ram too, then higher cpu cost of traversing btree might be (and likely is) more visible. So in this case direct indexes may win by much.
Here I am assuming that innodb’s leaf btree nodes of primary key index actually contain records and not just their offsets. I don’t know what they really do, but I assume they’re doing it right.
Concerning “write inneficiency”, whatever you may argue that their design was at fault, HOT optimisation was useless in their case as the code source quoted by vdp explained.
And in other DBMS like Mysql or Oracle, with that same DB design, they would not have paid such a penalty.
I think PG – which is no doubt a very good DBMS – should take this Uber move with an open mind and better see it as an opportunity for improvements.
In mysql table is stored as index on Primary Key columns like index organize table in Oracle. And any reads on secondary indexes will have to fetch other table columns from that primary index in just one fetch. There is no table heap there seperately as mentioned in the blog.
Specifically, InnoDB uses index-organised tables.