Performance of Sequences and Serials in Postgres-XL
In Postgres-XL, sequences are maintained at the Global Transaction Manager (GTM) to ensure that they are assigned non-conflicting values when they are incremented from multiple nodes. This adds significant overhead for a query doing thousands of INSERTs in a table with a serial column, incrementing sequence one at a time and making a network roundtrip to the GTM, for every INSERT.
Shaun Thomas in a recent blog complained about INSERTs running a magnitude slower on Postgres-XL as compared to vanilla PostgreSQL. There is already a way to improve performance for sequences, but it’s clearly not well advertised. I thought this is a good opportunity to explain the facility.
Postgres-XL provides a user-settable GUC called sequence_range. Every backend requests a block of sequence values as controlled by this GUC. Given that COPY is popularly used to bulk load data in Postgres, Postgres-XL automatically overrides this GUC during COPY operation and sets it to 1000, thus dramatically improving COPY performance. Unfortunately, for regular INSERTs, the default is 1 and unless user explicitly sets sequence_range to a reasonably higher value, INSERT performance suffers. Here is an example, using the same sample schema as used by Shaun in his blog post.
CREATE TABLE sensor_log ( sensor_log_id SERIAL PRIMARY KEY, location VARCHAR NOT NULL, reading BIGINT NOT NULL, reading_date TIMESTAMP NOT NULL ) DISTRIBUTE BY HASH (sensor_log_id); postgres=# \timing Timing is on. postgres=# INSERT INTO sensor_log (location, reading, reading_date) SELECT s.id % 1000, s.id % 100, now() - (s.id || 's')::INTERVAL FROM generate_series(1, 40000) s(id); INSERT 0 40000 Time: 12067.911 ms postgres=# set sequence_range TO 1000; SET Time: 1.231 ms postgres=# INSERT INTO sensor_log (location, reading, reading_date) SELECT s.id % 1000, s.id % 100, now() - (s.id || 's')::INTERVAL FROM generate_series(1, 40000) s(id); INSERT 0 40000 Time: 397.406 ms
So by appropriately setting sequence_range to 1000, performance of the INSERT query improved by nearly 30 times.
When this feature was added, the default value of sequence_range GUC was set to 1 because it can leave holes in the sequence values. But looking at the performance implications for a very common use case, we decided to increase the default to 1000 and this has now been committed to the XL9_5_STABLE branch of the repository.
Its important to note that while a high value of sequence_range will improve performance for sequences and serials, it can also leave large holes in sequence ranges since the sequence ranges are cached at a backend level. To address this issue, Postgres-XL starts with the specified CACHE parameter value used at sequence creation time and doubles it every time (limited by sequence_range) if sequences are being consumed at a very high rate.
Similar improvement can also be achieved by increasing the CACHE parameter value of the sequence so that a chunk of sequence values are cached at the backend level. Following example shows how to do that for a serial column. But the sequence_range GUC provides an easy way to override the global default and also ensures that the sequences are cached only when they are getting incremented very rapidly.
postgres=# ALTER SEQUENCE sensor_log_sensor_log_id_seq CACHE 1000; ALTER SEQUENCE Time: 8.683 ms postgres=# SET sequence_range TO 1; SET Time: 2.341 ms postgres=# INSERT INTO sensor_log (location, reading, reading_date) SELECT s.id % 1000, s.id % 100, now() - (s.id || 's')::INTERVAL FROM generate_series(1, 40000) s(id); INSERT 0 40000 Time: 418.068 ms
You may choose any of these techniques to improve the performance. Though now that the default value of sequence_range is changed to 1000, not many users may see the difference in performance.
Is there a reason to choose this technique over UUIDs? I ask because other products with the same requirement to independently generate non-colliding IDs (ex. MongoDB) use UUIDs instead of SERIAL (or equivalent) + logic-to-avoid-collisions. I have recommended UUIDs to customers, but have never done any testing, so I am wondering what your test results were. Thanks!
Since Postgres-XL has a notion of a centralised GTM, it was far easier to leverage that for this purpose. This also allows us to stay as close to PostgreSQL as possible with respect to sequence semantics. But yeah, we had thought about this possibility of partitioning the namespace to avoid collision. Another challenge of course is that XL supports extendable clusters. So when new nodes are added to the cluster, we would need to remap the namespace. Using GTM for tracking sequences helped solve some of these problems.
How about using multiple GTM server (or multithread)?
If we use 3 GTM, the sequences would be like this:
This would increase performance as we need no coordination between sequences.
This technique used at galera cluster.
What about something similar to this?
With the shard being some ID that postgres-xl assigns?
Not bad! Good to know a few ways around this. 🙂
Thanks Shaun! With these tweaks, I hope you will find performance much better next time you try 🙂