Parallel Aggregate – Getting the most out of your CPUs
A small peek into the future of what should be arriving for PostgreSQL 9.6.
Today PostgreSQL took a big step ahead in the data warehouse world and we now are able to perform aggregation in parallel using multiple worker processes! This is great news for those of you who are running large aggregate queries over 10’s of millions or even billions of records, as the workload can now be divided up and shared between worker processes seamlessly.
We performed some tests on a 4 CPU 64 core server with 256GB of RAM using TPC-H @ 100 GB scale on query 1. This query performs some complex aggregation on just over 600 million records and produces 4 output rows.
The base time for this query without parallel aggregates (max_parallel_degree = 0) is 1375 seconds. If we add a single worker (max_parallel_degree = 1) then the time comes down to 693 seconds, which is just 6 seconds off being twice as fast! So quite close to linear scaling. If we take the worker count up to 10 (max_parallel_degree=10), then the time comes down to 131 seconds, which is once again just 6 seconds off perfect linear scaling!
The chart below helps to paint a picture of this. The blue line is the time in seconds. Remember that the time axis is on a logarithmic scale, (the performance increase is a little too much to see the detail at higher worker counts otherwise)
You can see that even with 30 worker processes we’re still just 20% off of the linear scale. Here the query runs in 56 seconds, which is almost 25 times faster than the non-parallel run.
This really makes the existing parallel seqscan and parallel join changes really shine. Without parallel seqscan, parallel aggregation wouldn’t have been possible, and without parallel join parallel aggregate would only be possible on queries that didn’t contain any joins.
More work is still to be done to parallel-enable some aggregate functions, but today’s commit is a great step forward.
Please note that since writing this blog post the max_parallel_degree setting has been renamed to max_parallel_workers_per_gather. It’s also possible to now experiment with various worker counts by changing the table’s “parallel_workers” option with: ALTER TABLE name SET (parallel_workers = N); where N is the required number of workers. You should also ensure that the server’s max_worker_processes GUC setting is set high enough to accommodate the number of workers that you require. The min_parallel_relation_size, parallel_setup_cost and parallel_tuple_cost settings may also be of some interest. Please see http://www.postgresql.org/docs/9.6/static/runtime-config-query.html for more details.
Awesome test. Can you say what actual hardware (4 cpu 64 core) you used? Thanks.
Hi Paul, the processors on that machine are Xeon E5-4620
First big thank’s for implementing parallel aggr for PG.
I try to test Q1 query from TPC-H (50GB scale factor, 300M rows of LINEITEM table) on PG 9.6beta1 and get next results :
302171.230 ms for max_parallel_degree = 0
175506.106 ms for max_parallel_degree = 4
144931.588 ms for max_parallel_degree = 8
Changes made to postgresql.conf :
shared_buffers = 8000MB
work_mem = 2000MB
max_parallel_degree = 16
enable_seqscan = on
Test performed on SLES11 SP3 virtual machine with 16VCPU
Can you please advise some tuning for parallel work in this case ?
Thanks for the comment.
Remember the max_parallel_degree is “maximum”, the size of the relation will control how many workers the planner requests the executor to use. Also remember that the number of workers that the planner request the executor might not have available at the time the query executes. If you run EXPLAIN ANALYZE on the query it should show you how many workers are requested and how many are used.
However, I’d say most likely is that I/O requests cannot be served fast enough to feed the CPUs with the lineitem heap pages they need. When I ran my test I ran it on a machine with 256GB of RAM, and all the buffers were either in kernel page cache or in PostgreSQL’s buffer cache, so it wouldn’t have needed to read any pages from disk.
I’d say if you have less than 64GB of memory on this machine then that might be the reason. How does it perform on Q1 @ 10GB scale?
If the planner is not requesting as many workers as you’d like, then you can tweak this manually with;
alter table lineitem set (parallel_degree=16);
I’m interested in hearing your updated results on this.
Thank you for explanation.
You rigth, it is some disk I/O bottleneck’s on my VM
Unfortunatelly for some reason’s can’t cache full LINEITEM table (44GB) on 60GB RAM VM
(each Q1 run is reading from disk, but filesystem cache is 45GB populated)
Ok. I test Q1 (no filter predicates, scan of full table) for rows subset of 45M rows from LINEITEM table (data for 1994 year, 6.5GB in size) and after it is cached get next results :
86235.123 ms for max_parallel_degree = 0
22301.170 ms for max_parallel_degree = 4
12711.368 ms for max_parallel_degree = 8
Next I see what increasing max_parallel_degree more than 8 is not increasing number of real parallel workers.
As you advise I change table default parallelism as :
alter table lineitem1994 set (parallel_degree=16);
But EXPLAIN ANALYZE for Q1 show what :
Workers Planned: 16
Workers Launched: 8
So,no more than 8 workers again.
Remember that the executor will only start at the most max_worker_processes. Setting the relations parallel_degree higher than this does not make sense.
It looks like it’s scaling better with a smaller table. Seems that 4 workers is about 77% efficient compared to no parallelism, and 8 workers 75%. This not really very close to what I got, so perhaps it’s worth watching top to see if the workers are fully utilising the CPU. If they’re not then you have a bottleneck somewhere else.
i have tabel and record 1.700.000.000,
i’m test ruuning in sqlserver need time 17s
and in postgresql need time 1m.
why ? help me.
max_connections = 100
shared_buffers = 512MB (Max Windows i’m using Windows)
effective_cache_size = 3072GB
work_mem = 1048576kB
maintenance_work_mem = 1GB
min_wal_size = 4GB
max_wal_size = 8GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 500
max_parallel_degree = 16
max_worker_processes = 64
i’m using explain analyze
Workers Planned: 8
Workers Launched: 8
It would be difficult to say without more information. The most useful thing you could send would be the EXPLAIN ANALYZE VERBOSE for the query in question.
Here’s a list of a few things which you might want to consider which might explain the difference:
1. PostgreSQL might not be fully utilising all of the 8 CPUs. You could check this in Task Manager. This might be easier done on an otherwise idle server. If all 8 processors are not fully utilised then most likely there’s a bottleneck elsewhere, most likely I/O requests can’t be served to PostgreSQL quickly enough.
2. The table in SQL Server may be smaller. I’m not sure of what sort of table that you’re using in SQL Server, but I believe normal tables have a 9 byte row header, which is quite a bit smaller than PostgreSQL’s 24 bytes. The additional bytes in PostgreSQL can be mostly accounted for by PG’s MVCC implementation which is largely suited for OLTP workloads rather than OLAP type workloads like the type you’re doing. If the SQL Server table is smaller then it could be possible that the SQL Server table sits in memory, and part of the PostgreSQL tables are being read from disk.
3. The query plan could be quite different between the 2 databases. Even for a query aggregating data on a single table (no joins) there’s still a choice of how the aggregation takes place. If there are joins then these might be different types, or SQL Server may manage to push down the aggregate below the join, which PostgreSQL currently cannot do.
4. SQL Server could simply be using more worker processes than PostgreSQL.
There’s many many other possibilities. Post the EXPLAIN ANALYZE VERBOSE, and you can also compare that to SQL Servers plan output and see if it look like it could be executed more efficiently.
“PostgreSQL 9.6beta3, compiled by Visual C++ build 1800, 64-bit”
I get an error when running pg.
E:\PG\bin>LOG: unrecognized configuration parameter “max_parallel_degree” in file “e:/pg/data/postgresql.conf” line 653
FATAL: configuration file “e:/pg/data/postgresql.conf” contains errors
Hi, This setting has changed name since this blog post was written. I’ve added a small update to it at the bottom to answer your question.