One of the ugly parts of Linux with PostgreSQL is that the OS will happily cache up to around 5% of memory before getting aggressive about writing it out. I’ve just updated a long list of pgbench runs showing how badly that can turn out, even on a server with a modest 16GB of RAM. Note that I am intentionally trying to introduce the bad situation here, so this is not typical performance. The workload that pgbench generates is not representative of any real-world workload, it’s as write-intensive as it’s possible to be.
Check out test set 5, which is running a stock development version PostgreSQL 9.1. Some the pauses where the database is unresponsive during checkpoints, as shown by the max_latency figure there (which is in milliseconds), regularly exceed 40 seconds. And at high client counts you can see >80 seconds of the database completely stalled.
It used to be possible to improve these by tuning Linux’s dirty_ratio and dirty_background ratio parameters. Nowadays system RAM is so large that even the minimum settings possible there are caching way too much. A better UI was introduced in kernel 2.6.29. Now you can set these in bytes instead, which allows much smaller settings, using dirty_bytes and dirty_background_bytes.
You can see what happens when you tune those down by looking at test set #7 [Note that this also includes a work in progress PostgreSQL patch to fix some other bad behavior in this area, introduced in set #6] I set the new dirty_* parameters to 64MB/128MB, trying to be below the 256MB battery-backed cache in the RAID card. The parts that
improved are quite obvious. Now maximum latency on the smaller tests drops to <10 seconds. And the worst one shown is just over 20 seconds now. Both of these are about 1/4 of the worst-case latency shown before I tweaked these parameters.
There is a significant drop in transactions/second, too, but not necessarily an unacceptable one. Across the whole set the averages were scale=500, tps=690 and scale=1000, tps=349 before; now it’s scale=500, tps=611 and scale=1000, tps=301. Batching up work into larger chunks always gives higher transaction rates but lower latency. That part would be a completely reasonable trade-off in a lot of situations: 1/4 the worst-case latency for a 10-15% drop in TPS.
Unfortunately I discovered a bigger downside here too. In between each pgbench test, there is some basic table clean up done to try and make the tests more repeatable. Part of that is doing a VACUUM of the database. I noticed that this test series took way longer to finish than any previous one. The figure you can compare reasonably here is to ask “what’s the minimum time seen to clean the database up?”. That varies a bit from test to test, but the minimums across a larger set are quite consistent. It takes a certain amount of time to do that job, and the fastest such cleanup job is a pretty stable figure here at a given database size.
Here are those figures from the last 3 test sets:
set | scale | min_setup_time
5 | 500 | 00:03:41.220735
5 | 1000 | 00:06:47.818328
6 | 500 | 00:03:33.139611
6 | 1000 | 00:06:41.154474
7 | 500 | 00:06:06.56808
7 | 1000 | 00:10:14.010876
You can see that test sets 5 and 6 both took around 3.5 minutes to cleanup a scale=500 database, and 6.75 minutes for scale=1000. Dialing down the cached memory Linux was keeping around increased those times to 6 minutes and 10 minutes instead. That’s 71% and 48% longer, respectively, than those operations took with a large amount of write caching managed by the kernel.
Now that is a much harder price to pay for decreased latency. Looks like some of the server-side work planned to try and reduce these checkpoint spikes is still completely relevant even on a kernel that has these knobs available.