One of the ugly parts of running PostgreSQL on Linux is that the OS will
happily cache up to around 5% of memory before getting aggressive about
writing it out. I’ve just updated a long list of pgbench runs showing how badly
that can turn out, even on a server with a modest 16GB of RAM. Note that I am intentionally trying to provoke this bad situation, so this is not typical performance. The workload pgbench generates isn’t representative of any real-world application; it’s about as write-intensive as a workload can possibly be.
Check out test set 5, which is running a stock development version of PostgreSQL 9.1. Some of the
pauses where the database is unresponsive during checkpoints, as shown
by the max_latency figure there (which is in milliseconds), regularly
exceed 40 seconds. And at high client counts you can see >80
seconds where the database is completely stalled. Check out a sample of a
really bad one. See the dead spot between 26:00 and 27:00 where no transactions are processed? That’s a bad thing.
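For reference, runs of this general shape look something like the following sketch. The database name, scale, and client count here are illustrative rather than the exact test harness, and the per-transaction latency log is where figures like max_latency come from:

```shell
# Sketch of a write-heavy pgbench run (names and sizes are examples,
# not the actual harness settings).
pgbench -i -s 500 pgbench              # build a scale=500 dataset (roughly 7.5GB)
pgbench -c 32 -j 4 -T 600 -l pgbench   # 10-minute run with a latency log

# The -l option writes one line per transaction to pgbench_log.<pid>;
# the worst latencies recorded there line up with checkpoint stalls.
```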
It used to be possible to improve these pauses by tuning Linux’s dirty_ratio and
dirty_background_ratio parameters. Nowadays system RAM is so large
that even the minimum settings possible there are caching way too
much. A better UI was introduced in kernel 2.6.29. Now you can set
these in bytes instead, which allows much smaller settings, using
dirty_bytes and dirty_background_bytes.
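On a 2.6.29 or later kernel, setting the byte-based limits looks like this. The 64MB/128MB values match what I describe below; the assumption that 64MB goes to the background threshold and 128MB to the blocking one is mine:

```shell
# Requires root.  Values are in bytes: 64MB and 128MB here.
echo $((64 * 1024 * 1024))  > /proc/sys/vm/dirty_background_bytes
echo $((128 * 1024 * 1024)) > /proc/sys/vm/dirty_bytes

# To survive a reboot, the equivalent /etc/sysctl.conf entries are:
#   vm.dirty_background_bytes = 67108864
#   vm.dirty_bytes = 134217728

# Check what's active; a *_bytes value of 0 means the older
# *_ratio setting is still in control instead.
grep -H . /proc/sys/vm/dirty_bytes /proc/sys/vm/dirty_background_bytes
```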
You can see what happens when you tune those down by looking at test
set #7. [Note that this also includes a work-in-progress PostgreSQL
patch to fix some other bad behavior in this area, introduced in set
#6.] I set the new dirty_* parameters to 64MB/128MB, trying to be below
the 256MB battery-backed cache in the RAID card. The parts that
improved are quite obvious. Now maximum latency on the smaller tests
drops to <10 seconds. And the worst one shown is just over 20
seconds now. Both of these are about 1/4 of the worst-case latency
shown before I tweaked these parameters.
There is a significant drop in transactions/second, too, but not
necessarily an unacceptable one. Across the whole set the averages
were scale=500, tps=690 and scale=1000, tps=349 before; now it’s
scale=500, tps=611 and scale=1000, tps=301. Batching up work into
larger chunks always gives higher transaction rates, but at the cost of
worse latency. That would be a completely reasonable trade-off in a lot of
situations: 1/4 the worst-case latency for a 10-15% drop in TPS.
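As a quick sanity check, that 10-15% figure falls straight out of the averages quoted above:

```shell
# Percentage TPS drop after tuning, from the before/after averages.
awk 'BEGIN {
    printf "scale=500:  %.1f%% drop\n", (1 - 611/690) * 100
    printf "scale=1000: %.1f%% drop\n", (1 - 301/349) * 100
}'
```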
Unfortunately I discovered a bigger downside here too. In between each
pgbench test, there is some basic table clean up done to try and make
the tests more repeatable. Part of that is doing a VACUUM of the
database. I noticed that this test series took way longer to finish
than any previous one. The figure you can compare reasonably here is
to ask “what’s the minimum time seen to clean the database up?”. That
varies a bit from test to test, but the minimums across a larger set
are quite consistent. It takes a certain amount of time to do that
job, and the fastest such cleanup job is a pretty stable figure here at
a given database size.
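Timing that sort of cleanup by hand looks something like this; the database name is a hypothetical stand-in, not what the test harness actually uses:

```shell
# Time the between-test database cleanup by hand.
# "pgbench" as the database name is an assumption for illustration.
time psql -d pgbench -c "VACUUM"
```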
Here are those figures from the last 3 test sets:
 set | scale | min_setup_time
-----+-------+-----------------
   5 |   500 | 00:03:41.220735
   5 |  1000 | 00:06:47.818328
   6 |   500 | 00:03:33.139611
   6 |  1000 | 00:06:41.154474
   7 |   500 | 00:06:06.56808
   7 |  1000 | 00:10:14.010876
You can see that test sets 5 and 6 both took around 3.5 minutes to
clean up a scale=500 database, and 6.75 minutes for scale=1000. Dialing
down the cached memory Linux was keeping around increased those times
to 6 minutes and 10 minutes instead. That’s 71% and 48% longer,
respectively, than those operations took with a large amount of write
caching managed by the kernel.
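Those percentages come straight from the rounded minutes:

```shell
# How much longer cleanup took with the small dirty_* settings,
# computed from the rounded times quoted above.
awk 'BEGIN {
    printf "scale=500:  %.0f%% longer\n", (6 / 3.5 - 1) * 100
    printf "scale=1000: %.0f%% longer\n", (10 / 6.75 - 1) * 100
}'
```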
Now that is a much harder price to pay for decreased latency. Looks
like some of the server-side work planned to try and reduce these
checkpoint spikes is still completely relevant even on a kernel that
has these knobs available.