Tracing PostgreSQL with perf
The profiling utility perf that ships with the Linux kernel is extremely useful for examining system-wide and multi-process behaviour – but it does a lot more than the CPU profiling it’s often used for. You’ve probably looked at perf top -az or perf top -u postgres output, but that’s only a small part of what it can do. (If you want the TL;DR version, jump down to “User-space dynamic probes”.)
One of the big advantages of perf is that it’s non-intrusive. You don’t have to attach a debugger and interrupt execution. You don’t have to run commands directly under a profiler in a special environment. There’s no need to restart the server to debug a problematic workload, and often no need to recompile with debug options. This is extremely useful when you’re trying to track down performance problems in a live system, as it lets you quickly test theories about what might be going on, with minimal impact.
perf isn’t just a profiler, it also has tracing support. Profiling is based on sampling of the state of the system when triggered by hardware or software performance counters; it gives a statistical sampling of points where the system spends most time. Tracing instead takes samples whenever a particular trace event occurs, so it’s much more useful for infrequent but important events.
When working with PostgreSQL one of the most exciting features of perf is the ability to trace user-space processes. Want to know how often your PostgreSQL is swapping WAL segments, how often it does foreign key lookups, etc? For one PostgreSQL backend or across the whole cluster? perf can help with that.
User-space and kernel-space tracepoints can be mixed and can be used at the same time as performance counter profiling to help you get a good picture of the system. perf can capture stack traces from both kernel and user-space, and can also do statistical visualisations. User-space tracepoints are created with dynamic probes; kernel-space ones can be pre-defined or can be dynamic probes.
So, how do you use some of these features?
Install the tools
First, make sure you’re using a current perf. This article was written on Fedora 19 with perf 3.11.6 on x86_64, and some of the features are relatively new.
If you want user-space stack results you’ll want the code you’re looking at to be built with -Og -ggdb -fno-omit-frame-pointer. If you’re using a perf built with libunwind you don’t need frame pointers; see this Stack Overflow post and RH Bugzilla #1025603. None of this is needed if you’re only interested in kernel-side data. If you’re using distro packages you might need to install -debuginfo packages too.
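If you’re building PostgreSQL from source, the simplest way to get suitable binaries is to pass those flags at configure time. A minimal sketch (the installation prefix and make parallelism are just placeholders):

./configure --prefix=$HOME/pg-debug --enable-debug CFLAGS="-Og -ggdb -fno-omit-frame-pointer"
make -j4 && make install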
All the following tests were run with stock PGDG PostgreSQL 9.2 packages from http://yum.postgresql.org/ using a perf rebuilt with libunwind support according to the above instructions.
Kernel tracepoints and probes
perf can capture data from pre-defined kernel tracepoints, some of which are informative when looking at issues with memory fragmentation, disk I/O, etc. You can get a list of tracepoints with sudo perf list. Tracepoint lists can be specified and wildcards are supported. For example, if we want to get write and disk flush statistics on a running PostgreSQL instance we might run:
sudo perf record -g dwarf -e block:block_rq_issue,syscalls:sys_enter_fsync -u postgres sleep 10
to capture the data. Instead of a sleep you can omit the command entirely and hit control-C when you’re done capturing, or you can use some other command, like a psql -c, to trigger the workload you want to measure, as in the sketch below.
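For example, to capture fsync activity from all backends while forcing a checkpoint (the database and user names here are just illustrative), something like:

sudo perf record -g dwarf -e syscalls:sys_enter_fsync -u postgres -- psql -q -U postgres craig -c "checkpoint;"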
-u postgres profiles all processes running as user postgres. You can instead use -a for whole-system profiling across all CPUs. It’s also possible to trace just one backend. Start psql, run select pg_backend_pid(), run perf with -p $the_pid, then start the workload in the same psql session.
When you’re working with PostgreSQL the default target process (the command run under the control of perf) isn’t usually very useful to profile, because the backend does most of the work, not psql. It’s still useful to use the subcommand to control the test workload and timing.
Once you’ve captured the data you can use perf report to examine it. There are too many options to cover here; they control result aggregation and simplification, stack trace display, interactive curses vs plain-text report output, and more. One starting point is sketched below.
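For example, a plain-text report aggregated by process, shared object and symbol might look something like this (the available sort keys vary a little between perf versions):

sudo perf report --stdio --sort comm,dso,symbol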
Take this session as an example, where there is a shell session (terminal “T2”) and a postgres session connected to database “regress” (terminal “T1”):
T1| regress=> select pg_backend_pid();
T1|  pg_backend_pid
T1| ----------------
T1|            4495
T1| (1 row)
T2| $ sudo perf record -g dwarf -e block:block_rq_*,syscalls:sys_enter_write,syscalls:sys_enter_fsync -p 4495
T1| regress=> create table x as select a FROM generate_series(1,1000000) a;
T1| regress=>
T2| $ ^C
T2| [ perf record: Woken up 332 times to write data ]
T2| [ perf record: Captured and wrote 86.404 MB perf.data (~3775041 samples) ]
T2|
T2| $ sudo perf report -g
You can use perf report’s curses GUI to dig down into the trace, or you can use the perf report --stdio option to get it to stream data to stdout. For example, if you want stack traces you might run:
$ sudo perf report -g --stdio
... blah blah ...
# Samples: 1  of event 'syscalls:sys_enter_fsync'
# Event count (approx.): 1
#
# Overhead   Command  Shared Object          Symbol
# ........  ........  .............  .....................
#
   100.00%  postgres  libc-2.17.so   [.] __GI___libc_fsync
            |
            --- __GI___libc_fsync
                mdimmedsync
                heap_sync
                intorel_shutdown
                standard_ExecutorRun
                ExecCreateTableAs
                PortalRunUtility
                PortalRunMulti
                PortalRun
                PostgresMain
                ServerLoop
                PostmasterMain
                main
                __libc_start_main
                _start
                (nil)
... blah blah...
showing that for the event syscalls:sys_enter_fsync there was one event with the above stack, an fsync invoked via ExecCreateTableAs.
(For a reason I haven’t yet been able to pin down, the final fsync() doesn’t seem to be captured by perf when psql is run directly under the control of perf. This isn’t an issue with perf stat, only perf record. That’s why I’m jumping through hoops to pre-select the backend by pid above.)
User-space dynamic probes
Sometimes you’re more interested in something happening within PostgreSQL itself than in events within the kernel that are triggered by PostgreSQL. Newer versions of perf can help with this by inserting dynamic tracepoints that trigger on calls in user-space programs.
Let’s say you’re interested in watching WAL activity, and want to see when XLogFlush, XLogFileInit or XLogFileOpen are called. You can insert dynamic tracepoints for these calls with perf:
sudo perf probe -x `which postgres` XLogFileInit
sudo perf probe -x `which postgres` XLogFileOpen
sudo perf probe -x `which postgres` XLogFlush
You can only probe external symbols (non-static, not hidden by -fvisibility flags) unless you built with -ggdb. perf will complain “no symbols found” if you try to use a symbol that doesn’t exist. At the time of writing perf doesn’t support using external debuginfo to look up symbols for probes, though it can use it for stack analysis. In general, if it’s an extern in src/include you can use it with perf.
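If you’re not sure which symbols are available to probe, newer perf versions can list them for you; something along these lines should work (the --funcs option isn’t present in very old perf builds):

sudo perf probe -x `which postgres` --funcs | grep -i xlog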
Each perf probe command prints the name of the tracepoint it creates, and you can use perf probe -l to list them all anyway:
$ sudo perf probe -l
  probe_postgres:XLogFileInit (on 0x000000000009a360)
  probe_postgres:XLogFileOpen (on 0x000000000009a860)
  probe_postgres:XLogFlush    (on 0x00000000000a0670)
These probes are now usable as perf events. Let’s take a look at xlog activity during a sample workload, monitoring across the whole cluster while I do a pgbench run:
sudo perf record -g dwarf -u postgres -e probe_postgres:XLogFileInit,probe_postgres:XLogFileOpen,probe_postgres:XLogFlush
Try it yourself with perf report -g; a starting invocation is sketched below. You can use options like -g fractal,0 to control detail. You’ll be able to see the percentage of a given counter’s hits that came from one stack branch or another, the pid and process, etc. The --sort options give you more control over aggregation and grouping.
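As a starting point, assuming the probe names created above, something like this gives a plain-text breakdown by backend pid and symbol:

sudo perf report -g fractal,0 --sort pid,symbol --stdio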
But wait, there’s more
You should also check out the perf stat and perf top commands. They can take the same event-lists as perf record, though for some odd reason their process-filter support is different.
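For instance, a quick way to watch block I/O requests live across the whole system is something like the following (press q to quit):

sudo perf top -a -e block:block_rq_issue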
Here’s an example that runs a dummy workload and looks at I/O kernel tracepoints during the run:
$ sudo perf stat -e block:block_rq_*,syscalls:sys_enter_write,syscalls:sys_enter_fsync -a -r 5 -- psql -q -U postgres craig -c "drop table if exists x; create table x as select a FROM generate_series(1,1000000) a;";

Performance counter stats for 'psql -U postgres craig -c drop table if exists x; create table x as select a FROM generate_series(1,1000000) a;' (5 runs):

                 0 block:block_rq_abort                                         [100.00%]
                 0 block:block_rq_requeue                                       [100.00%]
                97 block:block_rq_complete                   ( +- 14.82% )      [100.00%]
                96 block:block_rq_insert                     ( +- 14.97% )      [100.00%]
                98 block:block_rq_issue                      ( +- 14.67% )      [100.00%]
                 0 block:block_rq_remap                                         [100.00%]
            10,607 syscalls:sys_enter_write                  ( +-  0.17% )      [100.00%]
                 1 syscalls:sys_enter_fsync

       0.908835058 seconds time elapsed                                         ( +- 18.31% )
You can see it’s doing an average of about 100 block-layer I/O requests and over 10,000 write()s per run, with a single fsync(). Some of that is background noise, since we’re doing whole-system profiling (-a), but since the system is fairly idle it isn’t much, and it’s averaged out over five runs.
Similarly, using the dynamic probes we added earlier, we can keep an eye on xlog activity during a pgbench run:
$ sudo perf stat -e probe_postgres:XLogFileInit,probe_postgres:XLogFileOpen,probe_postgres:XLogFlush -a -- /usr/pgsql-9.2/bin/pgbench -U postgres craig -c 2 -t 10000

starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 100
query mode: simple
number of clients: 2
number of threads: 1
number of transactions per client: 10000
number of transactions actually processed: 20000/20000
tps = 715.854663 (including connections establishing)
tps = 716.092133 (excluding connections establishing)

Performance counter stats for '/usr/pgsql-9.2/bin/pgbench -U postgres craig -c 2 -t 10000':

                64 probe_postgres:XLogFileInit                                  [100.00%]
                 0 probe_postgres:XLogFileOpen                                  [100.00%]
            55,440 probe_postgres:XLogFlush

      27.987364469 seconds time elapsed
There’s plenty more you can do, including capturing the state of local variables with perf probe; a quick taste is below, and I’ll write up some more useful examples later. In the meantime, play, explore and have fun with a new diagnostic tool.
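As a quick taste: if you have debug info available (e.g. a -debuginfo package, or a build with -ggdb), perf probe can record function arguments along with each hit. Something like the following should attach to XLogFlush and capture its argument. Note that the argument name (“record”) comes from the PostgreSQL source and may differ between versions; on releases where XLogRecPtr is a struct rather than a plain integer you may need to name a member instead, and if the earlier XLogFlush probe is still in place you should delete it first or perf will create a second, suffixed event.

sudo perf probe -x `which postgres` 'XLogFlush record'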
Update: Michael Paquier recently wrote a related article about tracing PostgreSQL with systemtap that may be of interest to readers of this one. You have to recompile Pg to use that approach, but the syntax is nicer and it offers some other advantages.
Thanks, this was useful. I assume you had to enable these tracepoints first?
The “perf probe” commands create tracepoints in user-space executables/libraries; this is shown above with the creation of the probe_postgres:XLogFileInit tracepoint, etc.
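They persist until you delete them (or reboot); when you’re done you can clean up with a wildcard:

sudo perf probe --del 'probe_postgres:*'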
The other tracepoints used are built in to perf and the kernel. If you don’t have them you might either be running perf without appropriate access rights (it can’t see all tracepoints as non-root) or on an older kernel.
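For example, you can check whether the block tracepoints are visible to you with:

sudo perf list 'block:*'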