In the defense of sar (and how to configure it)
Let me discuss a topic that is not inherently PostgreSQL-specific, but that I regularly run into while investigating issues on customer systems, evaluating the “supportability” of those systems, etc. It’s the importance of having a monitoring solution for system metrics, configuring it reasonably, and why sar is still by far my favorite tool (at least on Linux).
On the importance of monitoring
Firstly, monitoring of basic system metrics (CPU, I/O, memory) is extremely important. It’s a bit strange having to point this out in discussions with other engineers, but I’d say 1 in 10 engineers thinks they don’t really need monitoring. The reasoning usually goes along these lines:
It’s just another source of useless overhead. You don’t really need monitoring unless there’s an issue, and issues should be rare. And if there’s an issue, we can enable the monitoring temporarily.
It’s true monitoring adds overhead, no doubt about it. But it’s likely negligible compared to what the application is doing. Actually, sar is not really adding any extra instrumentation; it’s merely reading counters from the kernel, computing deltas and writing that to disk. It may need some disk space and I/O (depending on the number of CPUs and disks), but that’s about it. For example, collecting per-second statistics on a machine with 32 cores and multiple disks produces ~5GB of raw data per day, but it compresses extremely well, often to ~5-10% of that. And it’s barely visible in top. Per-second resolution is a bit extreme anyway, and using 5 or 10 seconds will reduce the overhead further.
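To illustrate, a minimal sketch of per-second collection might look like this (the sa1 path is the RHEL-style one and varies by distribution; on Debian-based systems it is typically /usr/lib/sysstat/sa1):
# run every minute, collecting 60 samples at 1-second intervals
* * * * * root /usr/lib64/sa/sa1 1 60
The daily data files (e.g. /var/log/sa/sa05 on RHEL-style systems) can then be compressed with xz or gzip once they rotate.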
So no, it turns out the overhead is not a valid reason not to enable monitoring.
Costs vs. benefits
More importantly though, “How much overhead do I eliminate by not enabling monitoring?” is the wrong question to ask. Instead you should be asking “What benefits do I get from the monitoring? Do the benefits outweigh the costs?”
We already know the costs (overhead) are fairly small or entirely negligible. What are the benefits? In my experience, having monitoring data is effectively invaluable.
Firstly, it allows you to investigate issues: scanning a bunch of charts for sudden changes is surprisingly effective, and often leads you directly to the cause. Similarly, comparing the current data (collected during the issue) to a baseline (collected when everything was fine) is very useful, and impossible if you only enable monitoring when things break.
Secondly, it allows you to evaluate trends and identify potential issues before they actually hit you. How much CPU are you using? Is the CPU usage growing over time? Are there some suspicious patterns in memory usage? You can only answer those questions if you have the monitoring in place.
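For example, answering those questions from the collected data might look like this (the file name is illustrative; sysstat keeps one data file per day of the month):
# CPU usage recorded on the 5th of the month
sar -u -f /var/log/sa/sa05
# load averages and run queue length from the same day
sar -q -f /var/log/sa/sa05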
Why sar is my favorite tool
Let’s assume I’ve convinced you monitoring is important and you should definitely do it. But why is sar my favorite tool, when there are various fancy alternatives, both on-premises and cloud-based?
- It’s included in all distributions, and is trivial to install and set up. This makes it fairly simple to convince people to enable it.
- It’s right on the machine. So if you SSH to the machine, you can also get the monitoring data.
- It uses simple text output, so it’s trivial to process the data: import it into a database, analyze it, or attach it to a support ticket (see the sketch after this list). That’s pretty difficult with other tools, which generally don’t let you export the data easily, only show charts, and/or significantly restrict what analysis you can perform.
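As a sketch of that workflow, sadf can dump the collected data in a machine-readable, semicolon-separated format, which loads into PostgreSQL easily (the sar_cpu table is hypothetical and needs columns matching sadf’s output):
# export CPU stats, dropping the '#'-prefixed header line
sadf -d /var/log/sa/sa05 -- -u | grep -v '^#' > cpu.csv
# load into a (pre-created) table in PostgreSQL
psql -c "\copy sar_cpu FROM 'cpu.csv' WITH (FORMAT csv, DELIMITER ';')"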
I do admit some of this comes from the fact that I work for a company providing PostgreSQL services to other companies (be it 24×7 support or Remote DBA), so we usually get only very limited access to customer systems (mostly just the database servers and nothing more). That means having all the important data on the database server itself, accessible over plain SSH, is extremely convenient and eliminates unnecessary round-trips just to request another piece of data from some other system. That saves both time and sanity on both sides.
If you have many systems to manage, you’ll probably prefer a monitoring solution that collects data from many machines to a single place. But for me, sar still wins.
So, how to configure it?
I mentioned that installing and enabling sar (or rather sysstat, the package that includes sar) is very simple. Unfortunately, the default configuration is somewhat bad. After installing sysstat, you’ll find something like this in /etc/cron.d/sysstat (or wherever your distribution stores cron configuration):
*/10 * * * * root /usr/lib64/sa/sa1 1 1
This effectively says the sa1 command will be executed every 10 minutes, and it will collect a single sample over 1 second. There are two issues here. Firstly, 10 minutes is fairly low resolution. Secondly, the sample only covers 1 second out of 600, so the remaining 9:59 are not really included in it. This is somewhat OK for long-term trending, where low-resolution random sampling is sufficient. For other purposes you probably need to do something like this instead:
* * * * * root /usr/lib64/sa/sa1 -S XALL 60 1
This collects one sample per minute, and every sample covers one minute. The -S XALL means all statistics should be collected, including interrupts, individual block devices and partitions, etc. See man sadc for more details.
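With data collected this way, you can then slice it by time window or subsystem when investigating, for instance:
# CPU usage between 2pm and 3pm today
sar -u -s 14:00:00 -e 15:00:00
# per-device disk statistics from the 5th, with readable device names
sar -d -p -f /var/log/sa/sa05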
Summary
So, to sum this post into a few simple points:
- You should have monitoring, even if you think you don’t need it. Once you run into issues, it’s too late.
- The costs of monitoring are probably negligible, and in any case much lower than the benefits of having the monitoring data.
- sar is convenient and very efficient. Maybe you’ll use something else in the future, but it’s a good first step.
- The default configuration is not particularly great (low resolution, 1-second samples). Consider increasing the resolution.
One thing I haven’t mentioned is that sar only deals with system metrics (CPU, disks, memory, processes), not with PostgreSQL statistics. You should definitely monitor that part of the stack too, of course.
What do you suggest for PostgreSQL statistics monitoring?
I don’t think there’s anything akin to sar, i.e. “simple and on the same machine”. You can snapshot the statistics catalogs yourself (see the sketch below), but that’s about it.
With regular monitoring systems (collecting data from multiple machines/services to a central place), I’d say collectd/grafana is a good choice. A simple alternative is Munin, but it has various limitations as it’s RRD-based.
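For the snapshot approach, a minimal sketch might be a cron job like this (the stats_history table is hypothetical; it needs a timestamp column followed by columns matching pg_stat_database):
# snapshot database-level statistics once a minute
* * * * * postgres psql -d postgres -c "INSERT INTO stats_history SELECT now(), * FROM pg_stat_database"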
Thank you for this article.
FYI all per-second statistics displayed by sar are average values over the given time interval. This is true also for CPU activity, which means that even if sa1 is executed only every 10 minutes, the sample collected will give you a value covering the past 10 minutes, not 1 second.
No, that is not true. Or more precisely, it’s not true for the default parameters used in cron jobs. The usual cron job is this:
*/10 * * * * root /usr/lib64/sa/sa1 1 1
which says, run it every 10 minutes, and every time collect one 1-second sample.
No, I insist… 🙂
*/10 * * * * root /usr/lib64/sa/sa1 1 1
actually means “run sa1 (sadc) every 10 minutes, and every time collect one sample”. The interval parameter (the first “1” value given to sa1) is meaningless here since you need at least 2 samples to define an interval.
The meaning is a bit different from that of sar, where e.g., “sar 1 1” means “display one line of statistics covering a one-second interval”, and in this case, *two* samples need to be collected by sar’s backend (sadc). You can have a look at question 2.22 from sysstat’s FAQ (see link below).
Remember that counters collected by sadc are, in most cases, cumulative values since boot time (also see question 2.15 in sysstat’s FAQ). So you can take one snapshot at time t, then another one 10 minutes later, and the values displayed will actually cover the whole 10-minute interval. But of course, those statistics (CPU utilization, network traffic, context switches, etc.) will be average values over the period, so maybe the dips and spikes will be less visible.
On the other hand, values like memory utilization (displayed by “sar -r”) are actually instantaneous: they give you a view of your system at the very moment they are collected.
Sysstat’s FAQ:
http://sebastien.godard.pagesperso-orange.fr/faq.html
https://github.com/sysstat/sysstat/wiki
Regards,
Sebastien (author and maintainer of the sysstat package)
Aha, I see! So `sa1 1 1` essentially just writes a single record into the data file, and `sar` then computes “per second” averages between those samples.
So the default cronjob definition
*/10 * * * * root /usr/lib64/sa/sa1 1 1
will produce averages over 10-minute intervals. That is certainly better than collecting 1-second samples every 10 minutes. I still think that’s not sufficient granularity for our needs, though (even ignoring that some metrics are not cumulative but instantaneous).
Thanks for correcting my long-term misunderstanding, and I promise to not argue with the guy who wrote the tool again 😉
And thanks for writing it and maintaining it, BTW! It’s extremely useful and versatile.