There’s a lot of information out there on how to configure PostgreSQL, on the importance of backups and testing them, etc.
But what about the server you run PostgreSQL on? We tend to pay a lot less attention to tools that you won’t need unless something breaks. But it’s worth taking some time to do so now, because that’s time you won’t have when your server is down and you’re in a rush.
Debuginfo and gdb
Seriously consider installing debug-info packages for the major services you run on a server, and for the libraries those services use. It costs some disk space, but it saves you a lot of time if you end up needing that information in a rush.
Debug info lets tools like the GNU Debugger gdb show variables, program arguments, and so on when attached to a running program or to a crash dump (core file). It lets people doing advanced support and bug fixing help you a lot more effectively. Thanks to gdb and debuginfo, I have been able to fix problems in running PostgreSQL instances without having to shut the server down.
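As a sketch of what that looks like in practice, here is roughly how you might grab a backtrace from a stuck backend. The PID shown is a placeholder, and any method of finding the backend PID works:

```shell
# Find the PID of a backend worth inspecting (shown here via psql,
# but top/ps work too):
sudo -u postgres psql -c "SELECT pid, state, query FROM pg_stat_activity;"

# Attach gdb to that backend (12345 is a placeholder PID) and print
# a backtrace. With debuginfo installed you see function names,
# arguments and line numbers instead of bare hex addresses.
sudo gdb --batch -ex 'bt' -p 12345
```

Without the matching debuginfo packages, the same backtrace is mostly unreadable addresses.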
Ideally you wouldn’t need to install debuginfo before you intend to use it; you could just fetch it on demand. But debuginfo must exactly match the installed packages, and repositories often carry only a limited history of old versions, so it might not be there when you need it. It’s also fairly common for servers to be Internet-isolated, so you can’t install packages on demand anyway. And the person doing the debugging may not be a sysadmin with root access, so they can’t install it even if they want to.
So please install it along with the PostgreSQL packages. For example, on Debian/Ubuntu:
sudo aptitude install postgresql-9.6-dbg
or for RHEL/Fedora:
sudo dnf install postgresql96-debuginfo
Better, use the debuginfo-install program for RHEL/Fedora:
sudo debuginfo-install postgresql
strace is nice to have too.
Performance: sysstat, iotop, perf
You should always install sysstat and make sure that it is collecting data for sar. I also recommend lowering the collection interval from the default of 10 minutes to 1 minute. The disk space cost is small, and having finer grained information helps a huge amount when diagnosing stalls and load spikes.
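For example, on Debian/Ubuntu the collection schedule lives in sysstat’s cron file; this is a sketch of bumping it to one minute (paths and syntax vary by distro and version):

```
# /etc/default/sysstat -- collection must be enabled at all:
ENABLED="true"

# /etc/cron.d/sysstat -- the default schedule is every 10 minutes
# (5-55/10); change it to run every minute instead:
* * * * * root command -v debian-sa1 > /dev/null && debian-sa1 1 1
```

On RHEL-family systems the equivalent schedule lives in /etc/cron.d/sysstat (or a systemd timer on newer releases).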
You need it installed before you have a problem so you can look back at historical trend data. That lets you say “aha, it got slower just when we enabled feature X in the app”, or “it seems that load has been growing steadily, and has now reached saturation”. It’s crucial for diagnosing the why behind “my server is suddenly really slow” problems.
You should make sure to install perf and iotop too, as they’re key tools for debugging performance issues.
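To show the kind of question each tool answers, here are a couple of typical invocations (the 10-second sample length is just an example):

```shell
# Which processes are actually doing I/O right now? -o hides idle ones.
sudo iotop -o

# Where is CPU time going? Sample stacks system-wide for 10 seconds,
# then browse the report interactively.
sudo perf record -a -g -- sleep 10
sudo perf report
```

Like debuginfo, these are far more useful if they are already installed when the emergency starts.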
Queries: auto_explain and pg_stat_statements
For similar reasons, I recommend installing and enabling auto_explain and pg_stat_statements. You’ll want to be able to look back on your logs and say “here’s when this query started taking too long” or “this query seems to have switched to a different query plan and suddenly got really slow”.
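A minimal postgresql.conf sketch for enabling both; the 5-second threshold is just an example, and auto_explain.log_analyze adds measurable overhead, so weigh it before enabling in production:

```
# postgresql.conf -- requires a server restart to change preload libraries.
shared_preload_libraries = 'pg_stat_statements, auto_explain'

# Log the plan of any query that runs longer than 5 seconds:
auto_explain.log_min_duration = '5s'

# Include actual row counts and timing (costs overhead on every query):
auto_explain.log_analyze = on
```

pg_stat_statements additionally needs `CREATE EXTENSION pg_stat_statements;` run once in the database where you want to query its view.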
The information collected isn’t perfect, since queries that run quickly tend not to be logged, but it still provides lots of useful background.
Old logs are valuable. Don’t rotate logs away after a few days. They can provide critical forensic information and trend data too.
Consider archiving a month’s worth of logs, or something like that.
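If your PostgreSQL logs go to files rotated by logrotate (as on a stock Debian/Ubuntu install), that can be a sketch like the following; the path and retention are examples to adapt:

```
/var/log/postgresql/*.log {
    weekly
    rotate 5
    compress
    delaycompress
    missingok
    notifempty
}
```

If you use PostgreSQL’s built-in logging_collector instead, the equivalent knobs are log_rotation_age, log_filename, and (if set) log_truncate_on_rotation in postgresql.conf.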
Last year a team I was part of was able to use old logs, combined with old backups, to rescue and repair a database that had been severely corrupted by a SAN fault. Despite broken pg_clog and other serious damage, the customer lost no data.
To fix a system, sometimes you need to know how it got broken. For that, the logs are a vital tool.
Keep your kernel logs (kern.log, dmesg.log, etc) too.
Corruption: pageinspect, pg_freespacemap
If a disk fault, memory error, SAN fault, filesystem error, admin mistake, or PostgreSQL bug corrupts a table or index, it’s useful to have tools to look at what exactly is wrong. One of them is pageinspect, and I suggest installing it routinely on your databases. It’s part of contrib, so just make sure contrib is always installed. There’s no need to actually CREATE EXTENSION pageinspect; ahead of time.
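To illustrate, this is roughly what a first look at a suspect table’s on-disk pages might be; `mydb` and `mytable` are placeholders:

```shell
# Create the extension on demand (contrib must already be installed),
# then dump line pointers and tuple headers from the table's first page.
sudo -u postgres psql -d mydb <<'SQL'
CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT lp, lp_off, lp_len, t_xmin, t_xmax
FROM heap_page_items(get_raw_page('mytable', 0));
SQL
```

Garbage offsets, impossible transaction IDs, or errors reading the page itself are the kind of thing this surfaces immediately.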
If you have people doing remote support, consider whether restricting root access is really worthwhile. If someone has access to the postgres unix user, they can delete or corrupt your entire database if they want to. If someone only has access to the postgres database superuser, they can generally gain access to the unix postgres user. PostgreSQL’s superuser access doesn’t try hard to stop the DB user from breaking out, and they can use plpythonu, plperlu, etc to get a shell.
If you restrict access to a non-superuser database user, then there’s not a great deal that can be done to help in a support role. No logs access, many system views are restricted, no debugging, no pageinspect, no system level performance metrics, etc. This is one of the challenges with diagnosing issues on cloud-hosted systems – we don’t have superuser rights, and a lot of crucial information is inaccessible.
So if you give someone who’s doing support on your server enough access to do the job, it’s my opinion that you should give them ‘sudo’ to root. Withholding root doesn’t gain you meaningful protection or security, since they can already do massive harm with just the postgres account, but it does mean that in an emergency they can’t access all the tools they need to help you effectively.
When I don’t have root on a failing customer server, I can’t install gdb and debug info packages to diagnose a crash. I can’t enable core dumps. I can’t install I/O tracing and diagnostic tools to see why it’s really slow. I can’t ask the firmware for information about recent memory errors or probe for disk faults. I can still fix and diagnose things, but it can be a slower, more difficult process that means more downtime for the customer.
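Core dumps in particular are worth enabling ahead of time. A sketch of the relevant settings, with example paths (the dump directory must exist and be writable by the postgres user):

```
# /etc/security/limits.d/postgres-core.conf -- allow the postgres
# user to write core files at all:
postgres  soft  core  unlimited

# /etc/sysctl.d/90-core.conf -- write cores to a known directory
# (named by executable and PID) rather than the process's cwd:
kernel.core_pattern = /var/coredumps/core.%e.%p
```

With this in place, a crashing backend leaves behind a core file that gdb plus debuginfo can turn into a full diagnosis, instead of leaving you waiting for the crash to happen again.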
I recommend having a separate account that has both rights to `sudo -u postgres` and to sudo to root. Don’t give the `postgres` account `sudo`.
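A sketch of what that looks like in sudoers; `support` is a placeholder username, and the file should be edited with `visudo -f`:

```
# /etc/sudoers.d/support -- a dedicated support account that can
# become both postgres and root. Note there is no entry granting
# the postgres account itself any sudo rights.
support ALL = (postgres) ALL
support ALL = (root) ALL
```

Keeping the rights on a separate named account also gives you an audit trail of who did what, which the shared postgres account cannot.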
This can be the difference between arriving at your desk to an email saying “we got an alert and fixed your server overnight” and getting a 4am phone call.