On June 17th, I gave a presentation on Postgres Monitoring to the Chicago Postgres User Group. Since MeetUp doesn’t allow uploading slides, I figured I’d convert my presentation into a more comprehensive writeup.
So, what are the finer points of Disaster Mitigation?
Let’s start with a few questions:
- What broke?
- When did it break?
- Why did it break?
- How badly did it break?
- Who broke it?
- How long has it been broken?
Do we really want to answer these?
No. Of course not. Besides, there’s a better question: What’s about to break? It’s about being proactive.
Answer These Questions First!
- What could break?
- How screwed are we if it does break?
- When do we want to know?
- Who should be notified?
We all know monitoring is important, but monitoring can prevent just as many problems as it reveals, so long as we pay attention. For Postgres, it’s about prioritization, evaluation, and lead time.
So how can we get there?
Form Follows Function
Once we know what we’re looking for, we know what to build.
So What’s Critical to Postgres? Here’s a non-exhaustive list of just a few things we like to monitor:
- Count Per State
- Replication / Slots
- Disk Space
- WAL Throughput
- XID Wraparound
- Table Age
- Process Counts
- Temp Space
That’s only a tiny sample. There are literally dozens of useful metrics, so we’ll cover the critical ones in this post rather than trying to address every single one. Some of these could spell the end of the world, so to speak, and monitoring them isn’t always obvious, or is in some ways incomplete.
Let’s break down the mystery.
Transaction ID Wraparound
This tends to keep me up at night. It’s been the topic of my past Around the World in Two Billion Transactions and a portion of 10 Things Postgres Could Improve – Part 1. In both of those, I mentioned some hand-wavy ways to monitor for this, so let’s get more specific.
Why XID Wraparound Happens
Postgres must FREEZE old tuples so there’s no longer an old transaction reference to them. Doing this makes the tuple readable for all future transactions, under the assumption there is no longer a minimum transaction necessary to see them. That reference is the metadata xmin field associated with each tuple, and FREEZE tells Postgres to ignore the xmin value, essentially nullifying it.
Older versions of Postgres used to set this field to 2 (the reserved FrozenTransactionId) instead, but the original transaction value could be useful for page inspections and write history in some cases, so now it’s a bit flag in the tuple header. Either way, ancient tuples from equally antediluvian transactions no longer hold their lower XID values, so those XIDs may be reused.
This is what allows the XID counter to wrap back around to "lower" values. Now remember that the XID is the global transaction identifier, that it’s only 32 bits, and that these wraparounds occur constantly. If an old tuple’s xmin were ever overtaken by the advancing XID counter, it would be game over, since we could no longer tell whether that tuple is in the past or the future. Postgres will shut itself down far before something like that happens, but that’s going to severely impact our application and business either way.
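The wraparound math is easier to see in code. Internally, Postgres compares XIDs with modular arithmetic (TransactionIdPrecedes), treating the two billion IDs behind any given XID as the past and the two billion ahead as the future. A rough Python sketch of that comparison (the function name is mine):

```python
def xid_precedes(xid_a: int, xid_b: int) -> bool:
    """True if xid_a is logically older than xid_b under 32-bit wraparound."""
    diff = (xid_a - xid_b) & 0xFFFFFFFF  # wrap the difference to 32 bits
    return diff >= 0x80000000            # i.e. (int32)(a - b) < 0

print(xid_precedes(100, 200))            # True: 100 happened before 200
print(xid_precedes(4_000_000_000, 100))  # True: after wraparound, 100 is "newer"
```

Every XID is simultaneously "before" half the counter space and "after" the other half, which is exactly why old unfrozen tuples eventually become ambiguous.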
Watching for Wraparound
Let’s monitor for the situation instead with one of these two queries. The first will show database transaction age:
SELECT datname AS database_name, age(datfrozenxid) AS age FROM pg_database;
The next one does this at the individual table level:
SELECT c.relname AS table_name,
       greatest(age(c.relfrozenxid), age(t.relfrozenxid)) AS age
  FROM pg_class c
  LEFT JOIN pg_class t ON c.reltoastrelid = t.oid
 WHERE c.relkind IN ('r', 'm', 'p')
   AND greatest(age(c.relfrozenxid), age(t.relfrozenxid)) >
       current_setting('autovacuum_freeze_max_age')::NUMERIC * 0.9
 ORDER BY age DESC;
It’s a bit more complicated because we also need to account for TOAST data that is stored externally to the tables themselves. This particular query is also designed to only care about tables that are above 90% of the currently configured autovacuum_freeze_max_age, where Postgres will freeze the table anyway.
Regardless of what happens, the age output should be less than autovacuum_freeze_max_age, greater than autovacuum_freeze_min_age, and MUCH less than 1 billion. Some companies may have enough data turnover that setting autovacuum_freeze_max_age to 1 billion makes sense, but even in these cases, lower values are far better. What matters is having enough lead time before the absolute 2-billion limit, where Postgres will shut down and refuse to process further transactions until the age horizon is rectified.
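To illustrate how that lead time could feed an alert, here’s a hypothetical Python classifier. The function name and cutoffs (90% of autovacuum_freeze_max_age, 1 billion) are mine, chosen to mirror the guidance above rather than any official recommendation:

```python
AUTOVACUUM_FREEZE_MAX_AGE = 200_000_000  # the Postgres default setting
SHUTDOWN_HORIZON = 2_000_000_000         # approximate hard wraparound limit

def freeze_age_status(age: int, freeze_max_age: int = AUTOVACUUM_FREEZE_MAX_AGE) -> str:
    """Classify a datfrozenxid/relfrozenxid age for alerting purposes."""
    if age >= SHUTDOWN_HORIZON * 0.5:    # 1 billion: far too close for comfort
        return "critical"
    if age >= freeze_max_age * 0.9:      # autovacuum should have frozen this by now
        return "warning"
    return "ok"

print(freeze_age_status(50_000_000))     # ok
print(freeze_age_status(190_000_000))    # warning
print(freeze_age_status(1_200_000_000))  # critical
```

A collector would run the age queries above on a schedule and feed each row through a check like this.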
Of course, the story doesn’t end here.
More XID Madness!
Did you think it would be that easy? Postgres uses VACUUM to freeze tuples as described, and that’s not some magical process. It works just like any other session in the respect that it is assigned a transaction identifier, and cleans up tuples visible below that horizon. Several things can hold that horizon back, and we need to find all of them.
Lurking Prepared Transactions
All things considered, you’re probably not using these. They’re disabled by default, and require setting max_prepared_transactions in order to enable them. Several blog articles out in the wild suggest doing just that.
The easiest way to protect against them is to leave them disabled. Check whether you could have some:
SHOW max_prepared_transactions;
What makes prepared transactions a potential issue is that they prevent old tuples from being vacuumed. If a row is still visible, it clearly can’t be reaped, which can also artificially hold back the XID horizon. What’s worse is that these transactions survive the life of the session, and even a database restart.
So if the application that originated them crashes or somehow loses track of the transactions it has prepared, these could stick around for days, weeks, or even months. Luckily there’s a catalog view for tracking these:
SELECT database, prepared, round(extract(epoch FROM now() - prepared)) AS age FROM pg_prepared_xacts;
This will identify all prepared transactions and list how long they’ve existed. This is one of the first things we check in support incidents when wraparound is possible, because it’s one of the last places users normally look.
Idle Transactions Are Bad (M’kay?)
Idle transactions show up in pg_stat_activity in the "idle in transaction" state, and are almost as guilty of holding back XID horizons as prepared transactions. The primary difference is that these don’t survive termination of the origin session.
But imagine an application that launches a transaction and freezes, or enters an infinite loop somewhere else in the code while the transaction languishes. Or someone starts a transaction and goes home for the weekend after running a few queries. While less durable than prepared transactions, idle transactions are far more numerous, and thus more likely.
The best way to prevent these is to set an absolute ceiling with idle_in_transaction_session_timeout, using a value longer than the longest possible batch job or ad-hoc query. Sessions terminated by this parameter likely require some visibility anyway.
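As a sketch, the postgresql.conf entry might look like this; the one-hour value is purely illustrative, and should exceed your longest legitimate batch job:

```
# Terminate any session that sits idle inside an open transaction
# for more than an hour. Adjust to your longest legitimate workload.
idle_in_transaction_session_timeout = '1h'
```

The same setting can also be applied per role or per database if a blanket value is too aggressive for some workloads.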
A good query to find these might look like this:
SELECT state, max(now() - xact_start) AS max_duration FROM pg_stat_activity WHERE state != 'idle' GROUP BY state;
Note that we’re not specifically seeking "idle in transaction" here, because this allows us to see the maximum ongoing transaction time by current state. Regular active sessions can be just as guilty if there’s a rogue query or Cartesian product, after all. Why not just graph all sessions by type?
Hot Standby Feedback
Then there’s hot_standby_feedback. As with idle transactions, this setting is far more likely to be in use, due to how replica servers are commonly used. When enabled, hot_standby_feedback tells the upstream Primary not to VACUUM tuples that are currently being referenced on the replica. Otherwise, in order to apply incoming WAL for a VACUUMed tuple, the replica must cancel any queries viewing those rows so it can apply the changes.
This prevents canceled queries on replica systems, but it also means that a replica server can have its own idle transactions, and these can affect the Primary server as well. This usually just leads to database bloat, but in the worst cases it can also prevent the XID horizon from advancing and increase the risk of wraparound.
Check whether you’re using this by running the following on every node:
SHOW hot_standby_feedback;
We bring this up to emphasize that replica servers should also be closely monitored, even in cases where the replica is not a Standby server. Idle transactions are bad wherever they are. Active queries longer than a few hours should raise a few eyebrows.
The Effects of Replication Lag
Imagine a broken copy machine that fills your house with toner. In this case, if a replica can’t keep up with the WAL stream, or is using a replication slot and isn’t active, the Primary node may continue to accumulate WAL files until all storage is exhausted.
But aren’t replication slots good? Yes! They allow a replica server to be offline for periods of time without the risk of falling behind permanently and thus requiring a rebuild. However, consuming all disk space on the Primary is a very real risk.
As a result, we always strongly recommend monitoring for two kinds of replication lag. Foremost among these is slot lag, because this causes WAL files to persist if the replica is offline or unreachable:
SELECT slot_name, active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS restart_lag,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS flush_lag
  FROM pg_replication_slots;
And then regular replication lag, which retains WAL files until the replica applies them, or disconnects:
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS sent_lag,
       pg_wal_lsn_diff(pg_current_wal_lsn(), write_lsn) AS write_lag,
       pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn) AS flush_lag,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag
  FROM pg_stat_replication;
Each of these metrics can be a separate graph line, and gives some insight into the cause of the lag. Is flushing slow due to storage throughput, or is replay falling behind due to CPU limitations?
Both of these queries are only valid from the perspective of the Primary server, and should always be in place if possible.
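These lag values are byte counts, courtesy of pg_wal_lsn_diff. If a collector only has raw LSN strings like '1/9A2B3C4D', the same difference can be computed externally, since an LSN is just a 64-bit WAL position written as two hex halves. A small Python sketch (the function names are mine):

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a Postgres LSN like '1/9A2B3C4D' to an absolute byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def lsn_diff(a: str, b: str) -> int:
    """Equivalent of pg_wal_lsn_diff(a, b): bytes between two WAL positions."""
    return lsn_to_bytes(a) - lsn_to_bytes(b)

print(lsn_diff("1/00000000", "0/FF000000"))  # 16777216 bytes of lag
```

This is handy when graphing lag from log files or external collectors that only record the raw LSNs.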
Temp Space
What kind of world is it where even reads are writes? It turns out that producing query results requires scratch space. This can exist either in RAM or on disk, and covers the gamut of intermediate result sets, merges, hashes, sorts, grouping, and any other operation the executor could apply before returning the final results.
In most cases, each of these separate tasks easily fits into work_mem amount of RAM, and each operation gets its own copy of that precious resource. But not always. Consider what happens after accidentally omitting a critical WHERE clause. The resulting Cartesian product could pull all rows from a dozen billion-row tables. Then the executor needs to merge those results, and possibly even sort them.
This extra storage activity not only robs read (and write!) IOPS from other queries, but in extreme cases can even exhaust disk space on the server itself.
We always recommend monitoring storage space, but that only shows there is a problem, not the cause. As a result, we suggest adding this kind of monitor as well:
SELECT now() AS check_dt, datname, temp_files, temp_bytes FROM pg_stat_database;
This shows, per database, how much temp space is being used, spread out over how many actual files. Since Postgres stores these as steadily increasing values over time, a timestamp is also necessary to differentiate against previous readings. A huge spike here, especially one that grows steadily, should be cause for immediate concern. It can even be cross-correlated with our earlier check against pg_stat_activity to figure out which query is contributing, so we can quickly cancel it.
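Since these counters only ever increase, a collection script must subtract the previous sample to get per-interval usage. A hypothetical sketch of that delta logic in Python:

```python
def temp_delta(prev: dict, curr: dict) -> dict:
    """Per-interval temp usage from two cumulative samples.

    Each sample maps datname -> (temp_files, temp_bytes), as returned
    by the pg_stat_database query above.
    """
    deltas = {}
    for db, (files, total_bytes) in curr.items():
        prev_files, prev_bytes = prev.get(db, (0, 0))
        deltas[db] = (files - prev_files, total_bytes - prev_bytes)
    return deltas

prev = {"app": (10, 1_000_000)}
curr = {"app": (14, 9_000_000)}
print(temp_delta(prev, curr))  # {'app': (4, 8000000)}
```

The check_dt column from the query is what lets you turn these deltas into rates for graphing.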
Checkpoint Activity
Checkpoints are the lifeblood of the system. During a checkpoint, Postgres writes (or verifies as written) everything described by WAL to the backend data files. Slow syncs may be an early indicator of failing disks, or simply of overwhelmed IO throughput.
Given that few people know to check this, it often goes overlooked for months or even years. I’ve seen systems that never check for checkpoint activity at all over the entire lifespan of the installation.
Often all it takes is a simple query like this:
SELECT now() AS check_dt, checkpoints_timed, checkpoints_req, checkpoint_write_time, checkpoint_sync_time FROM pg_stat_bgwriter;
Of interest in this case are the counts of timed checkpoints versus requested checkpoints. If a checkpoint is timed, it is being properly scheduled according to checkpoint_timeout, and honors checkpoint_completion_target to avoid causing a potential disk flush storm.
Requested checkpoints have no such qualms, and several of these over a short span can even result in disk throughput starvation. Extreme cases may force the operating system into synchronous write mode, such that every write must report a sync status before continuing. This alone will drastically increase IO wait times and the server may appear almost unresponsive until write caches are clear.
Likewise, this query also lets us know how much time Postgres spent writing as opposed to waiting for disk synchronization. A spike in either is bad, but sync times are much more sensitive and should be closely monitored.
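One way to turn those counters into an alert is to compute the fraction of checkpoints in each interval that were requested rather than timed. This Python sketch is illustrative; the 20% threshold is an arbitrary assumption, not an official guideline:

```python
def requested_checkpoint_ratio(prev_timed, prev_req, curr_timed, curr_req):
    """Fraction of checkpoints in the interval that were requested, not timed."""
    timed = curr_timed - prev_timed
    req = curr_req - prev_req
    total = timed + req
    return req / total if total else 0.0

# Two samples of pg_stat_bgwriter taken some interval apart:
ratio = requested_checkpoint_ratio(100, 5, 110, 9)
print(f"{ratio:.0%} requested")  # 29% requested
if ratio > 0.2:  # arbitrary threshold; tune for your workload
    print("checkpoints are being forced; check write volume and WAL sizing")
```

A persistently high ratio suggests checkpoints are triggered by WAL volume rather than the schedule, which is worth investigating before it starves disk throughput.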
Transaction and WAL Throughput
Why is it so crowded in here? Lots of transactions may mean lots of data. This can sometimes lead to replicas falling behind, and could be a sign of application issues. What kind of issues? Maybe a new deployment is writing too much data, maybe caching is no longer operating properly, or perhaps a malformed query or two are contributing.
A good way to track this is to watch transaction activity itself, with a query much like this:
SELECT now() AS check_dt, datname, xact_commit AS commit, xact_rollback AS rollback FROM pg_stat_database;
This shows us how many transactions were committed or rolled back over time on a per-database basis. Both are great chartable values, and a spike in either could mean there are issues. If the normal commit rate is 500 per interval and it suddenly spikes to 20,000, that deserves attention even if the rollback count hasn’t risen at all.
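Detecting that kind of spike programmatically just means comparing the current rate against a rolling baseline. A hypothetical Python sketch, where the window size and 10x multiplier are arbitrary choices:

```python
from collections import deque

class SpikeDetector:
    """Flag a sample that exceeds the rolling average by some multiplier."""

    def __init__(self, window=12, multiplier=10.0):
        self.history = deque(maxlen=window)
        self.multiplier = multiplier

    def check(self, rate: float) -> bool:
        baseline = sum(self.history) / len(self.history) if self.history else None
        self.history.append(rate)
        return baseline is not None and rate > baseline * self.multiplier

detector = SpikeDetector()
for rate in [500, 480, 510, 20_000]:
    if detector.check(rate):
        print(f"commit rate spike: {rate}/interval")  # fires on 20000
```

The same detector works for rollbacks, temp bytes, or any other per-interval rate derived from the cumulative counters.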
Additionally, it could be a good idea to include a query like what I introduced in Terrific Throughput Tracking:
SELECT now() AS check_dt, pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0') AS total_bytes;
This lets us actually see how much WAL data we’ve generated between our check intervals. A spike here indicates increased write traffic, but more than that, it lets us see what our average WAL throughput actually is. Comparing this month over month or year over year could be much more valuable than simply checking the size of the database in the same period.
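Turning two successive samples of that total_bytes value into a throughput figure is simple subtraction. A minimal Python sketch:

```python
def wal_throughput(prev_bytes: int, curr_bytes: int, seconds: float) -> float:
    """Average WAL generation in bytes per second between two samples."""
    return (curr_bytes - prev_bytes) / seconds

# Two samples of the total_bytes query taken 60 seconds apart:
rate = wal_throughput(52_000_000_000, 52_480_000_000, 60)
print(f"{rate / 1024 / 1024:.1f} MiB/s")  # 7.6 MiB/s
```

Stored over months, these rates reveal growth trends that raw database size cannot, since size ignores rows that were updated or deleted.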
Tools of the Trade
Just as there’s always money in the banana stand, there are always tools for much of this work. Here are just a few:
However, last we checked, none of these tools supports the WAL throughput query above, and most don’t track temporary file usage. So we recommend sticking with a full monitoring stack that allows specifying custom queries at the collection level.
Something like one of these tools, for example:
These are both highly extensible monitoring harnesses that allow plugging in any of the above queries, and much more.
No matter how you go, have something in place. It’s the best way to ensure that, as the world falls down around you, Postgres will be just fine.