On Rocks and Sand
When working with database capacity planning, there are a lot of variables to consider, and Postgres is no different in this regard. One of the elements which requires management is storage. However, there’s an aspect of storage that escapes inspection almost without exception, and it’s hiding in the shadows between the columns themselves.
In most low-level computer languages like C, in which the venerable Postgres is written, data types are addressed by their maximum size, regardless of how large the stored values actually are. Thus a standard 32-bit integer, which can store a value of just over 2 billion, must be read as a whole unit. This means even a value of 0 requires 4 bytes of storage.
Further, Postgres is designed such that its own internal natural alignment is 8 bytes, meaning consecutive fixed-length columns of differing size must be padded with empty bytes in some cases. We can see that with this example:
SELECT pg_column_size(row()) AS empty,
       pg_column_size(row(0::SMALLINT)) AS byte2,
       pg_column_size(row(0::BIGINT)) AS byte8,
       pg_column_size(row(0::SMALLINT, 0::BIGINT)) AS byte16;

 empty | byte2 | byte8 | byte16 
-------+-------+-------+--------
    24 |    26 |    32 |     40
This suggests that an empty Postgres row requires 24 bytes of various header elements, a SMALLINT is 2 bytes, a BIGINT is 8 bytes, and combining them is… 16 bytes? That’s no mistake; Postgres is padding the smaller column to match the size of the following column for alignment purposes. Instead of 2 + 8 = 10, our math becomes 8 + 8 = 16.
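The placement rule itself is simple arithmetic: each fixed-width column starts at the next offset that is a multiple of its alignment. A toy sketch in Python (a model of the layout described here, not Postgres internals) reproduces the 8 + 8 = 16 result:

```python
def row_data_size(columns):
    """Estimate the data-area size of a row of fixed-width columns.

    `columns` is a list of (size, alignment) pairs. Each column is
    placed at the next offset that is a multiple of its alignment,
    mirroring the padding behavior shown by pg_column_size.
    """
    offset = 0
    for size, align in columns:
        offset += (-offset) % align  # pad up to the alignment boundary
        offset += size
    return offset

# SMALLINT (2 bytes, align 2) followed by BIGINT (8 bytes, align 8):
# the SMALLINT is padded with 6 bytes so the BIGINT starts at offset 8.
print(row_data_size([(2, 2), (8, 8)]))  # 16
# In the opposite order, no padding is needed at all.
print(row_data_size([(8, 8), (2, 2)]))  # 10
```

Note that simply swapping the two columns eliminates the padding entirely, which is the core trick the rest of this article exploits.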
By itself, this may not necessarily be a problem. But consider a contrived ordering system with this table:
CREATE TABLE user_order (
    is_shipped    BOOLEAN NOT NULL DEFAULT false,
    user_id       BIGINT NOT NULL,
    order_total   NUMERIC NOT NULL,
    order_dt      TIMESTAMPTZ NOT NULL,
    order_type    SMALLINT NOT NULL,
    ship_dt       TIMESTAMPTZ,
    item_ct       INT NOT NULL,
    ship_cost     NUMERIC,
    receive_dt    TIMESTAMPTZ,
    tracking_cd   TEXT,
    id            BIGSERIAL PRIMARY KEY NOT NULL
);
If it looks a little weird, that’s because it is. It’s not uncommon for column order to be dictated by a hurried dev simply jotting down attributes, or by an ORM that generates output from an arbitrary hashed key position. Seeing the id column at the end is a good indicator that column order wasn’t part of the architecture or planning roadmap.
Given that, this is what Postgres sees:
SELECT a.attname, t.typname, t.typalign, t.typlen
  FROM pg_class c
  JOIN pg_attribute a ON (a.attrelid = c.oid)
  JOIN pg_type t ON (t.oid = a.atttypid)
 WHERE c.relname = 'user_order'
   AND a.attnum >= 0
 ORDER BY a.attnum;

   attname   |   typname   | typalign | typlen 
-------------+-------------+----------+--------
 is_shipped  | bool        | c        |      1
 user_id     | int8        | d        |      8
 order_total | numeric     | i        |     -1
 order_dt    | timestamptz | d        |      8
 order_type  | int2        | s        |      2
 ship_dt     | timestamptz | d        |      8
 item_ct     | int4        | i        |      4
 ship_cost   | numeric     | i        |     -1
 receive_dt  | timestamptz | d        |      8
 tracking_cd | text        | i        |     -1
 id          | int8        | d        |      8
The typalign column dictates the intended alignment according to the values listed in the pg_type manual page.
From the earlier discussion, we can see that the INT types trivially map to their respective byte sizes. For TEXT, things are a bit trickier, but we’ll need to address that later. For now, consider the constant size transitions and the effect they may have on alignment.
A Little Padded Room
To avoid this insanity, Postgres pads each smaller column to match the alignment of the next consecutive column. As a result of this particular column arrangement, there’s a slight amount of buffering between almost every single column pair.
Let’s fill the table with one million rows and check the resulting size:
INSERT INTO user_order (
    is_shipped, user_id, order_total, order_dt, order_type,
    ship_dt, item_ct, ship_cost, receive_dt, tracking_cd
)
SELECT true, 1000, 500.00, now() - INTERVAL '7 days',
       3, now() - INTERVAL '5 days', 10, 4.99,
       now() - INTERVAL '3 days',
       'X5901324123479RROIENSTBKCV4'
  FROM generate_series(1, 1000000);

SELECT pg_relation_size('user_order') AS size_bytes,
       pg_size_pretty(pg_relation_size('user_order')) AS size_pretty;

 size_bytes | size_pretty 
------------+-------------
  141246464 | 135 MB
Now we can use that as the benchmark we want to beat. The next question is: how much of that is padding? Here’s a minimum estimate of what we’re wasting, based on the declared alignments:
- 7 bytes between is_shipped and user_id
- 4 bytes between order_total and order_dt
- 6 bytes between order_type and ship_dt
- 4 bytes between ship_cost and receive_dt
So we’re probably losing around 21 bytes per row, with the declared types accounting for 59 bytes of actual space, for a total row length of 80 bytes before row overhead. Again, this is based only on the alignment padding. It turns out the TEXT columns skew the totals a bit more than the alignments suggest, but this will work fine for an estimate.
If we’re even close, that means we may be able to shrink the table by 26%, or about 37MB.
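The arithmetic behind that claim is worth spelling out. A quick back-of-the-envelope check (plain Python, using the per-row estimate and measured table size from above):

```python
# Padding estimate per row, from the four gaps listed above.
padding = 7 + 4 + 6 + 4        # 21 bytes of alignment padding per row
declared = 59                  # bytes of declared column data per row
row_data = declared + padding  # 80 bytes per row, excluding row overhead

table_bytes = 141246464        # pg_relation_size from the benchmark

wasted_fraction = padding / row_data
print(100 * padding / row_data)                # 26.25 (percent)
print(round(table_bytes * wasted_fraction))    # roughly 37 million bytes
```

That 37 million bytes of estimated padding is where the "about 37MB" figure comes from (a little over 35 MB if you count in binary megabytes).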
Some Ground Rules
The trick to performing this kind of surgery is finding the column arrangement that results in the fewest added bytes. To do that, we really need to consider the TEXT columns. Because these are variable-length types, they get special treatment. As an example, consider this regarding NUMERIC:
SELECT pg_column_size(row()) AS empty_row,
       pg_column_size(row(0::NUMERIC)) AS no_val,
       pg_column_size(row(1::NUMERIC)) AS no_dec,
       pg_column_size(row(9.9::NUMERIC)) AS with_dec,
       pg_column_size(row(1::INT2, 1::NUMERIC)) AS col2,
       pg_column_size(row(1::INT4, 1::NUMERIC)) AS col4,
       pg_column_size(row(1::NUMERIC, 1::INT4)) AS round8;

 empty_row | no_val | no_dec | with_dec | col2 | col4 | round8 
-----------+--------+--------+----------+------+------+--------
        24 |     27 |     29 |       31 |   31 |   33 |     36
These results suggest we can treat NUMERIC as unaligned, with some caveats. Even a single digit in a NUMERIC requires 5 bytes, but it also doesn’t impact the previous column the way a BIGINT would.
Here’s the same concept with TEXT:

SELECT pg_column_size(row()) AS empty_row,
       pg_column_size(row(''::TEXT)) AS no_text,
       pg_column_size(row('a'::TEXT)) AS min_text,
       pg_column_size(row(1::INT4, 'a'::TEXT)) AS two_col,
       pg_column_size(row('a'::TEXT, 1::INT4)) AS round4;

 empty_row | no_text | min_text | two_col | round4 
-----------+---------+----------+---------+--------
        24 |      25 |       26 |      30 |     32
From this, we can see that variable-length types round to the nearest 4 bytes based on the alignment of the next column. This means we can chain variable-length columns all day long without introducing padding except at the right boundary. Consequently, we can deduce that variable-length columns introduce no bloat so long as they’re at the end of a column listing.
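A small offset model makes the two TEXT results above concrete. Note the assumptions: the 24-byte row header is treated as a constant, and a one-character TEXT value is modeled as a 2-byte "short varlena" (1-byte header plus the data) with no alignment requirement of its own. This is a sketch of the behavior observed above, not an official layout API:

```python
def row_size(columns, header=24):
    """Toy model: row header plus columns placed at alignment boundaries.

    `columns` is a list of (size, alignment) pairs.
    """
    offset = header
    for size, align in columns:
        offset += (-offset) % align  # pad up to the alignment boundary
        offset += size
    return offset

TEXT_A = (2, 1)  # 'a'::TEXT assumed stored as a 2-byte short varlena
INT4 = (4, 4)

# TEXT after the INT packs right up against it: 24 + 4 + 2 = 30.
print(row_size([INT4, TEXT_A]))  # 30, matching two_col above
# An INT after the TEXT pads the TEXT out to a 4-byte boundary:
# 24 + 2 -> padded to 28, then + 4 = 32.
print(row_size([TEXT_A, INT4]))  # 32, matching round4 above
```

The model only charges padding when a fixed-width column follows the variable one, which is exactly why trailing variable-length columns are free.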
Since the fixed-length column types are padded based on the next column, that implies the largest types should come first. Barring that, we can “pack” columns such that consecutive columns consume 8 bytes cumulatively.
Again, we can see that in action:
SELECT pg_column_size(row()) AS empty_row,
       pg_column_size(row(1::SMALLINT)) AS int2,
       pg_column_size(row(1::INT)) AS int4,
       pg_column_size(row(1::BIGINT)) AS int8,
       pg_column_size(row(1::SMALLINT, 1::BIGINT)) AS padded,
       pg_column_size(row(1::INT, 1::INT, 1::BIGINT)) AS not_padded;

 empty_row | int2 | int4 | int8 | padded | not_padded 
-----------+------+------+------+--------+------------
        24 |   26 |   28 |   32 |     40 |         40
The previous sections are a fancy way of saying, “Sort the columns by their type length as defined in pg_type.” Luckily enough, we can get that information by slightly adjusting the query we used to output the column types:
SELECT a.attname, t.typname, t.typalign, t.typlen
  FROM pg_class c
  JOIN pg_attribute a ON (a.attrelid = c.oid)
  JOIN pg_type t ON (t.oid = a.atttypid)
 WHERE c.relname = 'user_order'
   AND a.attnum >= 0
 ORDER BY t.typlen DESC;

   attname   |   typname   | typalign | typlen 
-------------+-------------+----------+--------
 id          | int8        | d        |      8
 user_id     | int8        | d        |      8
 order_dt    | timestamptz | d        |      8
 ship_dt     | timestamptz | d        |      8
 receive_dt  | timestamptz | d        |      8
 item_ct     | int4        | i        |      4
 order_type  | int2        | s        |      2
 is_shipped  | bool        | c        |      1
 tracking_cd | text        | i        |     -1
 ship_cost   | numeric     | i        |     -1
 order_total | numeric     | i        |     -1
All things being equal, we can mix and match columns with matching type lengths, and if we’re brave or want a prettier column ordering, we can combine types of shorter lengths where necessary.
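The sorting rule is mechanical enough to express in a few lines. A sketch (column names and type lengths copied from the query output above; Python's stable sort keeps tied columns in declaration order, which is why the 8-byte group may differ slightly from the Postgres output):

```python
# (attname, typlen) pairs from pg_attribute/pg_type; -1 means variable length.
columns = [
    ("is_shipped", 1), ("user_id", 8), ("order_total", -1),
    ("order_dt", 8), ("order_type", 2), ("ship_dt", 8),
    ("item_ct", 4), ("ship_cost", -1), ("receive_dt", 8),
    ("tracking_cd", -1), ("id", 8),
]

# Sorting by typlen descending puts the 8-byte types first and,
# conveniently, the variable-length (-1) types last -- the Python
# equivalent of ORDER BY t.typlen DESC.
ordered = sorted(columns, key=lambda col: -col[1])
print([name for name, _ in ordered])
```

Variable-length columns fall to the end of the list automatically, since their typlen of -1 sorts below every fixed length.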
Let’s see what happens if we go with this table design:
DROP TABLE user_order;

CREATE TABLE user_order (
    id            BIGSERIAL PRIMARY KEY NOT NULL,
    user_id       BIGINT NOT NULL,
    order_dt      TIMESTAMPTZ NOT NULL,
    ship_dt       TIMESTAMPTZ,
    receive_dt    TIMESTAMPTZ,
    item_ct       INT NOT NULL,
    order_type    SMALLINT NOT NULL,
    is_shipped    BOOLEAN NOT NULL DEFAULT false,
    order_total   NUMERIC NOT NULL,
    ship_cost     NUMERIC,
    tracking_cd   TEXT
);
If we repeat the previous insert of one million rows, the new table size is 117,030,912 bytes, or roughly 112MB. By simply reorganizing the table columns, we’ve saved about 17% of the total space; put another way, the original table was nearly 21% larger than the reordered one.
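A quick sanity check of those savings figures, using the two measured table sizes:

```python
before = 141246464  # bytes with the original, haphazard column order
after = 117030912   # bytes after reordering the columns

saved = before - after
print(saved)                            # 24215552 bytes, about 23 MB
print(round(100 * saved / before, 1))   # 17.1 -- percent of the original size
print(round(100 * saved / after, 1))    # 20.7 -- how much larger the old table was
```

Note the estimate from earlier (26%, based on declared alignments alone) overshoots the measured savings, since the variable-length columns don't waste as much as the worst-case padding math suggests.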
This may not mean much for a single table, but repeated on every table in a database instance, it could result in a sizeable reduction in storage consumption. In a data warehouse context where data is commonly loaded once and never modified again, a 10-20% storage reduction would be worth considering simply due to the scales involved. I’ve seen 60TB Postgres databases; imagine reducing that by 6-12TB without actually removing any data.
Much like filling a jar with rocks, pebbles, and sand, the most efficient way to declare a Postgres table is by column alignment type. Bigger columns first, medium columns next, small columns last, and weird exceptions like TEXT tacked onto the end as if they were dust in our analogy. That’s what we get for playing with pointers.
As it stands, a more “natural” declaration of the table columns is probably along these lines:
CREATE TABLE user_order (
    id            BIGSERIAL PRIMARY KEY NOT NULL,
    user_id       BIGINT NOT NULL,
    order_type    SMALLINT NOT NULL,
    order_total   NUMERIC NOT NULL,
    order_dt      TIMESTAMPTZ NOT NULL,
    item_ct       INT NOT NULL,
    ship_dt       TIMESTAMPTZ,
    is_shipped    BOOLEAN NOT NULL DEFAULT false,
    ship_cost     NUMERIC,
    tracking_cd   TEXT,
    receive_dt    TIMESTAMPTZ
);
In this case, it just so happens we are 8MB away from the ideal, or about 7.7% larger. For the sake of demonstration, we intentionally botched the column ordering to make a point. A real table will probably sit somewhere between the best and worst case scenarios, and the only way to ensure the least amount of leakage is to manually reorder columns.
Some might ask why this isn’t built into Postgres. Surely it knows the ideal column ordering and has the power to decouple a user’s visible mapping from what actually hits the disk. That’s a legitimate question, but it’s a lot more difficult to answer and involves a lot of bike-shedding.
One major benefit from decoupling physical from logical representation is that Postgres would finally allow column reordering, or adding a column in a specific position. If a user wants their column listing to look pretty after multiple modifications, why not let them?
It’s all about priorities. There’s been a TODO item to address this going back to at least 2006. Patches have gone back and forth since then, and every time, the conversation eventually ends without a definitive conclusion. It’s clearly a difficult problem to address, and there are, as they say, bigger fish to fry.
Given sufficient demand, someone will sponsor a patch to completion, even if it requires multiple Postgres versions for the necessary under-the-hood changes to manifest. Until then, a simple query can magically reveal the ideal column ordering if the impact is pressing enough for a particular use case.
To those who love that kind of low-level tweaking, let us know if it was worth the effort. We’d love to hear your stories!
It seems like a simpler and easier fix would be to add an argument to pg_dump to do optimal reordering. Then the column re-ordering can be tested as part of a dump and restore upgrade.
Thanks, this is a very informative article that explains in an easy-to-understand way how Postgres stores data and fills the storage with unused bytes. The ability to reorder columns is an important missing feature. After some time of application lifetime and various database modifications, the crazy column order you presented in your example may not be far off from the truth. Working with such table structures in database administration tools then becomes clumsy, as you end up with related columns far away from one another instead of being grouped logically in close proximity.
Given the padding and space issues it would be logical to disconnect the internal column order from presentational order. And we should have the ability to:
1. change presentational column order with ALTER TABLE (without table rewrite)
2. run a table rewrite that would automatically determine the most efficient internal column order and apply it to storage while preserving the presentational order – something like OPTIMIZE TABLE COLUMN ORDER – or maybe simply do it as part of VACUUM FULL.
I hope the long standing TODO item will find some love among the developers!
Thanks. Very interesting.
Where should jsonb column be placed, and why?
Thanks, very informative post. Just wondering if the same optimization could be applied to MySQL?
Very informative and useful, though it is even harder to keep to this strategy as the database evolves (new columns are added).
Would it make a difference in the ordering of columns if I know some of the columns will probably have null values in most rows?