Partitioning a large table without a long-running lock
Let’s say you have an application with a huge table that needs to be available all the time. It has grown so big that managing it without partitioning is getting increasingly difficult. But you can’t take the table offline to build a partitioned version, because copying that much data would take a great deal of time.
Here is a recipe for dealing with the problem. It won’t necessarily work for every situation, particularly tables with very heavy write loads, but it could work for many.
First let’s set up our sample table and populate it with some data, 100,000 rows in this case:
create table orig_table
( id serial not null,
  data float default random()
);

create index orig_data_index on orig_table(data);
create index orig_id_index on orig_table(id);

insert into orig_table (id)
select nextval('orig_table_id_seq')
from generate_series(1,100000);
Now we’re going to set up the partitioning structure. In this case we’re going to use four ranges on the data field:
create table part_table
(like orig_table including defaults including indexes including constraints)
partition by range(data);

create table part_table_p1 partition of part_table
for values from (minvalue) to (0.25);
create table part_table_p2 partition of part_table
for values from (0.25) to (0.5);
create table part_table_p3 partition of part_table
for values from (0.5) to (0.75);
create table part_table_p4 partition of part_table
for values from (0.75) to (maxvalue);
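To double-check the layout before going any further, something like the following lists each partition with its bounds. It’s only a sketch that reads the system catalogs directly; psql’s \d+ part_table shows much the same information.

```sql
-- List the partitions of part_table together with their bounds.
select c.relname as partition_name,
       pg_get_expr(c.relpartbound, c.oid) as bounds
from pg_inherits i
join pg_class c on c.oid = i.inhrelid
where i.inhparent = 'part_table'::regclass
order by c.relname;
```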
We’re going to rename the original table and then create a view with that name which is a union of the rows in the new partitioned table and the old non-partitioned table. But before that, we’ll need a trigger function to handle all the insert, update and delete operations for the view.
create or replace function part_v_trigger()
returns trigger
language plpgsql
as
$TRIG$
begin
    IF TG_OP = 'INSERT'
    THEN
        -- new rows always go to the partitioned table
        INSERT INTO part_table VALUES(NEW.id, NEW.data);
        RETURN NEW;
    ELSIF TG_OP = 'DELETE'
    THEN
        -- the row could be in either table
        DELETE FROM part_table WHERE id = OLD.id;
        DELETE FROM old_orig_table WHERE id = OLD.id;
        RETURN OLD;
    ELSE -- UPDATE
        -- if the row is still in the old table, move it
        DELETE FROM old_orig_table WHERE id = OLD.id;
        IF FOUND
        THEN
            INSERT INTO part_table VALUES(NEW.id, NEW.data);
        ELSE
            UPDATE part_table SET id = NEW.id, data = NEW.data
                WHERE id = OLD.id;
        END IF;
        RETURN NEW;
    END IF;
end
$TRIG$;
Then we can move to the transitional setup in one quick transaction. Since we won’t be adding new tuples to the old non-partitioned table any more, we disable autovacuum on it.
BEGIN;

ALTER TABLE orig_table RENAME TO old_orig_table;

ALTER TABLE old_orig_table SET(
    autovacuum_enabled = false, toast.autovacuum_enabled = false
);

CREATE VIEW orig_table AS
SELECT id, data FROM old_orig_table
UNION ALL
SELECT id, data FROM part_table
;

CREATE TRIGGER orig_table_part_trigger
INSTEAD OF INSERT OR UPDATE OR DELETE on orig_table
FOR EACH ROW
EXECUTE FUNCTION part_v_trigger();

COMMIT;
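One thing to watch for: a view’s columns don’t pick up the defaults of the underlying tables, so an INSERT through the view that omits id or data would hand the trigger NULLs. A minimal sketch of one way to handle that, assuming the application relies on those defaults, is to give the view columns the same defaults, ideally inside the transaction above:

```sql
-- Assumption: the application inserts rows without supplying id and/or data.
-- View column defaults are substituted before the INSTEAD OF trigger fires.
ALTER VIEW orig_table ALTER COLUMN id SET DEFAULT nextval('orig_table_id_seq');
ALTER VIEW orig_table ALTER COLUMN data SET DEFAULT random();
```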
Note that all inserts and updates are steered to the partitioned table even if the row being updated is from the old table. We’re going to use that fact to move all the old rows in batches. What we need is a looping program that selects a small number of old table rows to move and updates them so that they are moved. Here is the sample program I used – it’s written in Perl but should be pretty easy for most readers to follow even if not Perl-savvy.
#! /bin/perl

use strict;
use DBI;

my $move_rows = qq{
  WITH oldkeys AS
  (
    SELECT id
    FROM old_orig_table
    LIMIT 10000
  )
  UPDATE orig_table
  SET id = id
  WHERE ID IN (SELECT id FROM oldkeys)
};

my $dbh = DBI->connect("dbi:Pg:dbname=tpart;host=/tmp;port=5711",
                       '','',{AutoCommit => 0, RaiseError => 1, PrintError => 0}
                      );

my $rows_done;
do
{
  $rows_done = $dbh->do($move_rows);
  $dbh->commit;
  if ($rows_done != 0) # it will be 0e0 which is 0 but true
  {
    sleep 2;
  }
} until $rows_done == 0 || ! $rows_done;
print "done\n";
$dbh->disconnect;
This program can be safely interrupted if necessary, since each batch is committed in its own transaction. There are other ways of writing it. If speed is an issue, a slightly more complex piece of SQL can be used which avoids calling the trigger.
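To see how far the move has got while the batches are running, a quick count of both tables is enough. This is just a sketch; the column aliases are mine, and on a really big table the counts themselves will take a while.

```sql
-- Rows still waiting in the old table versus rows already in the partitions.
-- Note that part_table also receives new inserts made through the view.
select (select count(*) from old_orig_table) as rows_remaining,
       (select count(*) from part_table) as rows_in_partitions;
```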
Once there are no more rows left in the original table, we can replace the view with the fully partitioned table. In a separate transaction (because it can take some time and it’s not critical) we finally drop the old non-partitioned table.
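Before making the swap it’s worth confirming that the old table really is empty. A minimal sketch of such a check follows; the error message wording is just an illustration. Since the trigger never writes to old_orig_table, once it is empty it stays empty.

```sql
-- Abort with an error if any rows are still left in the old table.
do $$
begin
    if exists (select 1 from old_orig_table) then
        raise exception 'old_orig_table still has rows, keep moving batches';
    end if;
end
$$;
```

With that confirmed, the swap itself looks like this: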
BEGIN;

DROP VIEW orig_table CASCADE;

DROP FUNCTION part_v_trigger();

ALTER SEQUENCE orig_table_id_seq OWNED BY part_table.id;

ALTER TABLE part_table RENAME TO orig_table;

COMMIT;

BEGIN;

DROP TABLE old_orig_table;

COMMIT;
Our application should have remained fully functional and blissfully unaware of the changes we have been making under the hood while we were making them.
Hi
Thank you for writing this up. Some cool tricks in here.
I’d like to ask: why did you say that this may not work for write-heavy workloads?
Has this got something to do with the triggers?
Thanks!
Yeah, the trigger could impose a heavy performance penalty.
Instead of using the common table expression oldkeys and the update on orig_table, I found this to perform better:
BEGIN;
CREATE TEMPORARY TABLE oldkeys(id bigint);
INSERT INTO oldkeys (id) SELECT id FROM old_orig_table LIMIT 10000;
INSERT INTO part_table(id, data)
SELECT id, data
FROM public.old_orig_table
WHERE ID IN (SELECT id FROM oldkeys);
DELETE FROM old_orig_table WHERE ID IN (SELECT id FROM oldkeys);
COMMIT;
Interesting, thanks for the info. You should probably create the temp table with ON COMMIT DROP.
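For reference, the batch with that change might look like this. It’s a sketch in which only the ON COMMIT DROP clause is new, so that the temp table is gone by the time the next batch tries to create it:

```sql
BEGIN;
-- The temp table is dropped automatically at COMMIT.
CREATE TEMPORARY TABLE oldkeys(id bigint) ON COMMIT DROP;
INSERT INTO oldkeys (id) SELECT id FROM old_orig_table LIMIT 10000;
-- Copy the selected rows into the partitioned table, then remove them.
INSERT INTO part_table(id, data)
SELECT id, data
FROM old_orig_table
WHERE id IN (SELECT id FROM oldkeys);
DELETE FROM old_orig_table WHERE id IN (SELECT id FROM oldkeys);
COMMIT;
```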
Actually, the reason this works faster is probably not because of the temp table use but because it’s avoiding the trigger.
Here’s a version of the query that still uses CTEs but avoids use of the trigger:
WITH olddata AS
(
SELECT *
FROM old_orig_table
LIMIT 10000
),
delold AS
(
DELETE
FROM old_orig_table
WHERE ID IN (SELECT id FROM olddata)
)
INSERT INTO part_table
SELECT *
FROM olddata;
Shouldn’t the last section be:
WITH olddata AS
(
SELECT *
FROM old_orig_table
LIMIT 10000
),
delold AS
(
DELETE
FROM old_orig_table
WHERE ID IN (SELECT id FROM olddata)
RETURNING id, data
)
INSERT INTO part_table
SELECT *
FROM delold;
You could write it that way. I think either way will work.