PostgreSQL for IoT Data Retention and Archiving
The IoT revolution is generating enormous amounts of data. With data growing briskly, and most of it being append-only time series, relational databases and DBAs face a rather tough task: storing, maintaining, archiving and, in some cases, getting rid of old data in an efficient manner. In my previous posts, I talked about various strategies and techniques for better data storage scalability using PostgreSQL and the Postgres-BDR extension. Data retention is becoming ever more important, so let’s see what PostgreSQL 10 and above have to offer to manage your data retention needs efficiently.
PostgreSQL has supported time-based partitioning in some form for quite a while, but it wasn’t part of core PostgreSQL. PostgreSQL 10 made a major improvement in this area by introducing declarative partitioning.
Because Postgres-BDR runs as an extension on top of PostgreSQL 10 and above, all of the partitioning features and improvements of PostgreSQL 10 and above are available in Postgres-BDR as well. From an implementation perspective in the IoT domain, this means you can now create and manage partitions over time and space in PostgreSQL for multimaster environments using the Postgres-BDR extension.
So let’s see how easy it is to implement data retention policies with PostgreSQL’s partitioning improvements.
CREATE TABLE iot_temperature_sensor_data (
    ts        timestamp without time zone,
    device_id text,
    reading   float
) PARTITION BY RANGE (ts);
PARTITION BY RANGE tells PostgreSQL that we are partitioning this table by range using the column ts.
Let’s create some partitions:
CREATE TABLE iot_temperature_sensor_data_2018_february PARTITION OF iot_temperature_sensor_data
    FOR VALUES FROM ('2018-02-01') TO ('2018-03-01');
CREATE TABLE iot_temperature_sensor_data_2018_march PARTITION OF iot_temperature_sensor_data
    FOR VALUES FROM ('2018-03-01') TO ('2018-04-01');
CREATE TABLE iot_temperature_sensor_data_2018_april PARTITION OF iot_temperature_sensor_data
    FOR VALUES FROM ('2018-04-01') TO ('2018-05-01');
CREATE TABLE iot_temperature_sensor_data_2018_may PARTITION OF iot_temperature_sensor_data
    FOR VALUES FROM ('2018-05-01') TO ('2018-06-01');
CREATE TABLE iot_temperature_sensor_data_2018_june PARTITION OF iot_temperature_sensor_data
    FOR VALUES FROM ('2018-06-01') TO ('2018-07-01');
Alright, so our sensor data storage table is now partitioned over time:

\d+ iot_temperature_sensor_data
                               Table "public.iot_temperature_sensor_data"
  Column   |            Type             | Collation | Nullable | Default | Storage  | Stats target | Description
-----------+-----------------------------+-----------+----------+---------+----------+--------------+-------------
 ts        | timestamp without time zone |           |          |         | plain    |              |
 device_id | text                        |           |          |         | extended |              |
 reading   | double precision            |           |          |         | plain    |              |
Partition key: RANGE (ts)
Partitions: iot_temperature_sensor_data_2018_april FOR VALUES FROM ('2018-04-01 00:00:00') TO ('2018-05-01 00:00:00'),
            iot_temperature_sensor_data_2018_february FOR VALUES FROM ('2018-02-01 00:00:00') TO ('2018-03-01 00:00:00'),
            iot_temperature_sensor_data_2018_june FOR VALUES FROM ('2018-06-01 00:00:00') TO ('2018-07-01 00:00:00'),
            iot_temperature_sensor_data_2018_march FOR VALUES FROM ('2018-03-01 00:00:00') TO ('2018-04-01 00:00:00'),
            iot_temperature_sensor_data_2018_may FOR VALUES FROM ('2018-05-01 00:00:00') TO ('2018-06-01 00:00:00')
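To sanity-check the setup, we can insert a row through the parent table and confirm that PostgreSQL routes it to the matching partition. The device id and reading below are made-up illustrative values:

-- Rows inserted via the parent are routed to the matching partition automatically.
INSERT INTO iot_temperature_sensor_data VALUES ('2018-04-15 10:00:00', 'device-42', 21.7);

-- tableoid reveals which partition actually stores each row; for this row it
-- should show iot_temperature_sensor_data_2018_april.
SELECT tableoid::regclass, * FROM iot_temperature_sensor_data;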
So what happens when data grows and we need to either remove old data or archive it?
The simplest option for removing old data is to drop the partition that is no longer necessary:
DROP TABLE iot_temperature_sensor_data_2018_april;
This can very quickly delete millions of records because it doesn’t have to individually delete every record. Note however that the above command requires taking an ACCESS EXCLUSIVE lock on the parent table.
While this approach can help you get rid of old data quickly, an often preferable option is to remove the partition from the partitioned table while retaining access to it as a table in its own right:
ALTER TABLE iot_temperature_sensor_data DETACH PARTITION iot_temperature_sensor_data_2018_april;
This allows further operations to be performed on the data before it is dropped. For example, this is often a useful time to back up the data using COPY, pg_dump, or similar tools. It might also be a useful time to aggregate data into smaller formats, perform other data manipulations, or run reports.
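As a minimal sketch, the detached partition could be archived with COPY before being dropped. The output path below is purely illustrative and must be writable by the server process; from a client, psql’s \copy is the usual alternative:

-- Archive the detached partition to a CSV file, then drop it.
COPY iot_temperature_sensor_data_2018_april
    TO '/var/lib/postgresql/archive/iot_2018_april.csv' WITH (FORMAT csv, HEADER);
DROP TABLE iot_temperature_sensor_data_2018_april;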
Partition drop and detach tasks can be automated via scheduled jobs to make sure data archival and removal needs are taken care of automatically. So if your retention policy requires archiving or removing data once it is 3 months old, a cron job could run DROP TABLE or DETACH PARTITION on partitions that have grown older than 3 months, as in the sketch below.
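Here is one minimal sketch of such a job, written as a DO block that a cron entry could execute via psql. It pulls each partition’s upper bound out of its FOR VALUES clause with a regexp, which assumes simple single-column timestamp range partitions; a DEFAULT partition (which has no bounds) is skipped because its extracted bound is NULL:

DO $$
DECLARE
    r record;
BEGIN
    FOR r IN
        SELECT c.oid::regclass AS part,
               -- Extract the upper bound from e.g. "... TO ('2018-05-01 00:00:00')".
               (regexp_match(pg_get_expr(c.relpartbound, c.oid),
                             'TO \(''([^'']+)''\)'))[1]::timestamp AS upper_bound
        FROM pg_inherits i
        JOIN pg_class c ON c.oid = i.inhrelid
        WHERE i.inhparent = 'iot_temperature_sensor_data'::regclass
    LOOP
        IF r.upper_bound < now() - interval '3 months' THEN
            EXECUTE format('ALTER TABLE iot_temperature_sensor_data DETACH PARTITION %s', r.part);
        END IF;
    END LOOP;
END $$;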
Enter pg_partman
If you need further ease of use for your data retention policy needs, pg_partman makes it easier to create and maintain partitions. So, for example, to achieve something similar to what we did above, we first create the partitioned parent table, this time in an iot schema:
CREATE TABLE iot.iot_temperature_sensor_data (
    ts        timestamp without time zone,
    device_id text,
    reading   float
) PARTITION BY RANGE (ts);
Again, let’s create some partitions:
SELECT partman.create_parent('iot.iot_temperature_sensor_data','ts','native','monthly');
By default, pg_partman will create one time partition for the current month, four partitions for the past four months, and four more for the four months in the future. You can adjust p_start_partition and p_premake to override those defaults when you need more partitions for backfilling old data or want more partitions created into the future, respectively (see the sketch after the output below).

\d+ iot.iot_temperature_sensor_data
                               Table "iot.iot_temperature_sensor_data"
  Column   |            Type             | Collation | Nullable | Default | Storage  | Stats target | Description
-----------+-----------------------------+-----------+----------+---------+----------+--------------+-------------
 ts        | timestamp without time zone |           |          |         | plain    |              |
 device_id | text                        |           |          |         | extended |              |
 reading   | double precision            |           |          |         | plain    |              |
Partition key: RANGE (ts)
Partitions: iot_temperature_sensor_data_p2018_03 FOR VALUES FROM ('2018-03-01 00:00:00') TO ('2018-04-01 00:00:00'),
            iot_temperature_sensor_data_p2018_04 FOR VALUES FROM ('2018-04-01 00:00:00') TO ('2018-05-01 00:00:00'),
            iot_temperature_sensor_data_p2018_05 FOR VALUES FROM ('2018-05-01 00:00:00') TO ('2018-06-01 00:00:00'),
            iot_temperature_sensor_data_p2018_06 FOR VALUES FROM ('2018-06-01 00:00:00') TO ('2018-07-01 00:00:00'),
            iot_temperature_sensor_data_p2018_07 FOR VALUES FROM ('2018-07-01 00:00:00') TO ('2018-08-01 00:00:00'),
            iot_temperature_sensor_data_p2018_08 FOR VALUES FROM ('2018-08-01 00:00:00') TO ('2018-09-01 00:00:00'),
            iot_temperature_sensor_data_p2018_09 FOR VALUES FROM ('2018-09-01 00:00:00') TO ('2018-10-01 00:00:00'),
            iot_temperature_sensor_data_p2018_10 FOR VALUES FROM ('2018-10-01 00:00:00') TO ('2018-11-01 00:00:00'),
            iot_temperature_sensor_data_p2018_11 FOR VALUES FROM ('2018-11-01 00:00:00') TO ('2018-12-01 00:00:00')
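For instance, a sketch of overriding both knobs using named-argument notation; the exact parameter list can vary between pg_partman versions, so check \df partman.create_parent on your installation:

SELECT partman.create_parent(
    p_parent_table    := 'iot.iot_temperature_sensor_data',
    p_control         := 'ts',
    p_type            := 'native',
    p_interval        := 'monthly',
    p_premake         := 6,              -- create 6 future monthly partitions instead of 4
    p_start_partition := '2018-01-01'    -- backfill partitions starting from January 2018
);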
Now, if we wanted to implement a 3-month data retention policy with pg_partman, it would be as simple as:
UPDATE partman.part_config
SET retention = '3 month'
WHERE parent_table = 'iot.iot_temperature_sensor_data';

and then a cron job to execute the retention task:

SELECT partman.run_maintenance(p_analyze := false);
Each time it runs, the cron job will automatically detach any partitions that have grown older than 3 months. Note that this only detaches those partitions; if you want to remove them altogether, you need to specify that when setting the retention policy:
UPDATE partman.part_config SET retention_keep_table = false, retention = '3 month' WHERE parent_table = 'iot.iot_temperature_sensor_data';
retention_keep_table is a boolean that determines whether child tables removed by the retention policy are merely detached and kept around, or actually dropped.
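Since run_maintenance reads its instructions from part_config, you can double-check what the retention job will do by reading the settings back:

-- Inspect the retention configuration for our partition set.
SELECT parent_table, retention, retention_keep_table
FROM partman.part_config
WHERE parent_table = 'iot.iot_temperature_sensor_data';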