Using Window Functions for Time Series IoT Analytics in Postgres-BDR
Internet of Things tends to generate large volumes of data at a great velocity. Often times this data is collected from geographically distributed sources and aggregated at a central location for data scientists to perform their magic i.e. find patterns, trends and make predictions.
Let’s explore what the IoT Solution using Postgres-BDR has to offer for Data Analytics. Postgres-BDR is offered as an extension on top of PostgreSQL 10 and above. It is not a fork. Therefore, we get the power of complete set of analytic functions that PostgreSQL has to offer. In particular, I am going to play around with PostgreSQL’s Window Functions here to analyze a sample of time series data from temperature sensors.
Let’s take an example of IoT temperature sensor time series data spread over a period of 7 days. In a typical scenario, temperature sensors are sending readings every minute. Some cases could be even more frequent. For the sake of simplicity, however, I am going to use one reading per day. The objective is to use PostgreSQL’s Window Functions for running analytics and that would not change with increasing the frequency of data points.
Here’s our sample data. Again, the number of table fields in a real world IoT temperature sensor would be higher, but our fields of interest in this case are restricted to timestamp of the temperature recording, device that reported it and the actual reading.
CREATE TABLE iot_temperature_sensor_data ( ts timestamp without time zone, device_id text, reading float );
and we add some random timeseries temperature sensor reading data for seven consecutive days
ts | device_id | reading ---------------------+----------------------------------+--------- 2017-06-01 00:00:00 | ff0d1c4fd33f8429b7e3f163753a9cb0 | 10 2017-06-02 00:00:00 | d125efa43a62af9f50c1a1edb733424d | 9 2017-06-03 00:00:00 | 0b4ee949cc1ae588dd092a23f794177c | 1 2017-06-04 00:00:00 | 8b9cef086e02930a808ee97b87b07b03 | 3 2017-06-05 00:00:00 | d94d599467d1d8d3d347d9e9df809691 | 6 2017-06-06 00:00:00 | a9c1e8f60f28935f3510d7d83ba54329 | 9 2017-06-07 00:00:00 | 6fb333bd151b4bcc684d21b248c06ca3 | 9
Some of the very obvious questions of course:
- What was the lowest temperature reading and when?
- What was the highest temperature reading and when?
Quite understandably, we can run Min() and MAX() on ‘reading’ column to get lowest and highest temperature readings:
SELECT MIN(reading), MAX(reading) FROM iot_temperature_sensor_data;
min | max -----+----- 1 | 10 (1 row)
While it is useful, it doesn’t tell us which corresponding dates and/or devices reported lowest and highest temperatures.
SELECT ts, device_id, reading FROM iot_temperature_sensor_data WHERE reading = ( SELECT MIN(reading) FROM iot_temperature_sensor_data);
ts | device_id | reading ---------------------+----------------------------------+--------- 2017-06-03 00:00:00 | 0b4ee949cc1ae588dd092a23f794177c | 1 (1 row)
So far so good!
However what about the questions like:
– Where does a given day rank in terms of temperature reading for the given week in a specified order?
– Let’s take a look to see how low or high the temperature was compared to previous or following day. So essentially we are looking to find the delta in temperature readings for a given day compared to one previous and/or following day.
This is where Window Functions come into play allowing us to compare and contrast values in relation to current row.
So if we wanted to find where a given day ranks in terms of its temperature reading with lowest temperature ranked at the top:
SELECT ts, device_id, reading, rank() OVER ( ORDER BY reading) FROM iot_temperature_sensor_data;
ts | device_id | reading | rank ---------------------+----------------------------------+---------+------ 2017-06-03 00:00:00 | 0b4ee949cc1ae588dd092a23f794177c | 1 | 1 2017-06-04 00:00:00 | 8b9cef086e02930a808ee97b87b07b03 | 3 | 2 2017-06-05 00:00:00 | d94d599467d1d8d3d347d9e9df809691 | 6 | 3 2017-06-02 00:00:00 | d125efa43a62af9f50c1a1edb733424d | 9 | 4 2017-06-06 00:00:00 | a9c1e8f60f28935f3510d7d83ba54329 | 9 | 4 2017-06-07 00:00:00 | 6fb333bd151b4bcc684d21b248c06ca3 | 9 | 4 2017-06-01 00:00:00 | ff0d1c4fd33f8429b7e3f163753a9cb0 | 10 | 7 (7 rows)
If we looked at the rank column, it looks good except that we notice rank 7 right after 4. This is because the next rank gets skipped because we have three rows with identical temperature reading of 9. What if I wanted not to skip ? PostgreSQL provides rank_dense() exactly to serve the same purpose.
SELECT ts, device_id, reading, dense_rank() OVER ( ORDER BY reading) FROM iot_temperature_sensor_data;
ts | device_id | reading | dense_rank ---------------------+----------------------------------+---------+------------ 2017-06-03 00:00:00 | 0b4ee949cc1ae588dd092a23f794177c | 1 | 1 2017-06-04 00:00:00 | 8b9cef086e02930a808ee97b87b07b03 | 3 | 2 2017-06-05 00:00:00 | d94d599467d1d8d3d347d9e9df809691 | 6 | 3 2017-06-02 00:00:00 | d125efa43a62af9f50c1a1edb733424d | 9 | 4 2017-06-06 00:00:00 | a9c1e8f60f28935f3510d7d83ba54329 | 9 | 4 2017-06-07 00:00:00 | 6fb333bd151b4bcc684d21b248c06ca3 | 9 | 4 2017-06-01 00:00:00 | ff0d1c4fd33f8429b7e3f163753a9cb0 | 10 | 5 (7 rows)
How about which days saw the maximum rise in temperature compared to previous day ?
SELECT ts, device_id, reading, reading - lag(reading, 1) OVER ( ORDER BY reading) AS diff FROM iot_temperature_sensor_data;
ts | device_id | reading | diff ---------------------+----------------------------------+---------+------ 2017-06-03 00:00:00 | 0b4ee949cc1ae588dd092a23f794177c | 1 | 2017-06-04 00:00:00 | 8b9cef086e02930a808ee97b87b07b03 | 3 | 2 2017-06-05 00:00:00 | d94d599467d1d8d3d347d9e9df809691 | 6 | 3 2017-06-02 00:00:00 | d125efa43a62af9f50c1a1edb733424d | 9 | 3 2017-06-06 00:00:00 | a9c1e8f60f28935f3510d7d83ba54329 | 9 | 0 2017-06-07 00:00:00 | 6fb333bd151b4bcc684d21b248c06ca3 | 9 | 0 2017-06-01 00:00:00 | ff0d1c4fd33f8429b7e3f163753a9cb0 | 10 | 1 (7 rows)
We see that the last column shows the rise in temperature vs previous day. A quick visual inspection would show that the maximum rise in temperature is 3 degrees on 2017-06-02 00:00:00 and then again on 2017-06-05 00:00:00. With a little bit of CTE magic, we could list the days that saw the maximum rise in temperature.
WITH temperature_data AS ( SELECT ts, device_id, reading, reading - lag(reading, 1) OVER ( ORDER BY reading) AS diff FROM iot_temperature_sensor_data ) SELECT ts, diff FROM temperature_data WHERE diff = ( SELECT MAX(diff) FROM temperature_data);
ts | diff ---------------------+------ 2017-06-05 00:00:00 | 3 2017-06-02 00:00:00 | 3
Here is a list of Window Functions that PostgreSQL and Postgres-BDR support:
- row_number()
- percent_rank()
- cume_dist()
- ntile(
num_buckets
integer
) - lead(
value
anyelement
[,offset
integer
[,default
anyelement
]]) - first_value(
value
any
) - last_value(
value
any
) - nth_value(
value
any
,nth
integer
)
For further reading, please refer to the PostgreSQL documentation on Window Functions in PostgreSQL.
Leave a Reply
Want to join the discussion?Feel free to contribute!