Tablesample and Other Methods for Getting Random Tuples
PostgreSQL’s TABLESAMPLE brings a few more advantages compared to other traditional ways for getting random tuples.
TABLESAMPLE
is a SQL SELECT clause and it provides two sampling methods which are SYSTEM
and BERNOULLI
. With the help of TABLESAMPLE
we can easily retrieve random rows from a table. For further reading about TABLESAMPLE you can check the previous blog post.
In this blog post we’ll talk about alternative ways of getting random rows. How people were selecting random rows before TABLESAMPLE
, what are the pros and cons of the other methods and what we gained with TABLESAMPLE
?
There are awesome blog posts about selecting random rows, you can start reading the following blog posts to gain a deep understanding of this topic.
My Thoughts of Getting Random Row
Getting Random Rows from a Database Table
Let’s compare the traditional ways of getting random rows from a table with the new ways provided by TABLESAMPLE.
Before the TABLESAMPLE
clause, there were 3 commonly used methods for randomly selecting rows from a table.
1- Order by random()
For testing purposes we need to create a table and put some data inside of it.
Let’s create ts_test table and insert 1M rows into it:
CREATE TABLE ts_test (
id SERIAL PRIMARY KEY,
title TEXT
);
INSERT INTO ts_test (title)
SELECT
'Record #' || i
FROM
generate_series(1, 1000000) i;
Considering the following SQL statement for selecting 10 random rows:
SELECT * FROM ts_test ORDER BY random() LIMIT 10;
Causes PostgreSQL to perform a full table scan and also ordering. Therefore this method is not preferred for tables with large number of rows because of performance reasons.
Let’s look into EXPLAIN ANALYZE
output of this query above:
random=# EXPLAIN ANALYZE SELECT * FROM ts_test ORDER BY random() LIMIT 10;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Limit (cost=33959.03..33959.05 rows=10 width=36) (actual time=1956.786..1956.807 rows=10 loops=1)
-> Sort (cost=33959.03..35981.18 rows=808863 width=36) (actual time=1956.780..1956.789 rows=10 loops=1)
Sort Key: (random())
Sort Method: top-N heapsort Memory: 25kB
-> Seq Scan on ts_test (cost=0.00..16479.79 rows=808863 width=36) (actual time=0.276..1001.109 rows=1000000 loops=1)
Planning time: 1.434 ms
Execution time: 1956.900 ms
(7 rows)
As EXPLAIN ANALYZE
points out, selecting 10 out of 1M rows took nearly 2 seconds. The query also used the output of random()
as the sort key to order results. Sorting seems to be the most time consuming task here. Let’s execute this with scenario using TABLESAMPLE
.
In PostgreSQL 9.5, for getting exact number of rows randomly, we can use the SYSTEM_ROWS sampling method. Provided by the tsm_system_rows
contrib module, it allows us to specify how many rows should be returned by sampling. Normally only the requested percentage could be specified with TABLESAMPLE SYSTEM
and BERNOULLI
methods.
First, we should create tsm_system_rows
extension for using this method since both TABLESAMPLE SYSTEM
and TABLESAMPLE BERNOULLI
methods do not guarantee that the provided percentage will result in the expected number of rows. Please check the previous TABLESAMPLE post to recall why they work like that.
Let’s start by creating tsm_system_rows
extension:
random=# CREATE EXTENSION tsm_system_rows;
CREATE EXTENSION
Now let’s compare “ORDER BY random()
” EXPLAIN ANALYZE
output with the EXPLAIN ANALYZE
output of tsm_system_rows
query which returns 10 random rows out of a 1M row table.
random=# EXPLAIN ANALYZE SELECT * FROM ts_test TABLESAMPLE SYSTEM_ROWS(10);
QUERY PLAN
-------------------------------------------------------------------------------------------------------
Sample Scan on ts_test (cost=0.00..4.10 rows=10 width=18) (actual time=0.069..0.083 rows=10 loops=1)
Sampling: system_rows ('10'::bigint)
Planning time: 0.646 ms
Execution time: 0.159 ms
(4 rows)
The whole query took 0.159 ms. This sampling method is extremely fast comparing with the “ORDER BY random()
” method which took 1956.9 ms.
2- Compare with random()
The following SQL allows us to retrieve random rows with 10% probability
SELECT * FROM ts_test WHERE random() <= 0.1;
This method is faster than ordering by random because it doesn’t sort selected rows. It will return approximate percentage of rows from the table just like BERNOULLI
or SYSTEM
TABLESAMPLE
methods.
Let’s check the EXPLAIN ANALYZE
output of random()
query above:
random=# EXPLAIN ANALYZE SELECT * FROM ts_test WHERE random() <= 0.1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Seq Scan on ts_test (cost=0.00..21369.00 rows=333333 width=18) (actual time=0.089..280.483 rows=100014 loops=1)
Filter: (random() <= '0.1'::double precision)
Rows Removed by Filter: 899986
Planning time: 0.704 ms
Execution time: 367.527 ms
(5 rows)
The query took 367.5 ms. Much better than ORDER BY random()
. But it’s harder to control the exact row count. Let’s compare “random()
” EXPLAIN ANALYZE
output with the EXPLAIN ANALYZE
output of TABLESAMPLE BERNOULLI
query which return approximately 10% of random rows out of the 1M rows table.
random=# EXPLAIN ANALYZE SELECT * FROM ts_test TABLESAMPLE BERNOULLI (10);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Sample Scan on ts_test (cost=0.00..7369.00 rows=100000 width=18) (actual time=0.015..147.002 rows=100155 loops=1)
Sampling: bernoulli ('10'::real)
Planning time: 0.076 ms
Execution time: 239.289 ms
(4 rows)
We gave 10 as the parameter to BERNOULLI
because we need 10% of all records.
Here we can see that the BERNOULLI
method took 239.289 ms to execute. These two methods are quite similar in how they work, BERNOULLI
is slightly faster as the random selection is all done on lower level. One advantage of using BERNOULLI
compared to WHERE random() <= 0.1
is the REPEATABLE
clause which we described in previous TABLESAMPLE
post.
3- Extra column with a random value
This method suggests adding a new column with randomly assigned values, adding an index to it, performing sorting by that column, and optionally updating their values periodically to randomize the distribution.
This strategy allows mostly repeatable random sampling. It works much faster than the first method but it takes an effort to setup for the first time, and results in a performance cost in insert, update, and delete operations.
Let’s apply this method on our ts_test
table.
First, we will add a new column called randomcolumn
with the double precision type because PostgreSQL’s random()
function returns a number in double precision.
random=# ALTER TABLE ts_test ADD COLUMN randomcolumn DOUBLE PRECISION;
ALTER TABLE
Then we’ll update the new column using random()
function.
random=# \timing
Timing is on.
random=# BEGIN;
BEGIN
Time: 2.071 ms
random=# UPDATE ts_test SET randomcolumn = RANDOM();
UPDATE 1000000
Time: 8483.741 ms
random=# COMMIT;
COMMIT
Time: 2.615 ms
This method has an initial cost for creating a new column and populating that new column with random (0.0-1.0) values, but it’s a one-time cost. In this example, for 1M rows, it took almost 8.5 seconds.
Let’s try to observe if our results are reproducible by querying 100 rows with our new method:
random=# SELECT * FROM ts_test ORDER BY randomcolumn LIMIT 100;
-------+---------------+----------------------
13522 | Record #13522 | 6.4261257648468e-08
671584 | Record #671584 | 6.4261257648468e-07
714012 | Record #714012 | 1.95764005184174e-06
162016 | Record #162016 | 3.44449654221535e-06
106867 | Record #106867 | 3.66196036338806e-06
865669 | Record #865669 | 3.96883115172386e-06
927 | Record #927 | 4.65428456664085e-06
526017 | Record #526017 | 4.65987250208855e-06
98338 | Record #98338 | 4.91179525852203e-06
769625 | Record #769625 | 4.91319224238396e-06
...
462484 | Record #462484 | 9.83504578471184e-05
(100 rows)
When we execute the query above, we mostly get the same result set but this is not guaranteed because we used random()
function for populating randomcolumn
values and in this case more than one column might have the same value. To be sure that we get the same results for each time it runs we should improve our query by adding the ID column to ORDER BY
clause, in this way we ensure that ORDER BY
clause specifies a unique set of rows, because id column has primary key index on it.
Now let’s run the improved query below:
random=# SELECT * FROM ts_test ORDER BY randomcolumn, id LIMIT 100;
id | title | randomcolumn
--------+----------------+----------------------
13522 | Record #13522 | 6.4261257648468e-08
671584 | Record #671584 | 6.4261257648468e-07
714012 | Record #714012 | 1.95764005184174e-06
162016 | Record #162016 | 3.44449654221535e-06
106867 | Record #106867 | 3.66196036338806e-06
865669 | Record #865669 | 3.96883115172386e-06
927 | Record #927 | 4.65428456664085e-06
526017 | Record #526017 | 4.65987250208855e-06
98338 | Record #98338 | 4.91179525852203e-06
769625 | Record #769625 | 4.91319224238396e-06
...
462484 | Record #462484 | 9.83504578471184e-05
(100 rows)
Now we’re sure that we can get reproducible random sample by using this method.
Tip: If you want to use this method for getting random rows, it’s better to create an index for
randomcolumn
. In this blog post we’re focusing to make a comparison withTABLESAMPLE
methods and the other methods. We’re not creating index for this blog post and we’ll compareEXPLAIN ANALYZE
outputs instead of improving all the other methods.
It’s time to see EXPLAIN ANALYZE
results of our final query:
random=# EXPLAIN ANALYZE SELECT * FROM ts_test ORDER BY randomcolumn, id LIMIT 100;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Limit (cost=55571.28..55571.53 rows=100 width=26) (actual time=1951.811..1952.039 rows=100 loops=1)
-> Sort (cost=55571.28..58071.28 rows=1000000 width=26) (actual time=1951.804..1951.891 rows=100 loops=1)
Sort Key: randomcolumn, id
Sort Method: top-N heapsort Memory: 32kB
-> Seq Scan on ts_test (cost=0.00..17352.00 rows=1000000 width=26) (actual time=0.058..995.011 rows=1000000 loops=1)
Planning time: 0.481 ms
Execution time: 1952.215 ms
(7 rows)
For comparing this method with TABLESAMPLE
methods, let’s pick SYSTEM
method with REPEATABLE
option, since this method gives us reproducible results.
TABLESAMPLE
SYSTEM
method basically does block/page level sampling; it reads and returns random pages from disk. Thus it provides randomness at the page level instead of the tuple level. That’s why it’s hard to control retrieved rows count exactly. If there are no pages picked we might not get any result. The page level sampling is also prone to clustering effect.
TABLESAMPLE
SYNTAX is same for BERNOULLI
and SYSTEM
methods, we’ll specify the percentage like we did in BERNOULLI
method.
Quick Reminder: SYNTAX
SELECT select_expression
FROM table_name
TABLESAMPLE sampling_method ( argument [, ...] ) [ REPEATABLE ( seed ) ]
...
In order to retrieve approximately 100 rows, we need to request 100 * 100 / 1 000 000 = 0.01% of the table’s rows. So our percentage will be 0.01.
Before starting to query, let’s remember how REPEATABLE
works. Basically we’ll choose a seed parameter and we’ll get the same results for each time when we use the same seed with the previous query. If we provide a different seed the query will produce a quite different result set.
Let’s try to see the results with querying.
random=# Select * from ts_test tablesample system (0.01) repeatable (60);
id | title | randomcolumn
--------+----------------+---------------------
659598 | Record #659598 | 0.964113113470376
659599 | Record #659599 | 0.531714483164251
659600 | Record #659600 | 0.477636905387044
659601 | Record #659601 | 0.861940925940871
659602 | Record #659602 | 0.545977566856891
659603 | Record #659603 | 0.326795583125204
659604 | Record #659604 | 0.469275736715645
659605 | Record #659605 | 0.734953186474741
659606 | Record #659606 | 0.41613544523716
...
659732 | Record #659732 | 0.893704727292061
659733 | Record #659733 | 0.847225237637758
(136 rows)
We get 136 rows, as you can consider this number depends on how many rows are stored in a single page.
Now let’s try to run the same query with the same seed number:
random=# Select * from ts_test tablesample system (0.01) repeatable (60);
id | title | randomcolumn
--------+----------------+---------------------
659598 | Record #659598 | 0.964113113470376
659599 | Record #659599 | 0.531714483164251
659600 | Record #659600 | 0.477636905387044
659601 | Record #659601 | 0.861940925940871
659602 | Record #659602 | 0.545977566856891
659603 | Record #659603 | 0.326795583125204
659604 | Record #659604 | 0.469275736715645
659605 | Record #659605 | 0.734953186474741
659606 | Record #659606 | 0.41613544523716
...
659732 | Record #659732 | 0.893704727292061
659733 | Record #659733 | 0.847225237637758
(136 rows)
We get the same result set thanks to REPEATABLE
option. Now we can compare TABLESAMPLE SYSTEM
method with random column method by looking the EXPLAIN ANALYZE
output.
random=# EXPLAIN ANALYZE SELECT * FROM ts_test TABLESAMPLE SYSTEM (0.01) REPEATABLE (60);
QUERY PLAN
---------------------------------------------------------------------------------------------------------
Sample Scan on ts_test (cost=0.00..5.00 rows=100 width=26) (actual time=0.091..0.230 rows=136 loops=1)
Sampling: system ('0.01'::real) REPEATABLE ('60'::double precision)
Planning time: 0.323 ms
Execution time: 0.398 ms
(4 rows)
SYSTEM
method uses sample scan and it’s extremely fast. The only backward of this method is that we cannot guarantee that the provided percentage will result in the expected number of rows.
Conclusion
In this blog post we compared standard TABLESAMPLE
(SYSTEM
, BERNOULLI
) and the tsm_system_rows
module with the older random methods.
Here you can review the findings quickly:
ORDER BY random()
is slow because of sortingrandom() <= 0.1
is similar toBERNOULLI
method- Adding column with random value needs initial work and might lead performance problems
SYSTEM
is fast but it’s hard to control number of rowstsm_system_rows
is fast and can control number of the rows
As a result tsm_system_rows
beats any other method for selecting just few random rows.
But the real winner is definitely TABLESAMPLE
itself. Thanks to its extensibility, custom extensions (i.e. tsm_system_rows
, tsm_system_time
) can be developed using TABLESAMPLE
method functions.
Developer note: More information about how to write custom sampling methods can be found in Writing A Table Sampling Method chapter of PostgreSQL documentation.
Note to the future: We’ll discuss AXLE Project and tsm_system_time extension in our next TABLESAMPLE
blog post.
Leave a Reply
Want to join the discussion?Feel free to contribute!