Tablesample and Other Methods for Getting Random Tuples

September 20, 2015 / by Gulcin Yildirim

PostgreSQL’s TABLESAMPLE brings a few advantages over traditional ways of getting random tuples.

TABLESAMPLE is a clause of the SQL SELECT statement that provides two sampling methods: SYSTEM and BERNOULLI. With the help of TABLESAMPLE we can easily retrieve random rows from a table. For further reading about TABLESAMPLE you can check the previous blog post.
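
For instance, a query of the following shape returns roughly 5% of a table’s rows (my_table is just a placeholder name for illustration, not a table from this post):

SELECT * FROM my_table TABLESAMPLE BERNOULLI (5);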

In this blog post we’ll talk about alternative ways of getting random rows: how people selected random rows before TABLESAMPLE, what the pros and cons of those methods are, and what we gained with TABLESAMPLE.

There are awesome blog posts about selecting random rows; you can start with the following to gain a deeper understanding of the topic.

My Thoughts of Getting Random Row

Getting Random Rows from a Database Table

random_agg()

Let’s compare the traditional ways of getting random rows from a table with the new ways provided by TABLESAMPLE.

Before the TABLESAMPLE clause, there were three commonly used methods for randomly selecting rows from a table.

1- Order by random()

For testing purposes we need to create a table and put some data in it.

Let’s create a ts_test table and insert 1M rows into it:

CREATE TABLE ts_test (
  id SERIAL PRIMARY KEY,
  title TEXT
);

INSERT INTO ts_test (title)
SELECT
    'Record #' || i
FROM
    generate_series(1, 1000000) i;

Consider the following SQL statement for selecting 10 random rows:

SELECT * FROM ts_test ORDER BY random() LIMIT 10;

This query causes PostgreSQL to perform a full table scan and a sort. It is therefore not preferred for tables with a large number of rows, for performance reasons.

Let’s look at the EXPLAIN ANALYZE output of the query above:

random=# EXPLAIN ANALYZE SELECT * FROM ts_test ORDER BY random() LIMIT 10;
 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Limit (cost=33959.03..33959.05 rows=10 width=36) (actual time=1956.786..1956.807 rows=10 loops=1)
 -> Sort (cost=33959.03..35981.18 rows=808863 width=36) (actual time=1956.780..1956.789 rows=10 loops=1)
 Sort Key: (random())
 Sort Method: top-N heapsort Memory: 25kB
 -> Seq Scan on ts_test (cost=0.00..16479.79 rows=808863 width=36) (actual time=0.276..1001.109 rows=1000000 loops=1)
 Planning time: 1.434 ms
 Execution time: 1956.900 ms
(7 rows)

As EXPLAIN ANALYZE shows, selecting 10 out of 1M rows took nearly 2 seconds. The query used the output of random() as the sort key to order the results, and sorting appears to be the most time-consuming task here. Let’s run the same scenario using TABLESAMPLE.

In PostgreSQL 9.5, to get an exact number of random rows, we can use the SYSTEM_ROWS sampling method. Provided by the tsm_system_rows contrib module, it lets us specify exactly how many rows should be returned by sampling; normally only a requested percentage can be specified with the TABLESAMPLE SYSTEM and BERNOULLI methods.

We reach for this method because neither TABLESAMPLE SYSTEM nor TABLESAMPLE BERNOULLI guarantees that the provided percentage will result in the expected number of rows (please check the previous TABLESAMPLE post to recall why they work that way). Since tsm_system_rows ships as a contrib module, we first have to create its extension.

Let’s start by creating tsm_system_rows extension:

random=# CREATE EXTENSION tsm_system_rows;
CREATE EXTENSION

Now let’s compare the “ORDER BY random()” EXPLAIN ANALYZE output with that of a tsm_system_rows query which returns 10 random rows out of the 1M-row table.

random=# EXPLAIN ANALYZE SELECT * FROM ts_test TABLESAMPLE SYSTEM_ROWS(10);
 QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Sample Scan on ts_test (cost=0.00..4.10 rows=10 width=18) (actual time=0.069..0.083 rows=10 loops=1)
 Sampling: system_rows ('10'::bigint)
 Planning time: 0.646 ms
 Execution time: 0.159 ms
(4 rows)

The whole query took 0.159 ms. This sampling method is extremely fast compared with the “ORDER BY random()” method, which took 1956.9 ms.

2- Compare with random()

The following SQL allows us to retrieve random rows with 10% probability:

SELECT * FROM ts_test WHERE random() <= 0.1;

This method is faster than ordering by random() because it doesn’t sort the selected rows. It returns an approximate percentage of rows from the table, just like the BERNOULLI and SYSTEM TABLESAMPLE methods; the sketch below shows a quick way to observe this.
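
Assuming the ts_test table from above, running a count repeatedly makes the approximation visible: each execution should return a slightly different count, clustered around 100,000 (10% of 1M rows).

-- Run this a few times; the count varies from run to run.
SELECT count(*) FROM ts_test WHERE random() <= 0.1;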

Let’s check the EXPLAIN ANALYZE output of the random() query above:

random=# EXPLAIN ANALYZE SELECT * FROM ts_test WHERE random() <= 0.1;
 QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Seq Scan on ts_test (cost=0.00..21369.00 rows=333333 width=18) (actual time=0.089..280.483 rows=100014 loops=1)
 Filter: (random() <= '0.1'::double precision)
 Rows Removed by Filter: 899986
 Planning time: 0.704 ms
 Execution time: 367.527 ms
(5 rows)

The query took 367.5 ms, much better than ORDER BY random(), but it’s harder to control the exact row count this way. Let’s compare the “random()” EXPLAIN ANALYZE output with that of a TABLESAMPLE BERNOULLI query which returns approximately 10% of random rows out of the 1M-row table.

random=# EXPLAIN ANALYZE SELECT * FROM ts_test TABLESAMPLE BERNOULLI (10);
 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Sample Scan on ts_test  (cost=0.00..7369.00 rows=100000 width=18) (actual time=0.015..147.002 rows=100155 loops=1)
   Sampling: bernoulli ('10'::real)
 Planning time: 0.076 ms
 Execution time: 239.289 ms
(4 rows)

We gave 10 as the parameter to BERNOULLI because we need 10% of all records.

Here we can see that the BERNOULLI method took 239.289 ms to execute. The two methods are quite similar in how they work; BERNOULLI is slightly faster because the random selection is all done at a lower level. One advantage of BERNOULLI over WHERE random() <= 0.1 is the REPEATABLE clause, which we described in the previous TABLESAMPLE post. A quick sketch of that advantage follows.
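
With REPEATABLE, the same seed yields the same sample as long as the table is unchanged, something WHERE random() <= 0.1 cannot offer (the seed value 42 below is arbitrary):

-- Both queries return the identical sample on an unchanged table.
SELECT count(*) FROM ts_test TABLESAMPLE BERNOULLI (10) REPEATABLE (42);
SELECT count(*) FROM ts_test TABLESAMPLE BERNOULLI (10) REPEATABLE (42);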

3- Extra column with a random value

This method suggests adding a new column with randomly assigned values, adding an index on it, sorting by that column, and optionally updating its values periodically to re-randomize the distribution.

This strategy allows mostly repeatable random sampling. It works much faster than the first method, but it takes effort to set up the first time and adds a performance cost to insert, update, and delete operations.

Let’s apply this method on our ts_test table.

First, we will add a new column called randomcolumn of type double precision, because PostgreSQL’s random() function returns a number in double precision.

random=# ALTER TABLE ts_test ADD COLUMN randomcolumn DOUBLE PRECISION;
ALTER TABLE

Then we’ll update the new column using the random() function.

random=# \timing
Timing is on.
random=# BEGIN;
BEGIN
Time: 2.071 ms
random=# UPDATE ts_test SET randomcolumn = RANDOM();
UPDATE 1000000
Time: 8483.741 ms
random=# COMMIT;
COMMIT
Time: 2.615 ms

This method has an initial cost for creating the new column and populating it with random (0.0-1.0) values, but it’s a one-time cost. In this example, for 1M rows, it took almost 8.5 seconds.
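
Note that this setup leaves newly inserted rows with a NULL randomcolumn. One way to handle that, sketched here as an assumption rather than part of the post’s method, is a column default so new rows are populated automatically:

-- New rows get a random value on insert; existing rows are unaffected.
ALTER TABLE ts_test ALTER COLUMN randomcolumn SET DEFAULT random();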

Let’s check whether our results are reproducible by querying 100 rows with our new method:

random=# SELECT * FROM ts_test ORDER BY randomcolumn LIMIT 100;
 id | title | randomcolumn
--------+----------------+----------------------
 13522  | Record #13522  | 6.4261257648468e-08
 671584 | Record #671584 | 6.4261257648468e-07
 714012 | Record #714012 | 1.95764005184174e-06
 162016 | Record #162016 | 3.44449654221535e-06
 106867 | Record #106867 | 3.66196036338806e-06
 865669 | Record #865669 | 3.96883115172386e-06
 927    | Record #927    | 4.65428456664085e-06
 526017 | Record #526017 | 4.65987250208855e-06
 98338  | Record #98338  | 4.91179525852203e-06
 769625 | Record #769625 | 4.91319224238396e-06
 ...
 462484 | Record #462484 | 9.83504578471184e-05
 (100 rows)

When we execute the query above we mostly get the same result set, but this is not guaranteed, because we used the random() function to populate the randomcolumn values and more than one row might end up with the same value. To be sure that we get the same results every time, we should improve our query by adding the id column to the ORDER BY clause; this way we ensure that the ORDER BY clause specifies a unique set of rows, because the id column has a primary key index on it.

Now let’s run the improved query below:

random=# SELECT * FROM ts_test ORDER BY randomcolumn, id LIMIT 100;
 id | title | randomcolumn
--------+----------------+----------------------
 13522  | Record #13522  | 6.4261257648468e-08
 671584 | Record #671584 | 6.4261257648468e-07
 714012 | Record #714012 | 1.95764005184174e-06
 162016 | Record #162016 | 3.44449654221535e-06
 106867 | Record #106867 | 3.66196036338806e-06
 865669 | Record #865669 | 3.96883115172386e-06
 927    | Record #927    | 4.65428456664085e-06
 526017 | Record #526017 | 4.65987250208855e-06
 98338  | Record #98338  | 4.91179525852203e-06
 769625 | Record #769625 | 4.91319224238396e-06 
 ...
 462484 | Record #462484 | 9.83504578471184e-05
 (100 rows)

Now we’re sure that we can get a reproducible random sample by using this method.

Tip: If you want to use this method for getting random rows, it’s better to create an index on randomcolumn (the statement would look like the sketch below). In this blog post we’re focusing on comparing the TABLESAMPLE methods with the other methods, so we’re not creating the index here, and we’ll compare EXPLAIN ANALYZE outputs instead of tuning each of the other methods.
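
For reference, the index from the tip would be created like this (the index name is just an illustrative choice):

CREATE INDEX ts_test_randomcolumn_idx ON ts_test (randomcolumn);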

It’s time to see EXPLAIN ANALYZE results of our final query:

random=# EXPLAIN ANALYZE SELECT * FROM ts_test ORDER BY randomcolumn, id LIMIT 100;
 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Limit (cost=55571.28..55571.53 rows=100 width=26) (actual time=1951.811..1952.039 rows=100 loops=1)
 -> Sort (cost=55571.28..58071.28 rows=1000000 width=26) (actual time=1951.804..1951.891 rows=100 loops=1)
 Sort Key: randomcolumn, id
 Sort Method: top-N heapsort Memory: 32kB
 -> Seq Scan on ts_test (cost=0.00..17352.00 rows=1000000 width=26) (actual time=0.058..995.011 rows=1000000 loops=1)
 Planning time: 0.481 ms
 Execution time: 1952.215 ms
(7 rows)

For comparing this method with the TABLESAMPLE methods, let’s pick the SYSTEM method with the REPEATABLE option, since it gives us reproducible results.

The TABLESAMPLE SYSTEM method basically does block/page-level sampling: it reads and returns random pages from disk, so it provides randomness at the page level instead of the tuple level. That’s why it’s hard to control the retrieved row count exactly; if no pages are picked, we might not get any result at all. Page-level sampling is also prone to clustering effects.

The TABLESAMPLE syntax is the same for the BERNOULLI and SYSTEM methods; we’ll specify the percentage as we did with the BERNOULLI method.

Quick Reminder: SYNTAX

SELECT select_expression
FROM table_name
TABLESAMPLE sampling_method ( argument [, ...] ) [ REPEATABLE ( seed ) ]
...

In order to retrieve approximately 100 rows, we need to request 100 * 100 / 1,000,000 = 0.01% of the table’s rows. So our percentage will be 0.01.

Before starting to query, let’s remember how REPEATABLE works: we choose a seed parameter, and we get the same results each time we use the same seed with the same query. If we provide a different seed, the query will produce a quite different result set, as the sketch below illustrates.
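
Here is that idea in query form; the second seed (200) is an arbitrary value chosen for illustration:

-- Same seed: the same sample every time (on an unchanged table).
SELECT * FROM ts_test TABLESAMPLE SYSTEM (0.01) REPEATABLE (60);
-- Different seed: almost certainly a different set of pages and rows.
SELECT * FROM ts_test TABLESAMPLE SYSTEM (0.01) REPEATABLE (200);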

Let’s see the actual results by querying the table.

random=# Select * from ts_test tablesample system (0.01) repeatable (60);
 id | title | randomcolumn
--------+----------------+---------------------
 659598 | Record #659598 | 0.964113113470376
 659599 | Record #659599 | 0.531714483164251
 659600 | Record #659600 | 0.477636905387044
 659601 | Record #659601 | 0.861940925940871
 659602 | Record #659602 | 0.545977566856891
 659603 | Record #659603 | 0.326795583125204
 659604 | Record #659604 | 0.469275736715645
 659605 | Record #659605 | 0.734953186474741
 659606 | Record #659606 | 0.41613544523716
 ...
 659732 | Record #659732 | 0.893704727292061
 659733 | Record #659733 | 0.847225237637758
 (136 rows)

We get 136 rows; as you might guess, this number depends on how many rows are stored in a single page.
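
If you want to sanity-check that figure, the planner statistics in pg_class give an estimate of the average number of rows per 8kB page (the exact value depends on row width, fill factor, and when the table was last analyzed):

-- reltuples / relpages approximates the average rows per page.
SELECT relname, reltuples, relpages,
       reltuples / relpages AS rows_per_page
FROM pg_class
WHERE relname = 'ts_test';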

Now let’s try to run the same query with the same seed number:

random=# Select * from ts_test tablesample system (0.01) repeatable (60);
 id | title | randomcolumn
--------+----------------+---------------------
 659598 | Record #659598 | 0.964113113470376
 659599 | Record #659599 | 0.531714483164251
 659600 | Record #659600 | 0.477636905387044
 659601 | Record #659601 | 0.861940925940871
 659602 | Record #659602 | 0.545977566856891
 659603 | Record #659603 | 0.326795583125204
 659604 | Record #659604 | 0.469275736715645
 659605 | Record #659605 | 0.734953186474741
 659606 | Record #659606 | 0.41613544523716 
 ...
 659732 | Record #659732 | 0.893704727292061
 659733 | Record #659733 | 0.847225237637758
 (136 rows)

We get the same result set, thanks to the REPEATABLE option. Now we can compare the TABLESAMPLE SYSTEM method with the random column method by looking at the EXPLAIN ANALYZE output.

random=# EXPLAIN ANALYZE SELECT * FROM ts_test TABLESAMPLE SYSTEM (0.01) REPEATABLE (60);
 QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Sample Scan on ts_test (cost=0.00..5.00 rows=100 width=26) (actual time=0.091..0.230 rows=136 loops=1)
 Sampling: system ('0.01'::real) REPEATABLE ('60'::double precision)
 Planning time: 0.323 ms
 Execution time: 0.398 ms
(4 rows)

The SYSTEM method uses a sample scan and is extremely fast. The only drawback of this method is that we cannot guarantee that the provided percentage will result in the expected number of rows.

Conclusion

In this blog post we compared the standard TABLESAMPLE methods (SYSTEM, BERNOULLI) and the tsm_system_rows module with the older random methods.

Here you can review the findings quickly:

  • ORDER BY random() is slow because of sorting
  • random() <= 0.1 is similar to the BERNOULLI method
  • Adding a column with random values requires initial work and may lead to performance problems
  • SYSTEM is fast, but it’s hard to control the number of rows returned
  • tsm_system_rows is fast and can control the number of rows returned

As a result, tsm_system_rows beats any other method for selecting just a few random rows.

But the real winner is definitely TABLESAMPLE itself. Thanks to its extensibility, custom extensions (e.g. tsm_system_rows, tsm_system_time) can be developed using TABLESAMPLE method functions.
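
As a small taste of that extensibility (we’ll cover it properly in the next post), the tsm_system_time contrib module adds a time-based method that reads as many pages as it can within the given number of milliseconds:

CREATE EXTENSION tsm_system_time;
-- Return whatever rows can be sampled in roughly 100 milliseconds.
SELECT * FROM ts_test TABLESAMPLE SYSTEM_TIME (100);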

Developer note: More information about how to write custom sampling methods can be found in the Writing A Table Sampling Method chapter of the PostgreSQL documentation.

Note to the future: We’ll discuss the AXLE Project and the tsm_system_time extension in our next TABLESAMPLE blog post.
