Tablesample In PostgreSQL 9.5

August 7, 2015, by Gulcin Yildirim

PostgreSQL 9.5 introduces support for TABLESAMPLE, an SQL SELECT clause that returns a random sample from a table.

SQL:2003 defines two sampling methods: SYSTEM and BERNOULLI.

The SYSTEM method uses random IO whereas BERNOULLI uses sequential IO. SYSTEM is faster, but BERNOULLI gives us a much better random distribution because each tuple (row) has the same probability of being selected.

The SYSTEM sampling method does block/page level sampling; it reads and returns random pages from disk. Thus it provides randomness at the page level instead of the tuple level. The BERNOULLI sampling method does a sequential scan on the whole table and picks tuples randomly.

[Figure: Sampling methods]

Syntax

The TABLESAMPLE clause was defined in the SQL:2003 standard. The syntax implemented by PostgreSQL 9.5 is as follows:

SELECT select_expression
FROM table_name
TABLESAMPLE sampling_method ( argument [, ...] ) [ REPEATABLE ( seed ) ]
...

Although it cannot be used in UPDATE or DELETE statements, TABLESAMPLE can be used in any join query and together with aggregation.
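For example, a sampled table can feed a join and an aggregate directly. A sketch, where the orders and customers tables and their columns are hypothetical:

-- Average order amount per region, estimated from a 1% row sample
SELECT c.region, avg(o.amount) AS avg_amount
FROM orders o TABLESAMPLE BERNOULLI (1)
JOIN customers c ON c.id = o.customer_id
GROUP BY c.region;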

PostgreSQL supports both sampling methods required by the standard, but the implementation allows for custom sampling methods to be installed as extensions.

Both SYSTEM and BERNOULLI take as an argument the percentage of rows in table_name that are to be returned.

If table_name has N rows and the requested percentage (a floating point value between 0 and 100) is S then:

  • SYSTEM selects each page with probability S/100, so on average about (N * S/100) rows are returned.
  • BERNOULLI selects each individual row with probability S/100, likewise returning about (N * S/100) rows on average.

SYSTEM returns all rows in each selected page, but the number of rows per page depends on the size of each row, so the requested percentage is not guaranteed to yield exactly the expected number of rows.

Let’s suppose we have a table called ts_test with 10 million rows. We can find the number of pages this table occupies using the following SQL:

SELECT relpages FROM pg_class WHERE relname = 'ts_test';

And the output will be similar to:

 relpages
----------
   93457
(1 row)

So if 10 million rows are stored in 93 457 pages, each page can be expected to hold roughly 10 000 000 / 93 457 ≈ 107 rows.

In order to retrieve approximately 1000 rows, we need to request 1000 * 100 / 10 000 000 = 0.01% of the table’s rows.
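Rather than computing this by hand, the percentage can be derived from the planner's statistics. A sketch using pg_class.reltuples (run ANALYZE ts_test first so the estimate is populated):

-- Percentage needed to sample approximately 1000 rows
SELECT 1000 * 100.0 / reltuples AS percentage
FROM pg_class
WHERE relname = 'ts_test';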

Note, however, that pages may not be filled to capacity with live tuples. If you have recently deleted a large number of rows, you may need to run VACUUM FULL to reclaim the unused space and reduce the table's relpages.
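To see whether dead tuples are inflating the page count, a sketch using the statistics views:

-- Live vs dead tuples for ts_test
SELECT n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'ts_test';

-- VACUUM FULL rewrites the table and shrinks relpages,
-- but takes an exclusive lock while it runs
VACUUM FULL ts_test;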

On the other hand, the BERNOULLI method works at the row level, so it has a better chance of returning a percentage close to the requested percentage.
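A quick way to see the difference is to compare repeated counts from each method on ts_test; a sketch (actual counts will vary from run to run):

-- SYSTEM: counts fluctuate noticeably between runs,
-- because whole pages are returned
SELECT count(*) FROM ts_test TABLESAMPLE SYSTEM (0.01);

-- BERNOULLI: counts stay close to the requested 0.01% (~1000 rows of 10M)
SELECT count(*) FROM ts_test TABLESAMPLE BERNOULLI (0.01);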

The seed parameter of the optional REPEATABLE clause sets the random seed for the sampling process. Given the same seed, PostgreSQL normally returns the same result set; however, it may return different rows if the table is modified in a way that changes the distribution of tuples among pages.

One important thing to consider is that any conditions in the WHERE clause are applied after the sampling: the requested fraction of rows is picked first, and only then filtered using the conditions expressed in the WHERE clause.
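A sketch of the consequence on our 10 million row table: roughly 1000 rows are sampled first, and only the ~5% of them with id < 500000 survive the filter, so the query returns on the order of 50 rows rather than 1000 matching rows.

-- WHERE is applied after the sample is taken
SELECT * FROM ts_test TABLESAMPLE BERNOULLI (0.01)
WHERE id < 500000;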

Examples

Let’s create a simple table with two columns:

CREATE TABLE ts_test (
  id SERIAL PRIMARY KEY,
  title TEXT
);

And insert 1M rows into it:

INSERT INTO ts_test (title)
SELECT
    'Record #' || i
FROM
    generate_series(1, 1000000) i;
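The tests below are repeated at 10 000, 1 million, and 10 million rows. A sketch for reloading the table at a different size (here 10 million):

-- Reload ts_test with 10 million rows; RESTART IDENTITY resets the
-- id sequence so ids start from 1 again
TRUNCATE ts_test RESTART IDENTITY;
INSERT INTO ts_test (title)
SELECT
    'Record #' || i
FROM
    generate_series(1, 10000000) i;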

SYSTEM Table Sampling

In this example, our aim is to retrieve the same number of sample rows (1000) at each table size. We will observe how the time it takes to retrieve 1000 samples with the SYSTEM sampling method grows as the table does.

The SQL template for this method is like the following:

SELECT * FROM ts_test TABLESAMPLE SYSTEM(percentage)

The percentage parameter will be updated as the number of rows in the ts_test table increases.

For 10000 rows, 1000 samples means that we should pick 10% of the table. This is what the resulting plan looks like:

tablesample_test=# explain analyze select * from ts_test tablesample system(10);
                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Sample Scan (system) on ts_test  (cost=0.00..34.00 rows=1000 width=16) (actual time=0.022..0.230 rows=1127 loops=1)
 Planning time: 131.779 ms
 Execution time: 0.317 ms
(3 rows)

The planner uses a sample scan node for this query, and PostgreSQL returns 1127 rows (rather than the expected 1000) for this specific set of picked pages. Subsequent executions may return slightly different numbers of samples as well.

The execution time is observed as 0.317 ms.

For 1 000 000 rows, the percentage is 0.1% this time.

tablesample_test=# explain analyze select * from ts_test tablesample system(0.1);
                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Sample Scan (system) on ts_test  (cost=0.00..34.00 rows=1000 width=18) (actual time=0.033..0.296 rows=1099 loops=1)
 Planning time: 15.186 ms
 Execution time: 0.377 ms
(3 rows)

The execution time has increased only marginally.

Finally, for 10 000 000 rows the EXPLAIN statement shows the following:

tablesample_test=# explain analyze select * from ts_test tablesample system(0.01);
                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Sample Scan (system) on ts_test  (cost=0.00..34.00 rows=1000 width=18) (actual time=0.115..0.914 rows=1099 loops=1)
 Planning time: 0.195 ms
 Execution time: 0.995 ms
(3 rows)

The time cost is roughly triple that of the 1 000 000 row table. However, for a table with 10 million rows, this increase in execution time may be acceptable in a range of scenarios.

BERNOULLI Table Sampling

In this example, we have a similar goal but this time we will be using the BERNOULLI sampling method.

The SQL template for this is similar as well:

SELECT * FROM ts_test TABLESAMPLE BERNOULLI(percentage)

For 10% of a 10 000 row table:

tablesample_test=# explain analyze select * from ts_test tablesample bernoulli(10);
                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Sample Scan (bernoulli) on ts_test  (cost=0.00..73.00 rows=1000 width=16) (actual time=0.006..0.758 rows=1003 loops=1)
 Planning time: 0.151 ms
 Execution time: 0.820 ms
(3 rows)

Compared to SYSTEM's 1127 rows, BERNOULLI's returned sample count of 1003 is much closer to the requested 1000. This reflects BERNOULLI's row-level, sequential scanning behaviour.

The execution time, however, is more than double, reflecting the performance difference between BERNOULLI's sequential IO and SYSTEM's random IO.

For 1 000 000 rows, EXPLAIN reveals a significant increase in the execution time:

tablesample_test=# explain analyze select * from ts_test tablesample bernoulli(0.1);
                                                    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Sample Scan (bernoulli) on ts_test  (cost=0.00..6378.00 rows=1000 width=16) (actual time=0.055..78.135 rows=1042 loops=1)
 Planning time: 0.152 ms
 Execution time: 78.219 ms
(3 rows)

A 100 times increase in the row count resulted in a 95 times increase in the execution time.

We can also observe this behavior for 10M rows:

tablesample_test=# explain analyze select * from ts_test tablesample bernoulli(0.01);
                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Sample Scan (bernoulli) on ts_test  (cost=0.00..63702.00 rows=1000 width=19) (actual time=0.707..779.128 rows=998 loops=1)
 Planning time: 0.094 ms
 Execution time: 779.244 ms
(3 rows)

Both the row count and the execution time increased 10 times. Based on these three examples, the time BERNOULLI sampling takes appears to grow roughly linearly with the number of rows.

Repeatable Example

To demonstrate how REPEATABLE works, we first choose a seed. Suppose we select 200 as the seed for the random number generator:

SELECT * FROM ts_test TABLESAMPLE SYSTEM (0.001) REPEATABLE (200);

The query above gives us a sample distribution similar to the following:

 id  |    title
-----+-------------
 741 | Record #741
 742 | Record #742
 743 | Record #743
 ... | ...

Now suppose that no writes are done on ts_test in the meantime. Let's run the exact same query a second time.

SELECT * FROM ts_test TABLESAMPLE SYSTEM (0.001) REPEATABLE (200);

The result set will be the same as above because the given seed, 200, was kept the same.

 id  |    title
-----+-------------
 741 | Record #741
 742 | Record #742
 743 | Record #743
 ... | ...

However, providing a different seed will most likely produce a quite different result set.

SELECT * FROM ts_test TABLESAMPLE SYSTEM (0.001) REPEATABLE (9000);

 id  |    title
-----+-------------
 556 | Record #556
 557 | Record #557
 558 | Record #558
 ... | ...

Extensibility

The TABLESAMPLE implementation can be extended with custom sampling methods. PostgreSQL 9.5 has an API for sampling functions, and includes two extensions: tsm_system_rows and tsm_system_time.
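As a quick taste, here is a sketch of the two bundled extensions: SYSTEM_ROWS takes a target row count and SYSTEM_TIME a time limit in milliseconds (neither supports REPEATABLE):

CREATE EXTENSION tsm_system_rows;
CREATE EXTENSION tsm_system_time;

-- Return about 1000 rows regardless of table size
SELECT * FROM ts_test TABLESAMPLE SYSTEM_ROWS (1000);

-- Read as many rows as possible within 100 milliseconds
SELECT * FROM ts_test TABLESAMPLE SYSTEM_TIME (100);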

We will discuss those in greater detail in future blog posts.

Advantages

The premise of Big Data is that you must have detailed data to answer important business questions. But this is not always true: analysing trends and detecting big changes is possible with a statistically relevant sample of the data.

This is also true for exploratory research where TABLESAMPLE can help to quickly identify areas for further study.

When table sizes reach terabytes, getting statistically relevant output from the data is the main issue: no matter what optimizations you apply, reading terabytes of data off disk takes time. TABLESAMPLE makes sampling much easier, and thanks to its extensibility it is possible to ensure that the selected sample is relevant for the given use case.
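For example, a common pattern is to scale an aggregate computed on the sample back up by the sampling fraction. A sketch, with a hypothetical events table:

-- A 0.1% sample scaled up by a factor of 1000 to estimate the
-- total number of matching rows
SELECT count(*) * 1000 AS estimated_errors
FROM events TABLESAMPLE SYSTEM (0.1)
WHERE status = 'error';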

Tags: 9.5, ANALYZE, bernoulli, block, extensibility, page, pg_class, planner, PostgreSQL 9.5, query plan, random, random distribution, relpages, REPEATABLE, row, sample, sample scan, sampling method, SELECT, sql, SQL:2003, system, table sampling, tablesample, tuple