On the usefulness of expression indexes

April 19, 2016, by Tomas Vondra

When teaching PostgreSQL training courses, both on basics and on advanced topics, I often find that the attendees have very little idea how powerful expression indexes can be (if they are aware of them at all). So let me give you a brief overview.

So, let’s say we have a table with a range of timestamps (yes, the generate_series function can generate dates):

CREATE TABLE t AS
SELECT d, repeat(md5(d::text), 10) AS padding
  FROM generate_series(timestamp '1900-01-01',
                       timestamp '2100-01-01',
                       interval '1 day') s(d);
VACUUM ANALYZE t;

The table also includes a padding column, to make it a bit larger. Now, let’s do a simple range query, selecting just one month from the ~200 years included in the table. If you run EXPLAIN on the query, you’ll see something like this:

EXPLAIN SELECT * FROM t WHERE d BETWEEN '2001-01-01' AND '2001-02-01';

                               QUERY PLAN
------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..4416.75 rows=32 width=332)
   Filter: ((d >= '2001-01-01 00:00:00'::timestamp without time zone)
        AND (d <= '2001-02-01 00:00:00'::timestamp without time zone))
(2 rows)

and on my laptop, this runs in ~20ms. Not bad, considering this has to walk through the whole table with ~75k rows.
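
For the record, it’s easy to verify the exact row count (one row per day, with both endpoints of the series included):

SELECT count(*) FROM t;
-- 73050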

But let’s create an index on the timestamp column (all indexes here are the default type, i.e. btree, unless mentioned explicitly):

CREATE INDEX idx_t_d ON t (d);

And now let’s try to run the query again:

                               QUERY PLAN
------------------------------------------------------------------------
 Index Scan using idx_t_d on t  (cost=0.29..9.97 rows=34 width=332)
   Index Cond: ((d >= '2001-01-01 00:00:00'::timestamp without time zone)
            AND (d <= '2001-02-01 00:00:00'::timestamp without time zone))
(2 rows)

and this runs in 0.5ms, so roughly 40x faster. But that was of course a simple index, created directly on the column, not an expression index. So let’s assume we instead need to select data from the 1st day of each month, using a query like this:

SELECT * FROM t WHERE EXTRACT(day FROM d) = 1;

which however can’t use the index, as it needs to evaluate an expression on the column while the index is built on the column itself, as shown by EXPLAIN ANALYZE:

                               QUERY PLAN
------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..4416.75 rows=365 width=332)
                (actual time=0.045..40.601 rows=2401 loops=1)
   Filter: (date_part('day'::text, d) = '1'::double precision)
   Rows Removed by Filter: 70649
 Planning time: 0.209 ms
 Execution time: 43.018 ms
(5 rows)

So not only does this have to do a sequential scan, it also has to evaluate the expression, increasing the query duration to 43ms.

The database is unable to use the index for multiple reasons. Indexes (at least btree indexes) rely on querying sorted data, provided by the tree-like structure, and while the range query can benefit from that, the second query (with the `extract` call) can’t.

Note: Another issue is that the set of operators supported by indexes (i.e. that can be evaluated on indexes directly) is very limited. And the “extract” function is not supported, so the query can’t work around the ordering issue by using a Bitmap Index Scan.

In theory the database might try to transform the condition into range conditions, but that is extremely difficult and specific to each expression. In this case we’d have to generate an infinite number of such “per-day” ranges, because the planner does not really know the min/max timestamps in the table. So the database does not even try.
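
To illustrate, a range-based rewrite of EXTRACT(day FROM d) = 1 would have to enumerate one range per month, for every year the table might contain – a sketch, and clearly impractical:

SELECT * FROM t
 WHERE (d >= '2000-01-01' AND d < '2000-01-02')
    OR (d >= '2000-02-01' AND d < '2000-02-02')
    OR (d >= '2000-03-01' AND d < '2000-03-02');
-- ... and so on, for every month of every year in the table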

But while the database does not know how to transform the conditions, developers often do. For example, a condition like

(column + 1) >= 1000

it’s not difficult to rewrite it like this

column >= (1000 - 1)

which works just fine with indexes, because the indexed column is compared directly to a constant.
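
A minimal sketch of that rewrite, using a hypothetical table with an integer column c and a plain btree index on it:

CREATE TABLE example (c int);
CREATE INDEX ON example (c);

-- expression on the column, can't use the index
SELECT * FROM example WHERE (c + 1) >= 1000;

-- bare column compared to a constant, can use the index
SELECT * FROM example WHERE c >= 999;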

But what if such a transformation is not possible, as with our example query

SELECT * FROM t WHERE EXTRACT(day FROM d) = 1;

In this case the developer would face the same issue of not knowing the min/max of the d column, and even with that knowledge the rewrite would generate a huge number of ranges.

Well, this blog post is about expression indexes, and so far we have only used regular indexes, built on the column directly. So, let’s create the first expression index:

CREATE INDEX idx_t_expr ON t ((extract(day FROM d)));
ANALYZE t;

which then gives us this explain plan

                               QUERY PLAN
------------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=47.35..3305.25 rows=2459 width=332)
                        (actual time=2.400..12.539 rows=2401 loops=1)
   Recheck Cond: (date_part('day'::text, d) = '1'::double precision)
   Heap Blocks: exact=2401
   ->  Bitmap Index Scan on idx_t_expr  (cost=0.00..46.73 rows=2459 width=0)
                                (actual time=1.243..1.243 rows=2401 loops=1)
         Index Cond: (date_part('day'::text, d) = '1'::double precision)
 Planning time: 0.374 ms
 Execution time: 17.136 ms
(7 rows)

So while this does not give us the same 40x speedup as the index in the first example, that’s kinda expected as this query returns far more tuples (2401 vs. 32). Moreover those are spread through the whole table and not as localized as in the first example. So it’s a nice 2x speedup, and in many real-world cases you’ll see much larger improvements.

But the ability to use indexes for conditions with complex expressions is not the most interesting thing here – that’s kinda the reason why people create expression indexes in the first place. But it’s not the only benefit.

If you look at the two explain plans presented above (without and with the expression index), you might notice this:

                               QUERY PLAN
------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..4416.75 rows=365 width=332)
                (actual time=0.045..40.601 rows=2401 loops=1)
 ...
                               QUERY PLAN
------------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=47.35..3305.25 rows=2459 width=332)
                        (actual time=2.400..12.539 rows=2401 loops=1)
 ...

Right – creating the expression index significantly improved the estimates. Without the index we only have statistics (MCV lists + histograms) for the raw table columns, so the database does not know how to estimate the expression

EXTRACT(day FROM d) = 1

So it instead applies a default estimate for equality conditions, which is 0.5% of all rows – as the table has 73050 rows, we end up with an estimate of just 365 rows (73050 * 0.005 ≈ 365). It’s common to see much worse estimation errors in real-world applications.

With the index, however, the database also collects statistics on the columns of the index, and in this case the column contains the results of the expression. While planning, the optimizer notices this and produces a much better estimate.
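
If you’re curious, you can inspect those statistics in the pg_stats view, where the rows for an expression index appear under the index name (the attname is derived from the expression, here most likely date_part):

SELECT attname, n_distinct, most_common_vals
  FROM pg_stats
 WHERE tablename = 'idx_t_expr';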

This is a huge benefit, and may help with fixing some cases of poor query plans caused by inaccurate estimates. Yet most people are unaware of this handy tool.

And the usefulness of this tool only increased with the introduction of the JSONB data type in 9.4, because expression indexes are about the only way to collect statistics about the contents of JSONB documents.

When indexing JSONB documents, two basic indexing strategies exist. You can either create a GIN/GiST index on the whole document, e.g. like this

CREATE INDEX ON t USING GIN (jsonb_column);

which allows you to query arbitrary paths in the JSONB column, use the containment operator to match sub-documents, etc. That’s great, but you still have only the basic per-column statistics, which are not very useful, as the documents are treated as scalar values (and no one matches whole documents or uses ranges of documents).

Expression indexes, for example created like this:

CREATE INDEX ON t ((jsonb_column ->> 'id'));

will only be useful for that particular expression, i.e. this newly created index will help with

SELECT * FROM t WHERE jsonb_column ->> 'id' = '123';

but not for queries accessing other JSON keys, like ‘value’ for example

SELECT * FROM t WHERE jsonb_column ->> 'value' = 'xxxx';

This is not to say that GIN/GiST indexes on the whole document are useless, but you have to choose. Either you create a focused expression index, useful when querying a particular key and with the added benefit of statistics on the expression. Or you create a GIN/GiST index on the whole document, able to handle queries on arbitrary keys, but without the statistics.

However, in this case you can have your cake and eat it too, because you can create both indexes at the same time, and the database will choose which of them to use for individual queries. And you’ll have accurate statistics, thanks to the expression index.
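
For example, with the hypothetical jsonb_column used above:

CREATE INDEX ON t USING GIN (jsonb_column);
CREATE INDEX ON t ((jsonb_column ->> 'id'));
ANALYZE t;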

Sadly, you can’t eat the whole cake, because expression indexes and GIN/GiST indexes use different conditions

-- expression (btree)
SELECT * FROM t WHERE jsonb_column ->> 'id' = '123';

-- GIN/GiST
SELECT * FROM t WHERE jsonb_column @> '{"id" : 123}';

so the planner can’t use them at the same time – it can’t use the expression index for estimation and the GIN/GiST index for execution of the same query.

Tags: btree, expression, GIN, index, PostgreSQL, statistics