PostgreSQL 12: Implementing K-Nearest Neighbor Space Partitioned Generalized Search Tree Indexes

November 5, 2019 / 10 Comments / in 2ndQuadrant, Kirk’s PlanetPostgreSQL, PostgreSQL / by Kirk Roybal

The value of indexing

PostgreSQL provides a simple linear distance operator, <->. We will use it to find the points that are closest to a given location.
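
As a quick illustration of the operator itself (a minimal example, separate from the benchmark), it computes the Euclidean distance between two geometric values:

SELECT point '(0,0)' <-> point '(3,4)' AS distance;
-- distance = 5 (the classic 3-4-5 right triangle)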

Querying the data with no optimizations and no indexes, we see the following execution plan:

time psql -qtAc "
EXPLAIN (ANALYZE ON, BUFFERS ON)
SELECT name, location
FROM geonames
ORDER BY location <-> '(29.9691,-95.6972)'
LIMIT 5;
"  <-- closing quote
                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------
Limit  (cost=418749.15..418749.73 rows=5 width=38) 
        (actual time=2553.970..2555.673 rows=5 loops=1)
  Buffers: shared hit=100 read=272836
  ->  Gather Merge  (cost=418749.15..1580358.21 rows=9955954 width=38) 
                    (actual time=2553.969..2555.669 rows=5 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        Buffers: shared hit=100 read=272836
        ->  Sort  (cost=417749.12..430194.06 rows=4977977 width=38)
                 (actual time=2548.220..2548.221 rows=4 loops=3)
              Sort Key: ((location <-> '(29.9691,-95.6972)'::point))
              Sort Method: top-N heapsort  Memory: 25kB
              Worker 0:  Sort Method: top-N heapsort  Memory: 26kB
              Worker 1:  Sort Method: top-N heapsort  Memory: 25kB
              Buffers: shared hit=100 read=272836
              ->  Parallel Seq Scan on geonames  (cost=0.00..335066.71 rows=4977977 width=38) 
                                        (actual time=0.040..1637.884 rows=3982382 loops=3)
                    Buffers: shared hit=6 read=272836
Planning Time: 0.493 ms
Execution Time: 2555.737 ms

real    0m2.595s
user    0m0.011s
sys    0m0.015s

And here are the results (they are the same for every query that follows, so we’ll omit them from here on):

name                          |       location
------------------------------+----------------------
Cypress                       | (29.96911,-95.69717)
Cypress Pointe Baptist Church | (29.9732,-95.6873)
Cypress Post Office           | (29.9743,-95.67953)
Hot Wells                     | (29.95689,-95.68189)
Dry Creek Airport             | (29.98571,-95.68597)

So, 418749.73 is the optimizer cost to beat, and the query took about two and a half seconds (2555.673 ms) to execute. This is actually a very good result, using PostgreSQL with no optimizations at all against an 11 million row table. This is also why we selected a larger data set, as there would be very minimal difference using indexes against fewer than 10 million rows. Parallel sequential scans are fantastic, but that’s another article.
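
If you want to confirm the size of the data set for yourself, here is a quick check (the exact count performs its own sequential scan, so it takes a moment on a table this size):

psql -qtAc "SELECT count(*) FROM geonames;"
psql -qtAc "SELECT pg_size_pretty(pg_table_size('geonames'));"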

Adding GiST index

We begin the optimization process by adding a GiST index. Because our example query has a LIMIT clause of 5 items, we have very high selectivity. This will encourage the planner to use an index, so we’ll provide one that works fairly well with geometry data.

time psql -qtAc "CREATE INDEX idx_gist_geonames_location ON geonames USING gist(location);"

The act of creating the index has a bit of an expense.

CREATE INDEX
real    3m1.988s
user    0m0.011s
sys     0m0.014s

And then we run the same query again.

time psql -qtAc "
EXPLAIN (ANALYZE ON, BUFFERS ON)
SELECT name, location
FROM geonames
ORDER BY location <-> '(29.9691,-95.6972)'
LIMIT 5;
"
                                      QUERY PLAN
----------------------------------------------------------------------------------
Limit  (cost=0.42..1.16 rows=5 width=38) (actual time=0.797..0.881 rows=5 loops=1)
  Buffers: shared hit=5 read=15
  ->  Index Scan using idx_gist_geonames_location on geonames  
            (cost=0.42..1773715.32 rows=11947145 width=38) 
            (actual time=0.796..0.879 rows=5 loops=1)
        Order By: (location <-> '(29.9691,-95.6972)'::point)
        Buffers: shared hit=5 read=15
Planning Time: 0.768 ms
Execution Time: 0.939 ms

real    0m0.033s
user    0m0.011s
sys     0m0.013s

In this case, we see some pretty dramatic improvement. The estimated cost of the query is only 1.16! Compare that to the original cost of the unoptimized query at 418749.73. The actual time taken was .939 milliseconds (nine tenths of a millisecond), which compares to the 2.5 seconds of the original query. This result got a dramatically better estimate and about three orders of magnitude less runtime, at the cost of slightly more planning time (0.768 ms vs. 0.493 ms).

Let’s see if we can do better.

Adding an SP-GiST index

time psql -qtAc "CREATE INDEX idx_spgist_geonames_location ON geonames USING spgist(location);"
CREATE INDEX 

real    1m25.205s
user    0m0.010s
sys        0m0.015s

And then we run the same query again.

time psql -qtAc "
EXPLAIN (ANALYZE ON, BUFFERS ON)
SELECT name, location
FROM geonames
ORDER BY location <-> '(29.9691,-95.6972)'
LIMIT 5;
"
                                      QUERY PLAN
-----------------------------------------------------------------------------------
 Limit  (cost=0.42..1.09 rows=5 width=38) (actual time=0.066..0.323 rows=5 loops=1)
   Buffers: shared hit=47
   ->  Index Scan using idx_spgist_geonames_location on geonames  
            (cost=0.42..1598071.32 rows=11947145 width=38) 
            (actual time=0.065..0.320 rows=5 loops=1)
         Order By: (location <-> '(29.9691,-95.6972)'::point)
         Buffers: shared hit=47
 Planning Time: 0.122 ms
 Execution Time: 0.358 ms
(7 rows)

real    0m0.040s
user    0m0.011s
sys        0m0.015s

Wow! Now using an SP-GiST index, the query cost only 1.09, and executed in .358 milliseconds (a third of a millisecond).

Let’s examine some things about the indexes themselves, and see how they stack up to each other on disk.
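
The index sizes shown below can be pulled straight from the catalog; for example:

psql -qtAc "
SELECT c.relname AS index_name,
       pg_size_pretty(pg_relation_size(c.oid)) AS index_size
FROM pg_class c
WHERE c.relname IN ('idx_gist_geonames_location',
                    'idx_spgist_geonames_location');
"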

Index comparisons

index name                   | creation time | cost estimate | query time  | index size | plan time
-----------------------------+---------------+---------------+-------------+------------+----------
unindexed                    | 0 s           | 418749.73     | 2555.673 ms | 0          | 0.493 ms
idx_gist_geonames_location   | 3 m 1 s       | 1.16          | 0.939 ms    | 868 MB     | 0.768 ms
idx_spgist_geonames_location | 1 m 25 s      | 1.09          | 0.358 ms    | 523 MB     | 0.122 ms

Conclusions

So, we see that SP-GiST is more than twice the speed of GiST in execution, roughly six times faster to plan (0.122 ms vs. 0.768 ms), and about 60% of the size on disk. And (relevant to this article) it also supports KNN index searching as of PostgreSQL 12. For this type of operation, we have a clear winner.
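
One footnote on methodology: with both indexes in place, the planner simply chose the cheaper SP-GiST plan for the second run. Because DDL is transactional in PostgreSQL, you can temporarily hide one index to reproduce the other’s numbers without rebuilding anything; a sketch (keep in mind that DROP INDEX takes an exclusive lock on the table until the transaction ends):

BEGIN;
DROP INDEX idx_spgist_geonames_location;
-- with the SP-GiST index hidden, the planner falls back to the GiST index
EXPLAIN (ANALYZE ON, BUFFERS ON)
SELECT name, location
FROM geonames
ORDER BY location <-> '(29.9691,-95.6972)'
LIMIT 5;
ROLLBACK;  -- the index comes right back, nothing to rebuild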

Appendices

Setting up the data

For this article, we are going to use the data provided by the GeoNames Gazetteer.
This work is licensed under a Creative Commons Attribution 4.0 License
The Data is provided “as is” without warranty or any representation of accuracy, timeliness or completeness.

Create the structure

We start the process by creating a working directory and a little bit of ETL.

# change to our home directory
cd
mkdir spgist
cd spgist
# The base data is downloaded in the ETL step below.
# The file is 350MB, and unpacks to 1.5GB.
# It will expand to 2GB in PostgreSQL,
#   and then you will still need some room for indexes.
# Altogether, you will need about 3GB of space
#   for this exercise, for about 12M rows of data.

psql -qtAc "
CREATE TABLE IF NOT EXISTS geonames (
geonameid           integer primary key
,name               text 
,asciiname          text 
,alternatenames     text 
,latitude           numeric(13,5) 
,longitude          numeric(13,5)
,feature_class      text 
,feature_code       text 
,country            text 
,cc2                text 
,admin1             text 
,admin2             bigint 
,admin3             bigint 
,admin4             bigint 
,population         bigint 
,elevation          bigint 
,dem                bigint 
,timezone           text 
,modification       date  );

COMMENT ON COLUMN geonames.geonameid          
 IS ' integer id of record in geonames database';
COMMENT ON COLUMN geonames.name               
 IS ' name of geographical point (utf8) varchar(200)';
COMMENT ON COLUMN geonames.asciiname          
 IS ' name of geographical point in plain ascii characters, varchar(200)';
COMMENT ON COLUMN geonames.alternatenames     
 IS ' alternatenames, comma separated, ascii names automatically transliterated, 
    convenience attribute from alternatename table, varchar(10000)';
COMMENT ON COLUMN geonames.latitude           
 IS ' latitude in decimal degrees (wgs84)';
COMMENT ON COLUMN geonames.longitude          
 IS ' longitude in decimal degrees (wgs84)';
COMMENT ON COLUMN geonames.feature_class      
 IS ' http://www.geonames.org/export/codes.html, char(1)';
COMMENT ON COLUMN geonames.feature_code       
 IS ' http://www.geonames.org/export/codes.html, varchar(10)';
COMMENT ON COLUMN geonames.country            
 IS ' ISO-3166 2-letter country code, 2 characters';
COMMENT ON COLUMN geonames.cc2                
 IS ' alternate country codes, comma separated, ISO-3166 2-letter country code, 
    200 characters';
COMMENT ON COLUMN geonames.admin1             
 IS ' fipscode (subject to change to iso code), see exceptions below, 
    see file admin1Codes.txt for display names of this code; varchar(20)';
COMMENT ON COLUMN geonames.admin2             
 IS ' code for the second administrative division, a county in the US, 
    see file admin2Codes.txt; varchar(80) ';
COMMENT ON COLUMN geonames.admin3             
 IS ' code for third level administrative division, varchar(20)';
COMMENT ON COLUMN geonames.admin4             
 IS ' code for fourth level administrative division, varchar(20)';
COMMENT ON COLUMN geonames.population         
 IS ' bigint (8 byte int) ';
COMMENT ON COLUMN geonames.elevation          
 IS ' in meters, integer';
COMMENT ON COLUMN geonames.dem                
 IS ' digital elevation model, srtm3 or gtopo30, average elevation of 3''x3'' 
    (ca 90mx90m) or 30''x30'' (ca 900mx900m) area in meters, integer. 
    srtm processed by cgiar/ciat.';
COMMENT ON COLUMN geonames.timezone           
 IS ' the iana timezone id (see file timeZone.txt) varchar(40)';
COMMENT ON COLUMN geonames.modification       
 IS ' date of last modification in yyyy-MM-dd format';
"  #<-- Don't forget the closing quote

ETL

wget http://download.geonames.org/export/dump/allCountries.zip
unzip allCountries.zip

# do this, and go get a coffee.  This took nearly an hour.
#   A few lines will fail to load; they don't really matter much.
IFS=$'\n'

for line in $(<allCountries.txt)
do
    echo -n "$line" |
        psql -qtAc "COPY geonames FROM STDIN WITH CSV DELIMITER E'\t';" \
        2>> errors.txt
done
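
As an aside, a single bulk COPY is normally far faster than this line-at-a-time loop, but it aborts on the first bad row instead of skipping it. If your copy of allCountries.txt turns out to be clean, a sketch like this (using the common trick of pointing QUOTE at an unused control character so CSV quote handling never kicks in on the tab-delimited dump) should load in minutes:

psql -qtAc "\copy geonames FROM 'allCountries.txt' WITH (FORMAT csv, DELIMITER E'\t', QUOTE E'\b')"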

Clean up and Set up

Everything else we do from inside psql:

psql
-- This command requires the installation
--  of postgis2 from your OS package manager.
-- For OS/X that was `port install postgresql12-postgis2`
-- it will be something similar on most platforms.
-- (e.g. apt-get install postgresql12-postgis2, 
--  yum -y install postgresql12-postgis2, etc.)
CREATE EXTENSION postgis;
CREATE EXTENSION postgis_topology;

ALTER TABLE geonames ADD COLUMN location point;

-- Go get another cup of coffee, this is going to rewrite the entire table with the new geo column.
UPDATE geonames SET location = ('(' || latitude || ', ' || longitude || ')')::point;

DELETE FROM geonames WHERE latitude IS NULL or longitude IS NULL;
-- DELETE 32   -- In my case, this ETL anomaly was too small
--  to bother fixing the records

-- Bloat removal from the update and delete operations
CLUSTER geonames USING geonames_pkey;
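
-- One more optional step (my addition, not strictly required):
--   CLUSTER rewrites the table but does not refresh planner
--   statistics, so re-analyze before re-running any benchmarks.
ANALYZE geonames;
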
Tags: PG12, postgres, PostgreSQL 12
10 replies
  1. Andreas Joseph Krogh says:
     November 6, 2019 at 9:12 am

     You misinterpreted the execution time; “.939 ms” is less than 1 *millisecond*, so the effects of having these indexes are even better than what you describe.

  2. Pedro Romano says:
     November 6, 2019 at 12:11 pm

     “The actual time taken was .939 seconds (nine tenths of a second), which compares to the 2.5 seconds of the original query.”

     I believe it’s better than that. The timings on the query plan are shown in milliseconds. “Execution Time: 0.939 ms” is actually nine tenths of a millisecond, not of a second.

  3. foo bar says:
     November 7, 2019 at 5:47 pm

     I noticed that the planning and execution times in the explain output are in ms, yet the stated values in the article are in seconds (“nine tenths of a second” should be “nine tenths of a millisecond”) — the speedup is therefore 1000x better than claimed!

  4. Miles Elam says:
     November 7, 2019 at 6:25 pm

     While the unindexed time is impressive for the lack of an index, the GiST and SP-GiST times are not .939 sec and .358 sec. All times are in *milliseconds*. GiST returns in slightly less than 1 ms, not 1 s. SP-GiST returns in a third of a millisecond.

     Performance for GiST and SP-GiST is in fact three orders of magnitude better than what you summarized!

  5. Kirk Roybal says:
     November 7, 2019 at 6:48 pm

     Several astute readers have noted that the query times were in milliseconds, not seconds. I gladly accept the fact that the results are 1000x better than anticipated 🙂

  6. Sean Brewer says:
     November 13, 2019 at 11:19 pm

     It looks like you used Postgres’ native point type (https://www.postgresql.org/docs/12/datatype-geometric.html) and not the PostGIS point type?

     • Kirk Roybal says:
       November 14, 2019 at 4:55 pm

       Yes, this rather simple example would work without PostGIS. I realized after publication that keeping it simple, well, kept it simpler than I thought.

       • Olivier says:
         December 1, 2019 at 6:29 pm

         Then why “CREATE EXTENSION postgis”?

  7. Amul says:
     May 15, 2020 at 9:20 am

     Hi, thanks for the article. Just wanted to check whether this technique is ready for production usage, or whether it needs further fine tuning for a low to medium scale.

  8. Irshad Khan says:
     September 10, 2020 at 12:31 pm

     SP-GiST indexes cannot be used for sorting or to support unique constraints, and unlike GiST they cannot be created on multiple columns. They can, however, be used to support exclusion constraints. Another difference from GiST is that clustering on an SP-GiST index is impossible.
