Load data in Postgres-XL at over 9M rows/sec

October 13, 2016 / by Pavan Deolasee

We are often faced with this question: “What’s the ingestion rate of Postgres-XL?”, and I realised I didn’t have a very good answer to it. Since we’ve recently made some good improvements in this area, I was curious to know too, so I decided to run a benchmark.

Hardware and Software

For the tests, I used a Postgres-XL cluster running on EC2 instances. Since COPY has to go through the coordinator, it seemed reasonable to use a compute-optimised c3.8xlarge instance for running the coordinator. Similarly, storage-optimised i2.xlarge instances are more appropriate for the datanodes. Both instance types have attached SSD disks, though the i2.xlarge instances have more storage than the c3.8xlarge instance.
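
For context, the node layout of a Postgres-XL cluster can be listed from a coordinator through the pgxc_node catalog. This is just an illustrative query, not part of the benchmark itself:

-- List the coordinators and datanodes registered in the cluster
-- (node_type is 'C' for a coordinator, 'D' for a datanode)
SELECT node_name, node_type, node_host, node_port
FROM pgxc_node
ORDER BY node_type, node_name;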

The next question was how to generate data for the test. We had used the TPC-H benchmark for previous performance tests, and it had everything I needed for this simple, focused test. I just picked one table from the TPC-H benchmark, the lineitem table. It’s a very wide table, with many columns including a wide VARCHAR column. The ingestion benchmarks I’d seen earlier used a very narrow table, so for fairness I stripped away a few columns. The resulting table was still much wider than those, each row being about 90 bytes including tuple headers.

CREATE TABLE lineitem (
   l_orderkey BIGINT NOT NULL,
   l_partkey BIGINT NOT NULL,
   l_suppkey BIGINT NOT NULL,
   l_linenumber BIGINT NOT NULL,
   l_shipdate DATE NOT NULL,
   l_receiptdate DATE NOT NULL,
   l_shipinstruct CHAR(25) NOT NULL
) DISTRIBUTE BY HASH (l_orderkey);
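
As a rough cross-check of the ~90 bytes/row figure, the average composite row size can be sampled from a coordinator once some data is loaded. This is a minimal sketch; pg_column_size() on a whole-row value measures the composite datum rather than the exact on-disk heap tuple, so treat the number only as an approximation:

-- Approximate average row width, sampled from a subset of the table
SELECT avg(pg_column_size(l.*)) AS avg_row_bytes
FROM (SELECT * FROM lineitem LIMIT 100000) AS l;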

The TPC-H benchmark comes with a data generation tool called dbgen. You can generate data for just one table, at the required scale, and split it into multiple chunks if needed. So I generated sample data of 3 billion rows, split into 24 chunks. Since the c3.8xlarge instance has two SSDs attached to it, the data files were split across the two disks to utilise the full IO bandwidth on the coordinator side.

# Generate 3 billion lineitem rows (scale factor 500) in 24 chunks,
# keeping only the 7 columns of the trimmed table definition above.
# Chunks 1-12 are written to the first SSD, chunks 13-24 to the second.
for i in `seq 1 12`; do
  dbgen -C 24 -S $i -T L -s 500 -o | cut -d '|' -f 1,2,3,4,11,13,14 > /data1/pgxl/lineitem_part_$i.csv&
done
for i in `seq 13 24`; do
  dbgen -C 24 -S $i -T L -s 500 -o | cut -d '|' -f 1,2,3,4,11,13,14 > /data2/pgxl/lineitem_part_$i.csv&
done

Benchmarks

We ran two benchmarks. First, we used a single coordinator to run all 24 COPY processes concurrently.

# Run 24 concurrent COPY commands through a single coordinator, reading
# chunks 1-12 from the first SSD and chunks 13-24 from the second.
for i in `seq 1 12`; do
  time psql -c "COPY lineitem FROM '/data1/pgxl/lineitem_part_$i.csv' WITH (FORMAT CSV, DELIMITER '|')"&
done
for i in `seq 13 24`; do
  time psql -c "COPY lineitem FROM '/data2/pgxl/lineitem_part_$i.csv' WITH (FORMAT CSV, DELIMITER '|')"&
done

The total size of the table after the load was approximately 283GB. The same test was repeated on an UNLOGGED table to see whether WAL becomes a bottleneck. The results are very interesting and clearly show that Postgres-XL is able to ingest data at a very high rate.

Type       Time (Sec)   Rows/sec    MB/sec
Unlogged   444          6,756,856   683
Regular    458          6,550,314   662
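
For reference, here is a minimal sketch of how the unlogged variant of the table could be declared, and how the per-datanode on-disk size can be checked from the coordinator; the table name lineitem_unlogged and the node name datanode_1 are hypothetical, and the 283GB figure above is the sum of the sizes reported by the individual datanodes:

-- Unlogged variant of the same table (its data skips WAL)
CREATE UNLOGGED TABLE lineitem_unlogged (
   l_orderkey BIGINT NOT NULL,
   l_partkey BIGINT NOT NULL,
   l_suppkey BIGINT NOT NULL,
   l_linenumber BIGINT NOT NULL,
   l_shipdate DATE NOT NULL,
   l_receiptdate DATE NOT NULL,
   l_shipinstruct CHAR(25) NOT NULL
) DISTRIBUTE BY HASH (l_orderkey);

-- Size of the table's shard on one datanode, queried via the coordinator
EXECUTE DIRECT ON (datanode_1)
  'SELECT pg_size_pretty(pg_total_relation_size(''lineitem''))';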

So we could load at a rate of 6.7M rows/sec, or 683 MB/sec. That’s an impressive rate at which to load data into a relational database, and it matches the performance numbers shown by other PostgreSQL-based scale-out solutions on comparable hardware.

To see whether more performance could be extracted from the datanodes, and to check whether the coordinator was a bottleneck, we added another coordinator to the cluster. This time the COPY processes were split equally between the two coordinators, so that each coordinator ran 12 concurrent COPY processes. The results showed a good 34% speed-up in ingestion rate, loading 3 billion rows in a little over 5 minutes. The run with the unlogged table topped 9M rows/sec. Measured in bytes per hour, this translates into an ingestion rate of more than 3TB/hr. (A sketch of how the COPY jobs might be split across two coordinators follows the results table below.)

Type       Time (Sec)   Rows/sec    MB/sec
Unlogged   332          9,036,277   913
Regular    354          8,474,700   857
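
Here is a minimal sketch of how the jobs might be driven against two coordinators. The hostnames coord1 and coord2 are hypothetical, and it assumes each coordinator has its half of the chunk files on its own local SSDs:

# 12 concurrent COPY jobs per coordinator; each coordinator reads its
# chunk files from local disk, just as in the single-coordinator run.
for i in `seq 1 12`; do
  time psql -h coord1 -c "COPY lineitem FROM '/data1/pgxl/lineitem_part_$i.csv' WITH (FORMAT CSV, DELIMITER '|')"&
done
for i in `seq 13 24`; do
  time psql -h coord2 -c "COPY lineitem FROM '/data2/pgxl/lineitem_part_$i.csv' WITH (FORMAT CSV, DELIMITER '|')"&
done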

It’s important to note that adding one more coordinator to the cluster improved the ingestion rate. So Postgres-XL’s multi-coordinator model helps in more than one way.

Just in case you’re curious, I also ran a simple aggregate query on the entire table, and it finished in 36 seconds. That’s a cool 82M rows/sec read rate, or roughly 30TB/hr of collective scan rate. (Note: much of the data may have been cached at the datanodes.)

postgres=# select count(*) from lineitem ;
 count 
------------
 3000028242
(1 row)
Time: 36252.452 ms

Note: To check whether the COPY rate can be sustained over a larger dataset, we repeated the test with a regular table 3 times, without truncating the table in between, and saw no drop in performance. The table size at the end of the 3rd run was over 800GB, which is more than the collective RAM of the datanodes.

Even though Postgres-XL didn’t quite hit 10M rows/sec in these tests, it looks like it’s only a matter of tweaking a few knobs, doesn’t it? Stay tuned for my next blog post, where I’ll run some more tests to see if we can really hit that magic figure of 10M rows/sec.

Conclusions:

  1. Postgres-XL topped a 9M rows/sec ingestion rate. We would like to test further to check whether adding more datanodes and/or coordinators can improve the load rate even more.
  2. Postgres-XL’s ability to support multiple coordinators is very useful for ingesting large amounts of data.
  3. The results show an ingestion rate of nearly 1GB/sec.

These tests were performed on Postgres-XL R1.3. Do reach out to us if you want to learn more about Postgres-XL, or if you need help validating your use case for Postgres-XL.

Edits:

  1. Vladimir asked how many datanodes were used in these tests. The tests were run on a cluster with 16 datanodes, running on i2.xlarge EC2 instances. In addition, we used 1 or 2 coordinators (see above) running on c3.8xlarge instances.
Tags: COPY, ingest, ingestion, open source, performance, postgres-xl, PostgreSQL
6 replies
  1. Vladimir says:
    October 13, 2016 at 3:54 pm

    How many datanodes were used in this test?

    • Pavan Deolasee says:
      October 13, 2016 at 4:34 pm

      Hi Vladimir,

      These tests were done on 16 datanodes, running on i2.xlarge EC2 instances. I now realise that this crucial information is missing from the blog and I’m editing it to add that. Thanks for spotting.

      Thanks,
      Pavan

  2. Vladimir says:
    October 15, 2016 at 10:12 am

    Any chance to use pglogical as an easy way to replicate data from regular PostgreSQL instances to Postgres-XL?

    • craig.ringer says:
      October 17, 2016 at 2:38 am

We’ve looked at it, but the required integration of pglogical into the XL coordinator node isn’t present yet.

      pglogical applies changes at a low level that bypasses the normal executor, the SPI, etc, so XL’s datanode routing won’t work. At minimum we’d need to teach pglogical how to interact with the XL datanode router. Customer priorities so far have been in other directions, but it’s an idea on the backburner for when time and priorities permit.

      Ideally we’d actually have pglogical interrogate XL’s data router to determine where the data should go and forward it to pglogical instances on each datanode to be applied. So the pglogical on the coordinator would operate as a proxy/router. This would be more efficient but require more changes, since XL would have to “tunnel” pglogical messages through the coordinator/datanode links.

      If this is a priority for you please reach out to [email protected] .

  3. Kakoni says:
    October 20, 2016 at 12:05 pm

Could you share how you configured the datanodes? Or were these with default configs?

    • Pavan Deolasee says:
      October 23, 2016 at 7:48 am

      I used the following non-default configuration options for the datanodes:

      shared_buffers=4GB
      checkpoint_timeout = 1h
      max_wal_size = 16GB
      fsync=off
      full_page_writes=off
      checkpoint_completion_target=0.9
