Performing parallel ETL with Greenplum’s gpfdist and external tables

June 8, 2011 / in Greenplum / by Gabriele Bartolini

One of the coolest features that Greenplum offers to data warehousing
and business intelligence operators, as far as ETL
is concerned, is the combination of read-only external tables
with gpfdist, Greenplum's parallel file distribution server.

The typical use case for this solution is the parallel loading of text files
(coming from heterogeneous sources such as databases or applications) into a Greenplum
data warehouse. For those of you who want to know more about Greenplum, I suggest that
you visit the Greenplum website, download
the Community Edition,
install it and start testing it.

gpfdist is very similar to an HTTP server. It exposes via TCP/IP a local directory which contains
flat text files (typically delimited or CSV), served read-only. These files can be periodically
generated from an operational database via unload/export/dump (depending on your business requirements,
this could mean once a day or more frequently). On our typical PostgreSQL-driven
operational databases, we normally perform this operation using the
COPY command.
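For instance, a nightly unload of the previous day's sales from the operational database could look like the following sketch. The sales table and the target file path are hypothetical, but COPY with a query and a delimited text format is standard PostgreSQL:

-- Unload yesterday's sales into a delimited text file
-- that gpfdist can then serve (table and path are hypothetical).
COPY (
    SELECT ts, customer, product, quantity, amount
    FROM sales
    WHERE ts >= date_trunc('day', now()) - interval '1 day'
      AND ts <  date_trunc('day', now())
) TO '/var/lib/flat_files/sales_20110607.txt'
WITH DELIMITER '|';

Note that COPY ... TO writes the file on the database server, so it must run with superuser privileges; from a client you would use psql's \copy instead.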

A very simple startup of the file distribution server is:

gpfdist -d ${FLAT_FILES_DIR} -p 8081 -l /tmp/gpfdist.8081.log

where ${FLAT_FILES_DIR} is the directory that contains the flat files.
Note: for the sake of simplicity I have used TCP port 8081 and a log file in the system temporary directory;
it is advisable to read the administration guide for more detailed information on command line options.

It is clear that the extraction process made available by gpfdist
can take advantage of parallelism at multiple stages:

  • host: you might have different hosts where your files reside – one of our typical setups at 2ndQuadrant
    is a distributed operational database with PL/Proxy on multiple servers
  • file system: depending on your physical disk layout and the number of spindles
  • CPUs
  • network: this might be important on large installations where massive data loading is a priority and a single
    network interface might represent a bottleneck

Greenplum's documentation states that a single gpfdist application can serve
files to the Greenplum segments at a rate varying from 200MB/s to 350MB/s (this of course
depends on the server's setup and the characteristics of the files).

However, file distribution is only one aspect of the described solution: like every server
application, gpfdist needs a client to initiate the extraction process. One way
to do this in Greenplum is through the creation of an external table on the
master, which maps to one or more locations defined with the gpfdist:// protocol. A very
simple example of the definition of such an external table is:

CREATE EXTERNAL TABLE oltp_sales (
    ts TIMESTAMP, customer TEXT, product TEXT, quantity REAL, amount REAL
)
LOCATION ('gpfdist://oltp-server:8081/sales_*')
FORMAT 'TEXT';

This statement on its own won't initiate any extraction. It simply creates the definition of an external table.
Greenplum is now aware that this table is a sort of virtual table, whose data is served by one (in this case)
gpfdist server. A query like the following will initiate parallel extraction:

SELECT * FROM oltp_sales;

Every Greenplum segment will directly connect to gpfdist, concurrently. The server will divide the
file into chunks and distribute the work among the segments. You can easily understand the level of parallelism
that this architecture gives to ETL operators (imagine multiple source servers, multiple network cards in a 10Gb
environment, multiple Greenplum segments).

The cool thing about this approach is that it allows you to perform in-database transformation operations
via SQL, moving the whole ETL process into Greenplum. You can perform dimension lookups directly in
your SELECT through joins, and then INSERT data into your warehouse fact tables.
Greenplum, like PostgreSQL, allows users to perform INSERT INTO ... SELECT ...
operations.
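A minimal sketch of such an in-database transformation, assuming hypothetical dimension tables (dim_customer, dim_product) and a hypothetical fact table (fact_sales), could be:

-- Resolve dimension keys with joins while loading the fact table;
-- all table names other than oltp_sales are hypothetical.
INSERT INTO fact_sales (ts, customer_id, product_id, quantity, amount)
SELECT s.ts, c.customer_id, p.product_id, s.quantity, s.amount
FROM oltp_sales s
JOIN dim_customer c ON c.customer_name = s.customer
JOIN dim_product p ON p.product_name = s.product;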

I am sure you are now all well aware of the potential of Greenplum's file distribution server and external tables.
I suggest however that you download the Community Edition and read the documentation, as it gives
you more detailed information about the gpfdist application and external tables: it is also
very important, for production environments, to put error handling procedures in place.
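Greenplum's external tables support single-row error isolation for exactly this purpose. A hedged sketch, assuming the Greenplum 4.x syntax of the time and a hypothetical error table err_sales (check the documentation of your version for the exact options):

-- Skip malformed rows instead of aborting the whole load,
-- logging them into err_sales (hypothetical name).
CREATE EXTERNAL TABLE oltp_sales_safe (
    ts TIMESTAMP, customer TEXT, product TEXT, quantity REAL, amount REAL
)
LOCATION ('gpfdist://oltp-server:8081/sales_*')
FORMAT 'TEXT'
LOG ERRORS INTO err_sales
SEGMENT REJECT LIMIT 100 ROWS;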

Greenplum's gpfdist is a flexible and versatile tool for system integrators
and data warehousing experts. Read-only external tables are only one way of using it; there are
several others, including the gpload utility and the integration with Kettle (Pentaho Data Integration).

For information on Greenplum's Community Edition, assistance and help, you can subscribe
to the community forums.

Tags: extraction transformation loading, gpfdist, greenplum, massive parallel etl, readonly external tables