Performing ETL using Kettle with GPFDIST and GPLOAD

November 7, 2011 / Greenplum / by Giulio Calacoci

Scenario:

We have a remote data source, served by a gpfdist server. We need to import the data into a Greenplum database, while performing some ETL manipulation during the import.

It is possible to accomplish this goal with a simple transformation in a few steps using Kettle.

For the sake of simplicity, I assume that:

  • You know Kettle basics, like connections, jobs, and transformations – for further information you can look at my previous article on Kettle and Greenplum
  • You have a very simple test database on Greenplum, with one table. For the purposes of this article I used a small table called “test” with two columns, “id” and “md5” (a minimal DDL sketch for it follows this list).
  • You have a gpfdist server running, with a CSV file exposed.
  • You have gpload installed on your machine. You can download it for your architecture from the Greenplum site: http://www.greenplum.com/community/downloads/
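
For reference, this is a minimal sketch of the “test” target table assumed above (the DDL is not taken from the original transformation, and the distribution key is an arbitrary choice for this example):

-- Sketch of the target table used throughout this article.
drop table if exists "public".test;
create table "public".test (
id int,
md5 text
)
distributed by (id); -- arbitrary distribution key, adjust it to your data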

Using external tables

Efficient data loads can be achieved by combining Greenplum’s external tables with its parallel file server, gpfdist.

Using the “SqlScript” component, we can create an external table at the beginning of our transformation.

The “Create External Table” step, as shown below, creates an external table named “external_samples”. In our case the data is provided by a single location, but two or more locations on the same ETL server can be used. One instance of gpfdist is running on this server, on port 8081. Double-click on the “SqlScript” component and insert the code for the creation of the external table – for example, this is our script:

drop external table if exists "public".external_samples;
create external table "public".external_samples (
id int,
md5 text
)
location('gpfdist://bravo01:8081/testdata.txt')
format 'TEXT' (DELIMITER ',');

This is a simple test table, with two columns:

  • id
  • md5

[Screenshot: SqlScript step containing the external table definition]

Thanks to the “DELIMITER” keyword at the end of the SQL script, Greenplum will use the ‘,’ character as the column delimiter, as in a CSV file.
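
If the exposed file were a proper CSV with quoted values, the same external table could also be declared with the CSV format instead of TEXT; a minimal variant, shown here only as a suggestion and not part of the original transformation:

drop external table if exists "public".external_samples;
create external table "public".external_samples (
id int,
md5 text
)
location('gpfdist://bravo01:8081/testdata.txt')
format 'CSV' (DELIMITER ',');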

Now we have a read-only table. Drag a “Table input” component and drop it in the design panel. Connect the script component to the “Table input” component, and double-click on the latter.

Insert the SELECT statement, specifying the columns you wish to import. For example:

SELECT id, md5 FROM external_samples;

This will extract the data from the external table.
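
Before wiring up the rest of the transformation, it can be useful to run a quick sanity check from psql (or from the same SQL script step) to verify that the external table can actually reach the gpfdist server; this query is only a suggested check, not part of the original transformation:

-- If gpfdist is unreachable or the file is missing, this query fails immediately.
select count(*) from "public".external_samples;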

 

[Screenshot: Table input step with the SELECT statement]

This stream of data can now be used for manipulations. For the sake of simplicity, no ETL operations will be performed in this article (for an introduction to ETL with Greenplum and Kettle, you can read my previous article: “ETL with Kettle and Greenplum”), so we can focus on the use of the GPFDIST and GPLOAD components.

Using the Greenplum Load component

From the “Bulk loading” folder inside the Design tab, on the left side of the Kettle window, drag a “Greenplum Load” component and drop it near the “Table input” element. Then connect the two elements.

[Screenshot: Bulk loading folder and the Greenplum Load step in the design panel]

Now double-click on the “Greenplum Load” component.

Choose, as usual, the connection to be used.

[Screenshot: Greenplum Load configuration dialog]

IMPORTANT

Due to a bug, if you fill in the “Target schema” field, the component will generate a configuration file in the wrong format. Leave the schema field empty and fill in the “Target Table” field using the <schema>.<table> syntax (in our case, public.test).

We are almost there.

We need to map the table fields to the stream fields. However, the external table that we want to create does not exist on the server yet, so we have to map them manually.

On the “localhost names” panel you have to specify which port number the gpfdist file distribution program uses – typically 8081 works fine, but be sure to choose a free port. You also have to insert the IP address of the machine running gpload (usually localhost or 127.0.0.1). If the host has several NIC cards, the host name or IP address of each NIC card can be specified.

[Screenshot: “localhost names” panel with the gpfdist port and host settings]

In the “GP Configuration” tab you can insert the path where the gpload utility is installed. It is also possible to define the name of the gpload control file that will be generated and the name of the data file(s) that will be written for the subsequent load operations into the target tables (you can also specify the encoding and the file column delimiter).
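
For reference, the control file that the component generates for gpload is a YAML document. Its exact content depends on the settings above; for the scenario in this article it would look roughly like the sketch below, where the database name, user and file names are assumptions used only for illustration:

VERSION: 1.0.0.1
DATABASE: testdb         # assumed database name
USER: gpadmin            # assumed user
HOST: localhost          # Greenplum master host (assumed)
PORT: 5432
GPLOAD:
  INPUT:
    - SOURCE:
        LOCAL_HOSTNAME:
          - 127.0.0.1    # machine running gpload, as set in the “localhost names” panel
        PORT: 8081       # gpfdist port chosen above
        FILE:
          - load0.dat    # data file written by the Kettle step (assumed name)
    - FORMAT: text
    - DELIMITER: ','
    - COLUMNS:
        - id: int
        - md5: text
  OUTPUT:
    - TABLE: public.test
    - MODE: insert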

[Screenshot: GP Configuration tab]

Once all the elements are configured and connected, your transformation should look like this:

[Screenshot: the completed transformation, with the steps connected in order]

Execute the transformation. Everything should work fine, and the data will be imported into the destination table.
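
Once the run has finished, a quick query on the destination table (again, only a suggested check) confirms that the rows were loaded:

-- The row count should match the number of lines served by gpfdist.
select count(*) from "public".test;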

Tags: etl, gpfdist, Gpload, greenplum, Kettle