Mapreduce in Greenplum 4.1

October 31, 2011 / in Greenplum / by Carlo Ascani

MapReduce is a very popular software framework, introduced by Google (TM) in 2004.
It is a large topic, and it is not possible to cover all of its aspects in a single blog article.
This is a simple introduction to using _MapReduce_ in Greenplum 4.1.


## What is MapReduce exactly?
MapReduce's main goal is to process highly distributable problems across huge datasets using a large number of computers (nodes).
As you may guess, this suits Greenplum perfectly: Greenplum is at ease with huge distributed datasets and integrates with the SQL language.
MapReduce consists of two separate steps: _Map_ and _Reduce_.
### Map step
During this step, the main problem is partitioned into smaller sub-problems that are passed to child nodes, recursively.
This process leads to a multi-level tree structure.
### Reduce step
During this step, all the sub-problem solutions are merged to obtain the solution to the initial problem.
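To make the two steps concrete before touching Greenplum, here is a minimal, framework-free sketch in plain Python; the sample rows and function names are hypothetical, chosen to mirror the example built later in this article:

from collections import defaultdict

# Hypothetical sample rows (x, y), standing in for a distributed dataset.
rows = [(1, 42.0), (2, 7.0), (3, 13.0)]

# Map step: each input row is turned into zero or more (key, value) pairs.
def map_row(x, y):
    yield ('Sum of x', x)
    yield ('Sum of y', y)

# Reduce step: all values sharing a key are merged into a single result.
def reduce_sum(values):
    return sum(values)

grouped = defaultdict(list)
for x, y in rows:
    for key, value in map_row(x, y):
        grouped[key].append(value)

print({key: reduce_sum(values) for key, values in grouped.items()})
# {'Sum of x': 6, 'Sum of y': 62.0}

In a real MapReduce system the map calls run in parallel on the nodes that hold the data, and only the grouped (key, value) pairs travel to the reducers.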
## Install MapReduce in Greenplum
Do you think installing MapReduce in Greenplum is a difficult task? The answer is no: MapReduce is already included in Greenplum!
## Let’s get practical
I assume that you have a Greenplum 4.1 system installed.
Run gpmapreduce --version:

$ gpmapreduce --version
gpmapreduce - Greenplum Map/Reduce Driver 1.00b2

Perfect, we can go on.
The main ingredient here is a specially formatted file, which we will conventionally call test.yml from now on.
As you may guess, it is a YAML file, which defines all the parts that the MapReduce data flow needs in order to complete:
* Input Data
* Map Function
* Reduce Function
* Output Data
The Greenplum MapReduce specification file has a specific YAML schema. I invite you to have a look at the Admin Guide for details.
In particular, MapReduce is handled in Chapter 23.
For the sake of this article, we will focus on function definitions.
Let’s start writing a text file named test.yml with the mandatory header:

%YAML 1.1
---
VERSION: 1.0.0.1
DATABASE: dbname
USER: gpadmin
HOST: host

where dbname and host are the name of the database and the host that MapReduce will connect to.
### Input Data
Input Data can be obtained in many ways; in this example we will use an SQL SELECT statement.
Let’s create a table in the database to get data from:

$ psql -c "CREATE TABLE mydata AS SELECT i AS x,
floor(random()*100) AS y FROM generate_series(1,5) i" 

This will create a table with 5 rows and this structure:

=# \d mydata
      Table "public.mydata"
 Column |       Type       | Modifiers
--------+------------------+-----------
 x      | integer          |
 y      | double precision |
Distributed by: (x)

The set of rows of this table is our Input Data.
Let’s define it in the MapReduce configuration file, by appending this to test.yml:

DEFINE:
  - INPUT:
      NAME: my_input_data
      QUERY: SELECT x, y FROM mydata

That is self-explanatory: it just selects all rows from the mydata table as input data for MapReduce.
### Map Function
It is very important to understand that a Map function takes as input *a single row*, and produces *zero or more* output rows.
Map functions can be written in C, Perl or Python.
They reside directly in the YAML configuration file.
Parameter management varies between programming languages (please consult the Admin Guide for details).
Let’s see an example of a map function written in Python. You can append the following to test.yml:

  - MAP:
      NAME: my_map_function
      LANGUAGE: PYTHON
      PARAMETERS: [x integer, y float]
      RETURNS: [key text, value float]
      FUNCTION: |
        yield {'key': 'Sum of x', 'value': x }
        yield {'key': 'Sum of y', 'value': y }

As you can see, the function source is placed directly in the YAML configuration file.
The function takes x and y as input and yields two (key, value) rows: one with key 'Sum of x' and value x, and one with key 'Sum of y' and value y.
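If you want to try the map logic outside Greenplum first, the same body can be wrapped in an ordinary Python generator. This is only a local sketch of what the FUNCTION block does, not how gpmapreduce actually invokes it:

# Local sketch: the FUNCTION body above behaves like this generator.
def my_map_function(x, y):
    yield {'key': 'Sum of x', 'value': x}
    yield {'key': 'Sum of y', 'value': y}

for row in my_map_function(3, 57.0):
    print(row)
# {'key': 'Sum of x', 'value': 3}
# {'key': 'Sum of y', 'value': 57.0}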
### Reduce Function
A Reduce function takes a set of rows as input and produces *a single* reduced row.
There are several predefined Reduce functions included in Greenplum (a sketch of a user-defined reducer follows the list).
Here’s the list:
* IDENTITY – returns (key, value) pairs unchanged
* SUM – calculates the sum of numeric data
* AVG – calculates the average of numeric data
* COUNT – calculates the count of input data
* MIN – calculates minimum value of numeric data
* MAX – calculates maximum value of numeric data
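The built-ins cover the common aggregations, but the specification also lets you assemble your own reducer from a TRANSITION function that folds each incoming value into a running state. The snippet below is only a sketch of what a hand-rolled equivalent of SUM could look like under DEFINE; the exact fields are described in the Admin Guide (Chapter 23), so treat names like sum_transition and my_sum as illustrative:

  # Sketch only: a user-defined reducer roughly equivalent to the built-in SUM.
  - TRANSITION:
      NAME: sum_transition
      LANGUAGE: PYTHON
      PARAMETERS: [state float, value float]
      RETURNS: [state float]
      FUNCTION: return state + value
  - REDUCE:
      NAME: my_sum
      TRANSITION: sum_transition

For this article, though, the built-in SUM is all we need.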
Let’s apply a REDUCE function to our input data, so append this to test.yml:

EXECUTE:
  - RUN:
      SOURCE: my_input_data
      MAP: my_map_function
      REDUCE: SUM

This runs our map function over the input data and sums the values for each key. It is not a very useful computation in practice, but it is enough to show the Reduce step in action and get you started.
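To see exactly what this job computes, here is plain SQL that produces an equivalent result on the same table (my own comparison for illustration, not something generated by gpmapreduce):

SELECT 'Sum of x' AS key, sum(x)::float AS value FROM mydata
UNION ALL
SELECT 'Sum of y', sum(y) FROM mydata;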
Ok, let’s see the complete test.yml:

%YAML 1.1
---
VERSION: 1.0.0.1
DATABASE: test_database
USER: gpadmin
HOST: localhost
DEFINE:
  - INPUT:
      NAME: my_input_data
      QUERY: SELECT x, y FROM mydata
  - MAP:
      NAME: my_map_function
      LANGUAGE: PYTHON
      PARAMETERS: [x integer, y float]
      RETURNS: [key text, value float]
      FUNCTION: |
        yield {'key': 'Sum of x', 'value': x }
        yield {'key': 'Sum of y', 'value': y }
EXECUTE:
  - RUN:
      SOURCE: my_input_data
      MAP: my_map_function
      REDUCE: SUM

Remember that YAML does not use TABS!
It is now possible to execute this MapReduce job by simply running:

$ gpmapreduce -f test.yml

Your results will most likely differ from these, due to the use of the random() function during data generation.
Here’s mine:

mapreduce_2508_run_1
key     |value
--------+-----
Sum of x|   15
Sum of y|  278
(2 rows)

Exactly the sums of all x and y values from the input table mydata.
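As a quick sanity check, you can compute the same totals directly with psql; the numbers will match whatever random() generated for your mydata table:

$ psql -c "SELECT sum(x) AS sum_x, sum(y) AS sum_y FROM mydata"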
In conclusion, this is just a brief taste of how MapReduce works in Greenplum.
MapReduce is a complex and wide topic, and its usage is growing in popularity every day.
Greenplum has excellent support for it, allowing business analytics users to take advantage of the shared-nothing architecture by executing map/reduce functions in a distributed way on distributed datasets.

Tags: greenplum, mapreduce, postgres, PostgreSQL