Carlo Ascani

Mapreduce in Greenplum 4.1 – 2nd part

November 17, 2011 / in Greenplum / by Carlo Ascani

In this article, we are going to complete the MapReduce job started in the [previous article](https://www.2ndquadrant.com/en/2011/10/mapreduce-in-greenplum.html).


## Taking up the problem from the previous article
In the [previous article](https://www.2ndquadrant.com/en/2011/10/mapreduce-in-greenplum.html), we left with this MapReduce configuration file:

```yaml
%YAML 1.1
---
VERSION: 1.0.0.1
DATABASE: test_database
USER: gpadmin
HOST: localhost
DEFINE:
  - INPUT:
      NAME: my_input_data
      QUERY: SELECT x, y FROM my_data
  - MAP:
      NAME: my_map_function
      LANGUAGE: PYTHON
      PARAMETERS: [ x integer, y float ]
      RETURNS: [ key text, value float ]
      FUNCTION: |
        yield {'key': 'Sum of x', 'value': x }
        yield {'key': 'Sum of y', 'value': y }
EXECUTE:
  - RUN:
      SOURCE: my_input_data
      MAP: my_map_function
      REDUCE: SUM
```

Which produces the following output:

```
key     |value
--------+-----
Sum of x|   15
Sum of y|  278
(2 rows)
```
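To make the aggregation concrete, here is a small plain-Python simulation of what the job does. The sample rows below are hypothetical (the real contents of my_data come from the previous article); they are chosen only so that the sums match the output above:

```python
from collections import defaultdict

# Hypothetical sample rows (x, y), chosen so that
# sum(x) = 15 and sum(y) = 278, matching the job output above.
rows = [(1, 27.0), (2, 55.0), (3, 14.0), (4, 82.0), (5, 100.0)]

def my_map_function(x, y):
    # Same logic as the PYTHON map function in the YAML file:
    # emit one (key, value) pair per column.
    yield {'key': 'Sum of x', 'value': x}
    yield {'key': 'Sum of y', 'value': y}

# The built-in SUM reducer groups the emitted pairs by key
# and adds up their values.
sums = defaultdict(float)
for x, y in rows:
    for pair in my_map_function(x, y):
        sums[pair['key']] += pair['value']

print(dict(sums))  # {'Sum of x': 15.0, 'Sum of y': 278.0}
```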

In short, that job sums all the values from two different columns of a test table.
Our goal here is to divide these two sums, namely 15 and 278.
Let's first check the result with a calculator, just to be sure that the MapReduce job will return the correct value:

```
$ psql -c "SELECT 15/278::FLOAT AS result" test_database
result
0.0539568345323741
(1 row)
```

Yes, we use Greenplum as a calculator :).
## Introducing “tasks”
What we are doing here is to define a separate task that performs the sum.
We will use the result of that task as input for a query that actually does the division step.
Let’s see it in practice.
* Remove the EXECUTE part from test.yml, that is, these lines:

```yaml
EXECUTE:
  - RUN:
      SOURCE: my_input_data
      MAP: my_map_function
      REDUCE: SUM
```

* Define a task, which is responsible for computing the sum of the _x_ and _y_ values. To do that, it reuses the map function defined earlier.
Append this to test.yml:

```yaml
  - TASK:
      NAME: sums
      SOURCE: my_input_data
      MAP: my_map_function
      REDUCE: SUM
```

The useful characteristic of tasks is that they can be used as input for further processing stages.
* Define the step that actually performs the division. It is a SQL SELECT that uses the task defined earlier as its input. Append this to test.yml:

```yaml
  - INPUT:
      NAME: division
      QUERY: |
        SELECT
          (SELECT value FROM sums where key = 'Sum of x') /
          (SELECT value FROM sums where key = 'Sum of y')
        AS final_division;
```

As you can see, the FROM clause contains the name of the task defined above: sums.
* Finally, execute the job and display the output. Append this to test.yml:

```yaml
EXECUTE:
  - RUN:
      SOURCE: division
      TARGET: STDOUT
```

This step runs the _division_ query and displays the result on standard output.
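Conceptually, the division step just consumes the two rows produced by the task. Here is a minimal Python sketch of this last step, starting from the two sums already computed earlier in the article:

```python
# The "sums" task produces two rows, as shown earlier in the article.
sums = {'Sum of x': 15.0, 'Sum of y': 278.0}

# The "division" INPUT selects the two values and divides them,
# just like the SQL query above.
final_division = sums['Sum of x'] / sums['Sum of y']
print(final_division)  # ~0.0539568345323741, matching the psql check above
```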
## Put everything together
This is the complete test.yml file:

```yaml
%YAML 1.1
---
VERSION: 1.0.0.1
DATABASE: test_database
USER: gpadmin
HOST: localhost
DEFINE:
  - INPUT:
      NAME: my_input_data
      QUERY: SELECT x, y FROM my_data
  - MAP:
      NAME: my_map_function
      LANGUAGE: PYTHON
      PARAMETERS: [ x integer, y float ]
      RETURNS: [ key text, value float ]
      FUNCTION: |
        yield {'key': 'Sum of x', 'value': x }
        yield {'key': 'Sum of y', 'value': y }
  - TASK:
      NAME: sums
      SOURCE: my_input_data
      MAP: my_map_function
      REDUCE: SUM
  - INPUT:
      NAME: division
      QUERY: |
        SELECT
          (SELECT value FROM sums where key = 'Sum of x') /
          (SELECT value FROM sums where key = 'Sum of y')
        AS final_division;
EXECUTE:
  - RUN:
      SOURCE: division
      TARGET: STDOUT
```

Execute the whole job with:

```
$ gpmapreduce -f test.yml
mapreduce_2235_run_1
final_division
0.0539568345323741
(1 row)
```

Compare it with the calculator result: it matches.
## Conclusion
The task is complete: we have correctly calculated sum(x)/sum(y).
The power of MapReduce lies mainly in the number of servers involved in the computation:
many servers each perform a small calculation to produce the final result.
You will probably not notice the power of MapReduce on such a small example, but this is a good starting point.

Tags: greenplum, mapreduce, PostgreSQL