MapReduce in Greenplum 4.1 – part 2

## Take up the problem from the previous article
In the [previous article](https://www.2ndquadrant.com/en/2011/10/mapreduce-in-greenplum.html), we left with this MapReduce configuration file:

```yaml
%YAML 1.1
---
VERSION: 1.0.0.1
DATABASE: test_database
HOST: localhost
DEFINE:
  - INPUT:
      NAME: my_input_data
      QUERY: SELECT x, y FROM my_data
  - MAP:
      NAME: my_map_function
      LANGUAGE: PYTHON
      PARAMETERS: [ x integer, y float ]
      RETURNS: [ key text, value float ]
      FUNCTION: |
        yield {'key': 'Sum of x', 'value': x}
        yield {'key': 'Sum of y', 'value': y}
EXECUTE:
  - RUN:
      SOURCE: my_input_data
      MAP: my_map_function
      REDUCE: SUM
```

This configuration produces the following output:

```
key      | value
---------+------
Sum of x |   15
Sum of y |  278
(2 rows)
```
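What the map and reduce stages do here can be sketched in plain Python. The rows of `my_data` below are hypothetical (the original table's contents are not shown in the article), chosen so that the sums match the output above:

```python
from collections import defaultdict

# Hypothetical contents of my_data: x sums to 15, y sums to 278,
# matching the result table shown above.
my_data = [(1, 100.0), (2, 80.0), (3, 50.0), (4, 30.0), (5, 18.0)]

def my_map_function(x, y):
    # Mirrors the PYTHON map function in the YAML file:
    # one (key, value) pair per column, per input row.
    yield ('Sum of x', float(x))
    yield ('Sum of y', y)

# REDUCE: SUM groups the emitted pairs by key and adds up the values.
sums = defaultdict(float)
for x, y in my_data:
    for key, value in my_map_function(x, y):
        sums[key] += value

print(dict(sums))  # {'Sum of x': 15.0, 'Sum of y': 278.0}
```

In the real job, Greenplum distributes the map calls across the segment servers and merges the partial sums; the logic per row is the same.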

In short, that job sums all values from two different columns of a test table.
Our goal now is to divide these two results, namely `15` by `278`.
Let's first compute the expected result with a calculator, just to be sure that the MapReduce job will return the correct value:

```
$ psql -c "SELECT 15/278::FLOAT AS result" test_database
       result
--------------------
 0.0539568345323741
(1 row)
```
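As a side note, the `::FLOAT` cast in the query above binds only to `278`, but that is enough: once one operand is a float, the division is performed in floating point (a plain integer `15/278` would truncate to `0`). The same sanity check in Python, where `/` is already true division:

```python
# In SQL, integer division 15 / 278 would yield 0; casting either
# operand to FLOAT gives the real quotient. Python 3's / is always
# true division, so no cast is needed here.
result = 15 / 278
print(f"{result:.16f}")  # ≈ 0.0539568345323741
```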

Yes, we use Greenplum as a calculator :).
The plan is to define a separate task that performs the sums, and then use the result of that task as input for a query that actually performs the division.
Let's see it in practice.
* Remove the `EXECUTE` section from `test.yml`, in detail these lines:

```yaml
EXECUTE:
  - RUN:
      SOURCE: my_input_data
      MAP: my_map_function
      REDUCE: SUM
```

* Define a task, which is responsible for executing the sum of the _x_ and _y_ values. To do that, it reuses the old map function.
Append this to `test.yml`:

```yaml
  - TASK:
      NAME: sums
      SOURCE: my_input_data
      MAP: my_map_function
      REDUCE: SUM
```

The useful characteristic of tasks is that they can be used as input for further processing stages.
* Define the step that actually performs the division. It is an SQL `SELECT` that uses the task defined earlier as input. Append this to `test.yml`:

```yaml
  - INPUT:
      NAME: division
      QUERY: |
        SELECT
          (SELECT value FROM sums WHERE key = 'Sum of x') /
          (SELECT value FROM sums WHERE key = 'Sum of y')
        AS final_division;
```

As you can see, the `FROM` clause contains the name of the task defined above: `sums`.
* Finally, execute the job and display its output. Append this to `test.yml`:

```yaml
EXECUTE:
  - RUN:
      SOURCE: division
      TARGET: STDOUT
```

This step runs the _division_ query and displays the result on standard output.
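The whole two-stage pipeline, a task whose result set feeds the division step, can be sketched in plain Python. As before, the `my_data` rows are hypothetical, chosen to match the sums above:

```python
# Hypothetical rows of my_data, chosen so that sum(x) = 15 and sum(y) = 278.
my_data = [(1, 100.0), (2, 80.0), (3, 50.0), (4, 30.0), (5, 18.0)]

def my_map_function(x, y):
    yield ('Sum of x', float(x))
    yield ('Sum of y', y)

def run_task(source, mapper):
    """TASK stage: map every row, then REDUCE: SUM by key."""
    sums = {}
    for row in source:
        for key, value in mapper(*row):
            sums[key] = sums.get(key, 0.0) + value
    return sums

# The INPUT named 'division' reads from the task's result set,
# just as the SQL subqueries read FROM sums.
sums = run_task(my_data, my_map_function)
final_division = sums['Sum of x'] / sums['Sum of y']
# final_division ≈ 0.0539568345323741, matching the psql check above
```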
## Put everything together
This is the complete `test.yml` file:

```yaml
%YAML 1.1
---
VERSION: 1.0.0.1
DATABASE: test_database
HOST: localhost
DEFINE:
  - INPUT:
      NAME: my_input_data
      QUERY: SELECT x, y FROM my_data
  - MAP:
      NAME: my_map_function
      LANGUAGE: PYTHON
      PARAMETERS: [ x integer, y float ]
      RETURNS: [ key text, value float ]
      FUNCTION: |
        yield {'key': 'Sum of x', 'value': x}
        yield {'key': 'Sum of y', 'value': y}
  - TASK:
      NAME: sums
      SOURCE: my_input_data
      MAP: my_map_function
      REDUCE: SUM
  - INPUT:
      NAME: division
      QUERY: |
        SELECT
          (SELECT value FROM sums WHERE key = 'Sum of x') /
          (SELECT value FROM sums WHERE key = 'Sum of y')
        AS final_division;
EXECUTE:
  - RUN:
      SOURCE: division
      TARGET: STDOUT
```

Execute the whole job with:

```
$ gpmapreduce -f test.yml
mapreduce_2235_run_1
   final_division
--------------------
 0.0539568345323741
(1 row)
```

Compare it with the calculator result: it matches.
## Conclusion
The task is complete: we have calculated `sum(x)/sum(y)` correctly.
The power of MapReduce lies mainly in the number of servers involved in the calculation:
many servers each perform a small computation, and the partial results are combined into the final one.
You will probably not notice the power of MapReduce on such a small example, but this is a good starting point.
