Greenplum CE 4.2 XML support

In this article we will see one of the most interesting features added in Greenplum 4.2:
XML datatype support.

Introduction

This article has an educational goal. Most likely you will not want to use the code presented here in a production system. However, it will give you hints and interesting ideas on how to add XML capabilities (both for loading and – most importantly – for searching) to your Greenplum database.

My intention is to show the potential of XML data type and XML data processing in a concurrent environment.

Goals

We will load XML data from a file into a Greenplum table in a XML field.

We will then perform some XPath queries on that data.

The xpath function operates on XML data, we will write a demo wrapper function which works on a XML file.

For the sake of this example, the XML file must reside on the server: we will use indeed the COPY command to load data from it.

Can I see some code?

Download and extract a dataset in XML format directly from the blog:

$ wget https://www.2ndquadrant.com/wp-content/uploads/2012/07/100songs.tar.gz
$ tar xvf 100songs.tar.gz

Copy the following code in a file named xml.sql:

BEGIN;

CREATE TABLE t (id INTEGER, data XML)
        DISTRIBUTED BY (id);

CREATE OR REPLACE FUNCTION _xpath(
        _xml_file VARCHAR,
        _xpath_query VARCHAR
)
RETURNS TEXT AS
$BODY$
DECLARE
        _is_well_formed BOOLEAN;
        _result TEXT;
BEGIN

        CREATE TEMP TABLE tmp (d TEXT) distributed by (d);
        EXECUTE $$ COPY tmp FROM $$ ||  quote_literal(_xml_file) ;

        INSERT INTO t (data) SELECT xml(d) FROM tmp;
        DROP TABLE tmp;

        SELECT INTO _is_well_formed xml_is_well_formed(text(data)) FROM t;
        IF _is_well_formed = False THEN
                RETURN 'XML document is not well formed';
        END IF;

        SELECT INTO _result xpath(_xpath_query, data ) FROM t;
        RETURN _result;
END;
$BODY$ LANGUAGE plpgsql;

SELECT _xpath('/greenplum/tests/data/100songs.xml',
   '/response/row/row[artist="Tullio De Piscopo"]/title/text()') AS song;

ROLLBACK;

Execute the SQL script with:

$ psql -qtxf xml.sql

You should see:

song | {"Stop Bajon (Primavera)"}

Code in detail

First of all, we have to know the XML document structure:

<response>
    <row>
        <row _id=... _uuid=... _position=... _address=...>
            <theme>...</theme>
            <title>...</title>
            <artist>...</artist>
            <year>...</year>
            <spotify_url url=.../>
        </row>
        ...
    </row>
</response>

The first part of the code creates a table to store XML data:

CREATE TABLE t (id INTEGER, data XML)
        DISTRIBUTED BY (id);

In the function body, we create a temporary table to store the file content as a single TEXT:

CREATE TEMP TABLE tmp (d TEXT) distributed by (d);
EXECUTE $$ COPY tmp FROM $$ ||  quote_literal(_xml_file) ;

Then, we convert the TEXT data into XML data using the xml function:

INSERT INTO t (data) SELECT xml(d) FROM tmp;
DROP TABLE tmp;

And we check if the XML is well formed:

SELECT INTO _is_well_formed xml_is_well_formed(text(data)) FROM t;
IF _is_well_formed = False THEN
    RETURN 'XML document is not well formed';
END IF;

Lastly, we perform the XPath query and return the result:

SELECT INTO _result xpath(_xpath_query, data ) FROM t;
RETURN _result;

The last part of the script shows how to call the function to retrieve data:

-- Get the Tullio De Piscopo's song from the list
SELECT _xpath('/greenplum/tests/data/100songs.xml', '/response/row/row[artist="Tullio De Piscopo"]/title/text()') AS song;

Conclusions

XPath is a technology of mixed feeling. Many hate it, some love it. However, when working with XML structured data type, it is a must.

The good thing with Greenplum 4.2 is that it is now possible to load millions of XML documents in a Greenplum database, have the data distributed in multiples segments and take advantage of parallelism when retrieving data using XPath expressions.

2ndQuadrant is now part of EDB

Greenplum CE 4.2 XML support

Introduction

Goals

Can I see some code?

Code in detail

Conclusions

Leave a Reply

Leave a Reply Cancel reply

Support & Services

Products

Postgres Learning Center

About Us