Greenplum CE 4.2 XML support
In this article we will see one of the most interesting features added in Greenplum 4.2:
XML datatype support.
Introduction
This article has an educational goal. Most likely you will not want to use the code presented here in a production system. However, it will give you hints and interesting ideas on how to add XML capabilities (both for loading and – most importantly – for searching) to your Greenplum database.
My intention is to show the potential of XML data type and XML data processing in a concurrent environment.
Goals
We will load XML data from a file into a Greenplum table in a XML field.
We will then perform some XPath queries on that data.
The xpath
function operates on XML data, we will write a demo wrapper function which works on a XML file.
For the sake of this example, the XML file must reside on the server: we will use indeed the COPY
command to load data from it.
Can I see some code?
Download and extract a dataset in XML format directly from the blog:
$ wget https://www.2ndquadrant.com/wp-content/uploads/2012/07/100songs.tar.gz $ tar xvf 100songs.tar.gz
Copy the following code in a file named xml.sql:
BEGIN; CREATE TABLE t (id INTEGER, data XML) DISTRIBUTED BY (id); CREATE OR REPLACE FUNCTION _xpath( _xml_file VARCHAR, _xpath_query VARCHAR ) RETURNS TEXT AS $BODY$ DECLARE _is_well_formed BOOLEAN; _result TEXT; BEGIN CREATE TEMP TABLE tmp (d TEXT) distributed by (d); EXECUTE $$ COPY tmp FROM $$ || quote_literal(_xml_file) ; INSERT INTO t (data) SELECT xml(d) FROM tmp; DROP TABLE tmp; SELECT INTO _is_well_formed xml_is_well_formed(text(data)) FROM t; IF _is_well_formed = False THEN RETURN 'XML document is not well formed'; END IF; SELECT INTO _result xpath(_xpath_query, data ) FROM t; RETURN _result; END; $BODY$ LANGUAGE plpgsql; SELECT _xpath('/greenplum/tests/data/100songs.xml', '/response/row/row[artist="Tullio De Piscopo"]/title/text()') AS song; ROLLBACK;
Execute the SQL script with:
$ psql -qtxf xml.sql
You should see:
song | {"Stop Bajon (Primavera)"}
Code in detail
First of all, we have to know the XML document structure:
<response> <row> <row _id=... _uuid=... _position=... _address=...> <theme>...</theme> <title>...</title> <artist>...</artist> <year>...</year> <spotify_url url=.../> </row> ... </row> </response>
The first part of the code creates a table to store XML data:
CREATE TABLE t (id INTEGER, data XML) DISTRIBUTED BY (id);
In the function body, we create a temporary table to store the file content as a single TEXT:
CREATE TEMP TABLE tmp (d TEXT) distributed by (d); EXECUTE $$ COPY tmp FROM $$ || quote_literal(_xml_file) ;
Then, we convert the TEXT data into XML data using the xml
function:
INSERT INTO t (data) SELECT xml(d) FROM tmp; DROP TABLE tmp;
And we check if the XML is well formed:
SELECT INTO _is_well_formed xml_is_well_formed(text(data)) FROM t; IF _is_well_formed = False THEN RETURN 'XML document is not well formed'; END IF;
Lastly, we perform the XPath query and return the result:
SELECT INTO _result xpath(_xpath_query, data ) FROM t; RETURN _result;
The last part of the script shows how to call the function to retrieve data:
-- Get the Tullio De Piscopo's song from the list SELECT _xpath('/greenplum/tests/data/100songs.xml', '/response/row/row[artist="Tullio De Piscopo"]/title/text()') AS song;
Conclusions
XPath is a technology of mixed feeling. Many hate it, some love it. However, when working with XML structured data type, it is a must.
The good thing with Greenplum 4.2 is that it is now possible to load millions of XML documents in a Greenplum database, have the data distributed in multiples segments and take advantage of parallelism when retrieving data using XPath expressions.
Leave a Reply
Want to join the discussion?Feel free to contribute!