We will start today with Talend but, in the next weeks, we will proceed with Kettle (Pentaho Data Integration).
In this tutorial we will learn how to import data on a Greenplum database from two CSV files and how to execute a lookup operation, using Talend. For the sake of simplicity, we assume that:
- you have a very simple test database in Greenplum with two tables and a logical 1:N relationship between them: we chose states and users (a user can live in just one state)
- you have two CSV files with the same data structure, one for the states and one for the users
The objective of this article is to demonstrate the concepts using a simple example, so that you can then apply the same techniques to more complex scenarios.
The first step is to download Talend Open Studio ( you can find it on the Talend Site ) and to install it.
Then we need to create a connection to Greenplum. From the Metadata section of Talend’s interface, add a Greenplum connection and insert the credentials. When done, a new object of type DbConnection will be created.
Right-click on it then select “Retrieve Schema“, so that Talend can gather all the needed information about the structure of the users and states tables.
Now it’s time to add the files that contain the data. Immediately below the “Db Connections” icon, you can find the group of delimited files. Click on “Create file delimited“.
In the first screenshot of the wizard, you will be asked to insert a name and a description for the resource being created. In step two, you need to locate the file and set the encoding.
In the third step, depicted in the picture below, you need to define row parsing methods.
You can select the encoding (1), field and row separators (2) and, if necessary, header rows or rows to be skipped (3) – either at the beginning or the end of the file.
It is also possible to ask Talend to auto-detect the structure of the file, by using the first row as header (4).
Click on “Next” then check that everything is fine and finish the process. You will have your file correctly associated to the project. Perform the above operation for both the users and states files. Once you have done, you can proceed with the creation of a job. Drag both files on the job, then select tFileInputDelimited as file type.
From the previously created Greenplum, drag the two tables (in our case states and users) in the editor, then select tGreenplumOutput as type.