For a while now I have been thinking about adding Greenplum support to htMiner, an open-source web analytics application that I wrote a few years ago and that uses PostgreSQL.
To do this, I need a multi-CPU environment. While waiting for our new servers to be installed in our data centre in Italy, I decided to look at Amazon’s Elastic Compute Cloud (EC2) infrastructure. My intention is to do some benchmarking and spot the main performance differences between Greenplum Single Node Edition and PostgreSQL 8.4, my favourite DBMS.
If you wish to follow this article, you need to have an Amazon AWS account with a valid credit card. Do not worry, this test will only cost you a couple of dollars!
Greenplum is one of the most advanced solutions for data warehousing and analytics. It is based on a shared-nothing architecture and allows for data distribution and parallel processing across several nodes (servers).
The Single Node Edition (SNE) is a freely distributed version of Greenplum that can be installed on a single node. On a multi-processor architecture, it lets you create multiple segments (usually one per core) and hence take advantage of parallel processing. Greenplum Single Node Edition can be downloaded for free from the main website.
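As a rough sketch of the "one segment per core" rule of thumb, the following snippet counts the CPUs the operating system reports and uses that number as the segment count. The variable names are my own, and treating the core count as the segment count is only the rule of thumb mentioned above, not a Greenplum requirement.

```shell
#!/bin/sh
# Count the CPUs currently online, as reported by the C library.
cores=$(getconf _NPROCESSORS_ONLN)

# Assumption: one segment per core, following the rule of thumb.
segments=$cores

echo "cores=$cores segments=$segments"
```

On the Large Instance described below, which exposes 2 virtual cores, this suggests 2 segments.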
My intention is to install it on a Large Instance running CentOS Linux 5.4 on Amazon. EC2’s large instance has the following characteristics:
- 7.5 GB of memory
- 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each)
- 850 GB of local instance storage
- 64-bit platform
I also decided to get a 10 GB Elastic Block Store (EBS) volume (1 dollar a month), which I will format using the XFS file system. This volume will contain the Greenplum data directories (this time I will try with a single volume; next time I will try with one volume per segment).
The first step is to log into your Amazon AWS management console. Get your 10 GB EBS volume and then launch a large instance using the ami-ebe4cf9f AMI (AMI stands for Amazon Machine Image), a CentOS 5.4 image distributed by RightScale for a 64-bit architecture. Your AMI ID may differ, as I use a server based in Europe.
I then attach the created volume to the instance I just started. The management console informs me that the volume has been attached on /dev/sdf. I grab the public DNS information and connect to the server via ssh as root, using my EC2 identity.
I install the YUM packages for XFS support, by running:
yum install kmod-xfs.x86_64 xfsprogs xfsdump
I create a primary partition on /dev/sdf using fdisk and format it:
mkfs -t xfs /dev/sdf1
I then add the following entry to /etc/fstab:
/dev/sdf1 /greenplum xfs noatime 0 0
and mount the partition on the /greenplum mount point:
mkdir /greenplum
mount /greenplum
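If you repeat these steps on a fresh instance more than once, it helps to add the mount entry only when it is not already present. Here is a minimal sketch, assuming the entry belongs in /etc/fstab; the helper name add_fstab_entry is hypothetical and takes the file as a parameter so that it can be tried on a scratch file first.

```shell
#!/bin/sh
# add_fstab_entry FILE ENTRY
# Append ENTRY to FILE unless an identical line is already there.
add_fstab_entry() {
    fstab_file=$1
    entry=$2
    # -q: quiet, -x: match the whole line, -F: treat the entry as a fixed string
    if ! grep -qxF -- "$entry" "$fstab_file"; then
        printf '%s\n' "$entry" >> "$fstab_file"
    fi
}
```

On the server, as root, it would be called as add_fstab_entry /etc/fstab '/dev/sdf1 /greenplum xfs noatime 0 0'. Calling it a second time leaves the file unchanged.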
Download Greenplum’s Quickstart guide from the download area. Grab the URL of the 64-bit Red Hat installation package of Greenplum and download it on the EC2 server using wget (or upload it from your computer).
Follow the instructions in the Quickstart guide about preparing your system for Greenplum (in particular kernel settings and limits).
Unzip Greenplum’s zip file and execute the .bin file. Answer yes to all the questions; at the end of the process, Greenplum is installed in the /usr/local/greenplum-db directory. Create the gpadmin user and set its password:
useradd gpadmin
passwd gpadmin
Prepare the data directories for the master and the segments:
mkdir -p /greenplum/master
mkdir -p /greenplum/segment1
mkdir -p /greenplum/segment2
chown -R gpadmin:gpadmin /greenplum
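With more than two segments, the same layout can be created in a loop. This is only a sketch: the function name make_data_dirs is hypothetical, and it assumes the master/segmentN naming used above.

```shell
#!/bin/sh
# make_data_dirs BASE COUNT
# Create BASE/master and BASE/segment1 .. BASE/segmentCOUNT.
make_data_dirs() {
    base=$1
    count=$2
    mkdir -p "$base/master"
    i=1
    while [ "$i" -le "$count" ]; do
        mkdir -p "$base/segment$i"
        i=$((i + 1))
    done
}
```

For the setup in this article the call would be make_data_dirs /greenplum 2, followed by the same chown -R gpadmin:gpadmin /greenplum as above.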
Become gpadmin using the su command and add source /usr/local/greenplum-db/greenplum_path.sh to gpadmin’s ~/.bashrc file. Load these settings. Edit the ~/single_host_file file, add localhost to its contents and launch:
gpssh-exkeys -f ~/single_host_file
Create the ~/gp_init_config file with the following content:
ARRAY_NAME="Greenplum"
MACHINE_LIST_FILE=/home/gpadmin/single_host_file
SEG_PREFIX=gp
PORT_BASE=50000
declare -a DATA_DIRECTORY=(/greenplum/segment1 /greenplum/segment2)
MASTER_HOSTNAME=localhost
MASTER_DIRECTORY=/greenplum/master
MASTER_PORT=5432
ENCODING=UNICODE
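The only part of this file that changes with the number of segments is the DATA_DIRECTORY list, so the file can be generated rather than typed. The sketch below prints the same settings for a given base directory and segment count. The function name print_gp_init_config is hypothetical, and the values are simply the ones used in this article.

```shell
#!/bin/sh
# print_gp_init_config BASE COUNT
# Print a gp_init_config with COUNT segment directories under BASE.
print_gp_init_config() {
    base=$1
    count=$2
    dirs=""
    i=1
    while [ "$i" -le "$count" ]; do
        # Separate entries with a single space, without a leading space.
        dirs="$dirs${dirs:+ }$base/segment$i"
        i=$((i + 1))
    done
    cat <<EOF
ARRAY_NAME="Greenplum"
MACHINE_LIST_FILE=/home/gpadmin/single_host_file
SEG_PREFIX=gp
PORT_BASE=50000
declare -a DATA_DIRECTORY=($dirs)
MASTER_HOSTNAME=localhost
MASTER_DIRECTORY=$base/master
MASTER_PORT=5432
ENCODING=UNICODE
EOF
}
```

Running print_gp_init_config /greenplum 2 > ~/gp_init_config as gpadmin should reproduce the file shown above.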
Finally, initialise the Greenplum system:
gpinitsystem -c ~/gp_init_config
At the end of the process, Greenplum SNE is installed on your Amazon EC2 server running CentOS 5.4. On this server you can test the solution at quite a reasonable price (I was on the server for 7 hours today and I spent only 3 dollars).
I will post a few more articles on this topic in the next few days, and hopefully I will be able to post the first benchmarks too. Enjoy!