Installing Greenplum Single Node Edition on Amazon’s EC2
I have been thinking for a while now about adding Greenplum support to an open-source application for web analytics that I wrote a few years ago, which is called htMiner and uses PostgreSQL.
In order to do this, I need a multi-CPU environment. While waiting for our new servers to be installed in our data centre in Italy, I decided to look at Amazon’s Elastic Compute Cloud (EC2) infrastructure. My intention is to do some benchmarking and spot the main performance differences between Greenplum Single Node Edition and PostgreSQL 8.4, my favourite DBMS.
If you wish to follow this article, you need to have an Amazon AWS account with a valid credit card. Do not worry, this test will only cost you a couple of dollars!
Greenplum SNE is a free version of the Greenplum database, one of the most advanced solutions for data warehousing and analytics, which is based on a shared nothing architecture and allows for data distribution and parallel processing on several nodes (servers).
The Single Node Edition is limited to installation on a single node. On a multi-processor architecture, it lets you create multiple segments (usually one per core) and hence take advantage of parallel processing. Greenplum Single Node Edition can be downloaded for free from the main website.
My intention is to install it on a Large Instance running CentOS Linux 5.4 on Amazon. EC2’s large instance has the following characteristics:
- 7.5 GB of memory
- 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each)
- 850 GB of local instance storage
- 64-bit platform
I also decided to get a 10GB volume of Elastic Block Store (1 dollar a month), which I will format using the XFS file system. This volume will contain Greenplum data directories (this time I will try with just one single volume – next time I will try with a volume per segment).
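Since the number of segments usually follows the number of cores, it can help to check how many cores the instance actually exposes before planning the data directories. The following is a minimal sketch, not part of the Greenplum tooling; the function name count_cores is my own, and it uses getconf (falling back to /proc/cpuinfo) because nproc may be missing on older distributions.

```shell
# Count the online CPU cores. getconf is tried first; /proc/cpuinfo is a
# fallback for systems where getconf is not available.
count_cores() {
    getconf _NPROCESSORS_ONLN 2>/dev/null || grep -c '^processor' /proc/cpuinfo
}

# Rule of thumb from the text: one segment per core.
segments=$(count_cores)
echo "Suggested number of segments: $segments"
```

On the large instance described above, with its 2 virtual cores, this rule of thumb leads to the two segments used in the rest of this article.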
The first step is to log into your Amazon AWS management console. Get your 10GB EBS volume and then launch a large instance using the ami-ebe4cf9f AMI (AMI stands for Amazon Machine Image), a 64-bit CentOS 5.4 image distributed by RightScale. Your AMI identifier may be different, as I use a Europe-based server.
I then attach the created volume to the instance I just started. The management console informs me that the volume has been attached as /dev/sdf. I grab the public DNS information and connect to the server via ssh as root, using my EC2 identity.
I install the YUM packages for XFS support, by running:
yum install kmod-xfs.x86_64 xfsprogs xfsdump
I create a primary partition on /dev/sdf using fdisk and format it:
mkfs -t xfs /dev/sdf1
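Because mkfs destroys whatever is on the target, I find it useful to run a small guard before formatting. This is only a sketch under my own assumptions: the function name safe_to_format is hypothetical, and it merely checks that the path is a block device and that it does not appear in /proc/mounts. It never runs mkfs itself.

```shell
# Succeed only if the path is a block device that is not currently mounted.
safe_to_format() {
    dev="$1"
    [ -b "$dev" ] || { echo "$dev is not a block device" >&2; return 1; }
    if grep -q "^$dev " /proc/mounts 2>/dev/null; then
        echo "$dev is currently mounted" >&2
        return 1
    fi
    return 0
}

# Only report what would be done; the actual mkfs is left to you.
if safe_to_format /dev/sdf1; then
    echo "ok to run: mkfs -t xfs /dev/sdf1"
else
    echo "not formatting /dev/sdf1"
fi
```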
I then add the entry to /etc/fstab:
/dev/sdf1 /greenplum xfs noatime 0 0
and mount the partition on the /greenplum mount point:
mkdir /greenplum
mount /greenplum
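If you repeat these steps on several instances, the fstab entry can be added in a way that is safe to run more than once. The sketch below is an assumption of mine rather than anything required by Greenplum: add_fstab_entry is a hypothetical helper, and FSTAB defaults to a scratch file so you can try it without touching /etc/fstab.

```shell
# FSTAB defaults to a scratch file; run with FSTAB=/etc/fstab (as root)
# to edit the real file.
FSTAB="${FSTAB:-$(mktemp)}"

# Append an entry only if the mount point is not already listed.
add_fstab_entry() {
    device="$1"; mountpoint="$2"; fstype="$3"; options="$4"
    # Field 2 of each non-comment fstab line is the mount point.
    if awk -v mp="$mountpoint" \
        '$1 !~ /^#/ && $2 == mp { found = 1 } END { exit !found }' "$FSTAB"; then
        echo "entry for $mountpoint already present"
    else
        printf '%s %s %s %s 0 0\n' "$device" "$mountpoint" "$fstype" "$options" >> "$FSTAB"
        echo "added entry for $mountpoint"
    fi
}

add_fstab_entry /dev/sdf1 /greenplum xfs noatime
add_fstab_entry /dev/sdf1 /greenplum xfs noatime   # second call changes nothing
```

After mounting, `df -h /greenplum` is a quick way to confirm that the volume is really mounted where you expect it.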
Download Greenplum’s Quickstart guide from the download area. Grab the URL of the 64-bit RedHat installation of Greenplum and download it from the EC2 server using wget (or upload it from your computer using scp).
Follow the instructions in the quickstart guide about preparing your system for Greenplum (in particular kernel settings and limits).
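Before changing anything, it is handy to see what the system is currently set to. The sketch below just prints a few values so that you can compare them by hand with the guide. The list of parameters (kernel.shmmax, kernel.shmall, kernel.sem) is illustrative and chosen by me; the authoritative names and values are the ones in the quickstart guide, and show_sysctl is a hypothetical helper.

```shell
# Print the current value of a kernel parameter by reading /proc/sys.
show_sysctl() {
    # Map a name like kernel.shmmax to /proc/sys/kernel/shmmax.
    path="/proc/sys/$(echo "$1" | tr . /)"
    if [ -r "$path" ]; then
        printf '%s = %s\n' "$1" "$(cat "$path")"
    else
        printf '%s = (not available)\n' "$1"
    fi
}

# Illustrative parameter list, not taken from the guide.
for param in kernel.shmmax kernel.shmall kernel.sem; do
    show_sysctl "$param"
done

# Limit on open files for the current shell.
echo "open files limit: $(ulimit -n)"
```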
Unzip the Greenplum zip file and execute the .bin file. Answer yes to all the questions; at the end of the process, Greenplum is installed in the /usr/local/greenplum-db directory.
Create the gpadmin user and set the password:
useradd gpadmin
passwd gpadmin
Prepare the data directories for the master and the segments:
mkdir -p /greenplum/master
mkdir -p /greenplum/segment1
mkdir -p /greenplum/segment2
chown -R gpadmin:gpadmin /greenplum
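With more than two segments, typing one mkdir per directory gets tedious. Here is a minimal sketch of the same step as a loop. make_data_dirs, GP_BASE and SEGMENTS are names I made up for the example; GP_BASE defaults to a scratch directory, so set GP_BASE=/greenplum to reproduce the layout used in this article.

```shell
# Defaults: a scratch base directory and two segments.
GP_BASE="${GP_BASE:-$(mktemp -d)}"
SEGMENTS="${SEGMENTS:-2}"

# Create the master directory plus segment1..segmentN under a base directory.
make_data_dirs() {
    base="$1"; count="$2"
    mkdir -p "$base/master"
    i=1
    while [ "$i" -le "$count" ]; do
        mkdir -p "$base/segment$i"
        i=$((i + 1))
    done
    # Hand the tree over to gpadmin only when that user exists and we are root.
    if id gpadmin >/dev/null 2>&1 && [ "$(id -u)" -eq 0 ]; then
        chown -R gpadmin:gpadmin "$base"
    fi
}

make_data_dirs "$GP_BASE" "$SEGMENTS"
ls "$GP_BASE"
```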
Become gpadmin using the su command and add source /usr/local/greenplum-db/greenplum_path.sh to gpadmin’s ~/.bashrc file. Load these settings. Edit the ~/single_host_file file, add localhost to its contents and launch:
gpssh-exkeys -f ~/single_host_file
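The two file edits above can also be scripted so that running them twice does not duplicate lines. This is a sketch under my own assumptions: ensure_line and TARGET_HOME are hypothetical names, and TARGET_HOME defaults to a scratch directory so that nothing in a real home directory is modified unless you ask for it.

```shell
# Add a line to a file only if that exact line is not already there.
ensure_line() {
    line="$1"; file="$2"
    touch "$file"
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Defaults to a scratch directory; set TARGET_HOME to gpadmin's home
# directory (for example /home/gpadmin) to apply the changes for real.
TARGET_HOME="${TARGET_HOME:-$(mktemp -d)}"

ensure_line 'source /usr/local/greenplum-db/greenplum_path.sh' "$TARGET_HOME/.bashrc"
ensure_line 'localhost' "$TARGET_HOME/single_host_file"

cat "$TARGET_HOME/single_host_file"
```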
Create the ~/gp_init_config file with the following content:
ARRAY_NAME="Greenplum"
MACHINE_LIST_FILE=/home/gpadmin/single_host_file
SEG_PREFIX=gp
PORT_BASE=50000
declare -a DATA_DIRECTORY=(/greenplum/segment1 /greenplum/segment2)
MASTER_HOSTNAME=localhost
MASTER_DIRECTORY=/greenplum/master
MASTER_PORT=5432
ENCODING=UNICODE
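If you want to experiment with a different number of segments, the same file can be generated rather than edited by hand. The function below is a hypothetical helper of mine (generate_gp_init_config is not a Greenplum utility); it only prints the settings shown above, with DATA_DIRECTORY built from the segment count.

```shell
# Print a gp_init_config for a given number of segments.
# Arguments: base directory, number of segments, path to the host file.
generate_gp_init_config() {
    base="$1"; count="$2"; hostfile="$3"
    dirs=""
    i=1
    while [ "$i" -le "$count" ]; do
        # Separate entries with a single space, with no leading space.
        dirs="$dirs${dirs:+ }$base/segment$i"
        i=$((i + 1))
    done
    cat <<EOF
ARRAY_NAME="Greenplum"
MACHINE_LIST_FILE=$hostfile
SEG_PREFIX=gp
PORT_BASE=50000
declare -a DATA_DIRECTORY=($dirs)
MASTER_HOSTNAME=localhost
MASTER_DIRECTORY=$base/master
MASTER_PORT=5432
ENCODING=UNICODE
EOF
}

generate_gp_init_config /greenplum 2 /home/gpadmin/single_host_file
```

Redirecting the output to ~/gp_init_config gives the file used in the next step. The segment directories listed in DATA_DIRECTORY must already exist and belong to gpadmin.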
Finally launch:
gpinitsystem -c ~/gp_init_config
At the end of the process, Greenplum SNE is installed on your Amazon EC2 server running CentOS 5.4. On this server you can test the solution at quite a reasonable price (I was on the server for 7 hours today and I spent only 3 dollars).
I will post a few more articles on this topic in the next few days, and hopefully I will be able to post the first benchmarks too. Enjoy!
awesome article !!
Few things you might want to clarify for the newbie:
If you are accessing AWS from windows, do the following:
Install Putty: required to make ssh connection
Install PuttyGen: to convert the AWS private key to Putty private key (.ppk key)
Install pscp: required to transfer greenplum zip file from your windows machine to linux instance created on AWS.
Install link: http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
Article to convert AWS key to putty key (required before you can fire ssh):
http://docs.amazonwebservices.com/AWSEC2/latest/DeveloperGuide/index.html?generating-a-keypair.html
Also remember to have Source(IP or Group) of 0.0.0.0/0 in ‘Security Groups’ in AWS console. (Connection method: SSH, protocol TCP, from and to port both are 22)
Command for secure copy of greenplum .zip file from your windows to linux is:
pscp -i green* [email protected]:/greenplum
I have to install the Greenplum DB on my personal laptop. Can you please suggest the hardware requirements for this?