Association rules with MADlib in Greenplum
[*MADlib*](http://madlib.net) is an open-source library for scalable in-database analytics which targets the PostgreSQL and the Greenplum databases.
MADlib version 0.2beta needs to be installed properly to follow this article, so we encourage you to read the [official documentation](http://github.com/madlib/madlib/wiki/Installation-Guide-%28v0.2beta%29) to install it in a Greenplum database.
I’m going to show you how to perform Association Rules using MADlib and Greenplum.
A bit of theory first: Association Rules is an unsupervised data mining technique. Its purpose is to find relations between variables in a data set. The most common example is the analysis of market baskets and its purpose is to automatically discover rules in the form of “People that buy A tend to buy B also”.
This algorithm has become famous with Amazon’s “Customers who purchased this item purchased also …” suggestion feature.
In order to show Association Rules basics we’ll go trough an example, based on the [ht://Miner](www.htminer.org) project, an open-source web analytics tool for PostgreSQL developed by 2ndQuadrant’s engineer Gabriele Bartolini. Let’s think of a large website, with several URLs. Through regular expressions we can organise and classify these URLs in human readable categories (ht://Miner has this fancy feature that allows to create several classification hierarchies for URLs).
Our dataset is composed by a transaction\_id representing a user session (conceptually similar to our basket) and a category\_name (the product) representing a particular category that the user has visited during his/her session.
Our goal with Association Rules is to be able to say something like “a user that has visited this category ALSO visited this other category with a percentage of X”.
Fundamental terms are *support* and *confidence*.
Support is the proportion of transactions in the observed data set that contain a particular item (or a set of items). In a rule “A => B” (which reads “if A then B”), confidence can be considered a reasonable estimate of the probability of finding the item B in those transactions that already contain the element A (conditional probability B given A).
Given that Association Rules is an unsupervised technique, any algorithm would return all the possible combinations of products and their “probabilities”. In big datasets, this would require a lot of time and the majority of the results would be meaningless.
It become therefore important to define thresholds for both support and confidence, in order for the underlying algorithms to automatically discard unuseful rules. Hence you will be required to provide these two input filters.
MADlib makes relatively easy to run such a technique once you have a good dataset. Say that our dataset is a view called _my\_dataset_, of this form:
mydb=# \d my_dataset
View "public.my_dataset"
Column | Type | Modifiers
----------+------------------------+-----------
trans_id | integer |
category | character varying(255) |
We want to create a schema called ‘result’ which will hold the discovered association rules (it is not necessary, but it helps a lot when experimenting the library).
mydb=# CREATE SCHEMA result;
CREATE SCHEMA
The assoc\_rules function that MADLib exposes has a simple signature:
SELECT assoc_rules( support FLOAT,
confidence FLOAT,
id_column TEXT,
product_col TEXT,
input_table_or_view TEXT,
result_schema TEXT,
verbose BOOLEAN );
We will give it a go by setting 1% support level and 10% confidence – these values are arbitrary and in real-world analysis they require a data-mining expert with knowledge on the field of application):
htminer=# SELECT assoc_rules (.01, .1, 'trans_id', 'category', 'my_dataset', 'result', true);
The results containing the rules are stored as sparse vectors in the table assoc\_rules inside the schema named result. In our example, results will be:
mydb=# SELECT* FROM output_schema.assoc_rules ;
-[ RECORD 1 ]-+-------------------------------------------
set_list | {62,1,1,1,275}:{0,1,0,1,0}
hash_key | -224024595
subset_x | {64,1,275}:{0,1,0}
subset_y | {62,1,277}:{0,1,0}
support_xy | 0.01243341004627416581
confidence_xy | 0.40355086372360844540
lift_xy | 0.8741503174087748787351031737257072740180
conviction_xy | 0.90259284205400294972
-[ RECORD 2 ]-+-------------------------------------------
set_list | {37,1,24,1,277}:{0,1,0,1,0}
hash_key | -685706771
subset_x | {37,1,302}:{0,1,0}
subset_y | {62,1,277}:{0,1,0}
...
The Association Rules technique is used in many applications areas, from web analytics to marketing. With MADlib, it is possible to take advantage of Greenplum capabilities to manage large amount of data in order to get very large (and therefore significant) data sets and to perform analytics calculations on them with ease.
Leave a Reply
Want to join the discussion?Feel free to contribute!