2ndQuadrant is now part of EDB

Bringing together some of the world's top PostgreSQL experts.

2ndQuadrant | PostgreSQL
Mission Critical Databases
  • Contact us
  • EN
    • FR
    • IT
    • ES
    • DE
    • PT
  • Support & Services
  • Products
  • Downloads
    • Installers
      • Postgres Installer
      • 2UDA – Unified Data Analytics
    • Whitepapers
      • Business Case for PostgreSQL Support
      • Security Best Practices for PostgreSQL
    • Case Studies
      • Performance Tuning
        • BenchPrep
        • tastyworks
      • Distributed Clusters
        • ClickUp
        • European Space Agency (ESA)
        • Telefónica del Sur
        • Animal Logic
      • Database Administration
        • Agilis Systems
      • Professional Training
        • Met Office
        • London & Partners
      • Database Upgrades
        • Alfred Wegener Institute (AWI)
      • Database Migration
        • International Game Technology (IGT)
        • Healthcare Software Solutions (HSS)
        • Navionics
  • Postgres Learning Center
    • Webinars
      • Upcoming Webinars
      • Webinar Library
    • Whitepapers
      • Business Case for PostgreSQL Support
      • Security Best Practices for PostgreSQL
    • Blog
    • Training
      • Course Catalogue
    • Case Studies
      • Performance Tuning
        • BenchPrep
        • tastyworks
      • Distributed Clusters
        • ClickUp
        • European Space Agency (ESA)
        • Telefónica del Sur
        • Animal Logic
      • Database Administration
        • Agilis Systems
      • Professional Training
        • Met Office
        • London & Partners
      • Database Upgrades
        • Alfred Wegener Institute (AWI)
      • Database Migration
        • International Game Technology (IGT)
        • Healthcare Software Solutions (HSS)
        • Navionics
    • Books
      • PostgreSQL 11 Administration Cookbook
      • PostgreSQL 10 Administration Cookbook
      • PostgreSQL High Availability Cookbook – 2nd Edition
      • PostgreSQL 9 Administration Cookbook – 3rd Edition
      • PostgreSQL Server Programming Cookbook – 2nd Edition
      • PostgreSQL 9 Cookbook – Chinese Edition
    • Videos
    • Events
    • PostgreSQL
      • PostgreSQL – History
      • Who uses PostgreSQL?
      • PostgreSQL FAQ
      • PostgreSQL vs MySQL
      • The Business Case for PostgreSQL
      • Security Information
      • Documentation
  • About Us
    • About 2ndQuadrant
    • 2ndQuadrant’s Passion for PostgreSQL
    • News
    • Careers
    • Team Profile
  • Blog
  • Menu Menu
You are here: Home1 / Blog2 / Data Mining3 / Association rules with MADlib in Greenplum
Carlo Ascani

Association rules with MADlib in Greenplum

September 9, 2011/0 Comments/in Data Mining, Greenplum /by Carlo Ascani

[*MADlib*](http://madlib.net) is an open-source library for scalable in-database analytics which targets the PostgreSQL and the Greenplum databases.
MADlib version 0.2beta needs to be installed properly to follow this article, so we encourage you to read the [official documentation](http://github.com/madlib/madlib/wiki/Installation-Guide-%28v0.2beta%29) to install it in a Greenplum database.
I’m going to show you how to perform Association Rules using MADlib and Greenplum.


A bit of theory first: Association Rules is an unsupervised data mining technique. Its purpose is to find relations between variables in a data set. The most common example is the analysis of market baskets and its purpose is to automatically discover rules in the form of “People that buy A tend to buy B also”.
This algorithm has become famous with Amazon’s “Customers who purchased this item purchased also …” suggestion feature.
In order to show Association Rules basics we’ll go trough an example, based on the [ht://Miner](www.htminer.org) project, an open-source web analytics tool for PostgreSQL developed by 2ndQuadrant’s engineer Gabriele Bartolini. Let’s think of a large website, with several URLs. Through regular expressions we can organise and classify these URLs in human readable categories (ht://Miner has this fancy feature that allows to create several classification hierarchies for URLs).
Our dataset is composed by a transaction\_id representing a user session (conceptually similar to our basket) and a category\_name (the product) representing a particular category that the user has visited during his/her session.
Our goal with Association Rules is to be able to say something like “a user that has visited this category ALSO visited this other category with a percentage of X”.
Fundamental terms are *support* and *confidence*.
Support is the proportion of transactions in the observed data set that contain a particular item (or a set of items). In a rule “A => B” (which reads “if A then B”), confidence can be considered a reasonable estimate of the probability of finding the item B in those transactions that already contain the element A (conditional probability B given A).
Given that Association Rules is an unsupervised technique, any algorithm would return all the possible combinations of products and their “probabilities”. In big datasets, this would require a lot of time and the majority of the results would be meaningless.
It become therefore important to define thresholds for both support and confidence, in order for the underlying algorithms to automatically discard unuseful rules. Hence you will be required to provide these two input filters.
MADlib makes relatively easy to run such a technique once you have a good dataset. Say that our dataset is a view called _my\_dataset_, of this form:


mydb=# \d my_dataset
View "public.my_dataset"
Column  |          Type          | Modifiers
----------+------------------------+-----------
trans_id | integer                |
category | character varying(255) |

We want to create a schema called ‘result’ which will hold the discovered association rules (it is not necessary, but it helps a lot when experimenting the library).


mydb=# CREATE SCHEMA result;
CREATE SCHEMA

The assoc\_rules function that MADLib exposes has a simple signature:


SELECT assoc_rules( support FLOAT,
confidence FLOAT,
id_column TEXT,
product_col TEXT,
input_table_or_view TEXT,
result_schema TEXT,
verbose BOOLEAN );

We will give it a go by setting 1% support level and 10% confidence – these values are arbitrary and in real-world analysis they require a data-mining expert with knowledge on the field of application):


htminer=# SELECT assoc_rules (.01, .1, 'trans_id', 'category', 'my_dataset', 'result', true);

The results containing the rules are stored as sparse vectors in the table assoc\_rules inside the schema named result. In our example, results will be:


mydb=# SELECT* FROM output_schema.assoc_rules ;
-[ RECORD 1 ]-+-------------------------------------------
set_list      | {62,1,1,1,275}:{0,1,0,1,0}
hash_key      | -224024595
subset_x      | {64,1,275}:{0,1,0}
subset_y      | {62,1,277}:{0,1,0}
support_xy    | 0.01243341004627416581
confidence_xy | 0.40355086372360844540
lift_xy       | 0.8741503174087748787351031737257072740180
conviction_xy | 0.90259284205400294972
-[ RECORD 2 ]-+-------------------------------------------
set_list      | {37,1,24,1,277}:{0,1,0,1,0}
hash_key      | -685706771
subset_x      | {37,1,302}:{0,1,0}
subset_y      | {62,1,277}:{0,1,0}
...

The Association Rules technique is used in many applications areas, from web analytics to marketing. With MADlib, it is possible to take advantage of Greenplum capabilities to manage large amount of data in order to get very large (and therefore significant) data sets and to perform analytics calculations on them with ease.

Tags: association rules, data mining, greenplum, madlib, postgres, PostgreSQL
Share this entry
  • Share on Facebook
  • Share on Twitter
  • Share on WhatsApp
  • Share on LinkedIn
0 replies

Leave a Reply

Want to join the discussion?
Feel free to contribute!

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Search

Get in touch with us!

Recent Posts

  • Random Data December 3, 2020
  • Webinar: COMMIT Without Fear – The Beauty of CAMO [Follow Up] November 13, 2020
  • Full-text search since PostgreSQL 8.3 November 5, 2020
  • Random numbers November 3, 2020
  • Webinar: Best Practices for Bulk Data Loading in PostgreSQL [Follow Up] November 2, 2020

Featured External Blogs

Tomas Vondra's Blog

Our Bloggers

  • Simon Riggs
  • Alvaro Herrera
  • Andrew Dunstan
  • Craig Ringer
  • Francesco Canovai
  • Gabriele Bartolini
  • Giulio Calacoci
  • Ian Barwick
  • Marco Nenciarini
  • Mark Wong
  • Pavan Deolasee
  • Petr Jelinek
  • Shaun Thomas
  • Tomas Vondra
  • Umair Shahid

PostgreSQL Cloud

2QLovesPG 2UDA 9.6 backup Barman BDR Business Continuity community conference database DBA development devops disaster recovery greenplum Hot Standby JSON JSONB logical replication monitoring OmniDB open source Orange performance PG12 pgbarman pglogical PG Phriday postgres Postgres-BDR postgres-xl PostgreSQL PostgreSQL 9.6 PostgreSQL10 PostgreSQL11 PostgreSQL 11 PostgreSQL 11 New Features postgresql repmgr Recovery replication security sql wal webinar webinars

Support & Services

24/7 Production Support

Developer Support

Remote DBA for PostgreSQL

PostgreSQL Database Monitoring

PostgreSQL Health Check

PostgreSQL Performance Tuning

Database Security Audit

Upgrade PostgreSQL

PostgreSQL Migration Assessment

Migrate from Oracle to PostgreSQL

Products

HA Postgres Clusters

Postgres-BDR®

2ndQPostgres

pglogical

repmgr

Barman

Postgres Cloud Manager

SQL Firewall

Postgres-XL

OmniDB

Postgres Installer

2UDA

Postgres Learning Center

Introducing Postgres

Blog

Webinars

Books

Videos

Training

Case Studies

Events

About Us

About 2ndQuadrant

What does 2ndQuadrant Mean?

News

Careers 

Team Profile

© 2ndQuadrant Ltd. All rights reserved. | Privacy Policy
  • Twitter
  • LinkedIn
  • Facebook
  • Youtube
  • Mail
Soon! 4-Days Admin Training in German ETL with Talend and Greenplum – Part one: connections
Scroll to top
×