pgpredict - Tutorial
The tutorial below assumes you have already downloaded the pgpredict package from its homepage.
The directory 'pgpredict/examples' contains small code snippets that can be used to test pgpredict along with the short tutorial below.
To generate a table for testing purposes use the 'generate_customers.sql' script.
The number of rows generated can be increased for a Big Data test if enough disk space is available.
Predictive Analytics in Orange
1) Open Orange and load the 'customers' data table with the SQL Table widget.
2) Explore the data (e.g. Box Plot, Distributions, Scatter Map, Scatter Plot, ...).
3) To build predictive models, Orange relies heavily on scikit-learn, which works with in-memory data. So if your data table has many rows, sample and download the data first using the Data Sampler widget.
4) Set the input ('age', 'wage', 'visits') and target ('spent') variables in the Select Columns widget.
The 'id' variable should be moved to Meta Attributes so it is not used in modeling.
Remove the 'premium' variable for now. It can be chosen as a categorical target variable later, for a classification instead of a regression demo.
5) Build a predictive model with the Linear Regression widget.
6) Optionally, the model can be evaluated with the Test & Score widget. Compare the error of the Linear Regression model to that of Mean learner, which predicts the average value for all customers.
7) Export the predictive model to a file with the Export Model widget.
Deploying the Predictive Model in PostgreSQL
Make sure to execute 'pgpredict/predictive_analytics.sql' before first use, to create the schema "orange" and the necessary functions.
The script 'demo_customers.sql' shows how to load the exported model in a PostgreSQL database and make real-time or batch predictions.
The main steps are:
1) Loading the model (provide the model name, version, and filename):
SELECT orange.load_model('customers_spent', '1', 'customers_spent.json');
Because the postgres server process needs access to the model file, make sure the file has appropriate read permissions.
2) Make a prediction based on the input variables:
SELECT spent, orange.predict_numeric(ARRAY[age, wage, visits], 'customers_spent') FROM customers WHERE id = 42; spent | predict_numeric ------------------+----------------- 141.406325643697 | 127.707
While the actual numbers obtained will vary because of random data generation, the actual value of 'spent' and the prediction should be relatively close.
3) Make batch predictions for a set of customers, and compare the predicted and actual values to compute the mean absolute error (MAE):
SELECT avg(abs(spent - orange.predict_numeric(ARRAY[age, wage, visits], 'customers_spent'))) FROM customers WHERE id < 100000; avg ------------------ 16.7719005959281