How to use the KNN Machine Learning Model with 2UDA – PostgreSQL and Orange (Part 1) - 2ndQuadrant

This article gives a step by step guide to utilizing Machine Learning capabilities with 2UDA. In the article, we’ll use an example of Animals to predict whether they are mammals, Birds, Fish or Insects.

Software versions

We’re going to use 2UDA version 11.6-1 to implement the Machine Learning model. 2UDA version 11.6-1 combines:

PostgreSQL 11.6
Orange 3.23.0

You can find the latest version of 2UDA here.

Step 1: Load training dataset into PostgreSQL

The sample dataset that is used to train our model is available at the official Orange GitHub repository here.

Follow these steps to load the training data into PostgreSQL tables:

Connect to PostgreSQL via psql, OmniDB or any other tool that you are familiar with.
Create a table to store our training data. Here it is named as training_data.

CREATE TABLE training_data(
name VARCHAR (100), hair integer, feathers integer, eggs integer, milk integer, airborne integer, aquatic integer, predator integer, toothed integer, backbone integer, breathes integer, venomous integer, fins integer, legs integer, tail integer, domestic integer, catsize integer, type VARCHAR (100) 
);

Insert training data into the table via COPY query. Before executing COPY query make sure that PostgreSQL has required read permissions on the data file otherwise COPY operation will fail.
NOTE: Please make sure you type a tab space in between single quotes after the delimiter keyword.

COPY training_data FROM 'Path_to_training_data_file’ with delimiter '	' csv header;

Please find the screenshot of training dataset bellow

NOTE: Rows two and three of the training dataset in the .tab file contain some meta information. Since it is not needed at this point, it has been removed from the file.

Step 2: Create workflow with Orange

Go to the desktop and double click on the Orange icon.
This is what the start-up page looks like. Select New option and it will create a blank project.

Now you are ready to apply the Machine Learning model on the dataset.

Step 3: Select Machine Learning model to train the data

For this article, k-nearest neighbours (KNN) Machine Learning model is used to train the data. Once the data training process is complete, In the next step test data is passed to the Prediction widget to check the accuracy of predictions.

Step 4: Import training data from PostgreSQL into Orange

This training dataset will be used to train the Machine Learning model.

Drag and Drop SQL Table widget from the Data menu.
Rename widget (optional)
1. Right-click on the SQL Table widget.
2. Select Rename.
Connect with PostgreSQL to load the training dataset:
1. Double click on the Training data widget.
2. Enter credentials to connect to the PostgreSQL database.
3. Press the reload button to load all the available tables from the given database.
4. Select training_data table from Drop down menu and close the pop-up.

Step 5: Add Target column

This step is important because the Machine Learning model will try to predict the data for this target variable/column:

Drag and drop Select Columns widget from the data menu.
Double click on the Select Columns widget.
Search your target column under Features label. Here, type is used as a target variable because we need to see what type a given animal is.
Drag and drop it under Target Variable box and close the pop-up.

Step 6: Columns ranking

You can Rank or Score the training variable/columns according to their correlation with the target column.

Drag and drop Rank widget from the data menu.
Draw a link line from Select columns widget to Rank widget .
Double click on the Rank widget to see the most related columns in the training data table. It will be selecting the top 5 columns by default.

Step 7: Data training

In this step, the Machine Learning Model (KNN) will be trained with the training dataset. Please follow the following steps:

Drag and drop KNN widget from the Model menu.
Draw a link line from Rank widget to KNN widget.

Step 8: Load test dataset into PostgreSQL

A separate test dataset is created to perform predictions. Please follow the steps to load test dataset into PostgreSQL table.

Create a table to store our test data. Here it is named as test_data.

CREATE TABLE test_data(
name VARCHAR (100), hair integer, feathers integer, eggs integer, milk integer, airborne integer, aquatic integer, predator integer, toothed integer, backbone integer, breathes integer, venomous integer, fins integer, legs integer, tail integer, domestic integer, catsize integer, type VARCHAR (100)
);

Insert test data into the test table via COPY query. Before executing COPY query please make sure that PostgreSQL has required read permissions on the data file otherwise COPY operation will fail.

NOTE: Please make sure you type a tab space between single quotes after the delimiter keyword. A question mark is intentionally placed in the type column of the test dataset because we need to figure out the type of a given animal with our Machine Learning model.

COPY test_data FROM 'Path_to_test_data_file’ with delimiter '	' csv header;

Please find the screenshot of test dataset bellow

Step 9: Import the test data from PostgreSQL into Orange

Please follow the following steps to apply the predictions.

Drag and drop SQL Table widget from the data menu.
Rename widget (Optional)
1. Right-click on the SQL Table widget.
2. Select Rename.
Connect with PostgreSQL to load test data.
1. Double click on Test data widget.
2. Connect it with Test data table from PostgreSQL.

Now we are ready to perform predictions.

Step 10: Predictions

Prediction widget will try to predict the test data based on training data from KNN.

Drag and drop Prediction widget from the Evaluate menu.
Draw a link line form Test data widget to Prediction widget.
Draw a link line from KNN widget to Prediction widget.

Step 11: Results

Double click on Prediction widget to view the results.

Understanding the Results

You will see 2 main tables in the prediction window. The table on the left side shows the predicted results, while the table on the right shows the original test data, which was provided for predictions.

Since the KNN model was used to train data so you will see one column named KNN that lists the results.

As we know:

Horse is a Mammal
Trout is a Fish
Turkey is a Bird

So KNN is able to determine all the types correctly.

Predictions Accuracy

If you see the table on the left side in the prediction widget’s output, it has some numbers before the predicted type i.e, 1.00. 0.00 These numbers show the accuracy of the predicted type.

We have used 7 types of animals in the training dataset, so it shows a total number of 7 columns with accuracy values each column will represent 1 type of animal. You can check which column is representing what type of animal by looking at the list available on the left side of your screen under Predicted probabilities for label. If you look at the first row which says Turkey is a Bird. We can see Its accuracy is 1.00 (100% from 2nd column). Same goes with other examples Trout is a Fish and its accuracy is 1.00 (100% from 3rd column).

In this article, we have used the k-nearest neighbours’ algorithm (KNN) to implement Machine Learning model. In the next blog, we will be using the Support Vector Machine (SVM) model.

2ndQuadrant is now part of EDB

How to use the KNN Machine Learning Model with 2UDA – PostgreSQL and Orange (Part 1)