Big Data Analytics: Tablesample, Orange, 2UDA
A lot is being said about the new ‘tablesample’ feature of PostgreSQL 9.5. The ability to retrieve a random sample of data in a short amount of time from a very large table makes it ideal to use tablesample as part of big data analytics. To demonstrate my point, I created 6 tables using the same DDL and then inserted varying amounts of data in them. I ended up with tables of row count 1k, 100k, 1m, 5m, 10m, & 100m. I turned on timings in psql and then ran a simple select count(*) query against each of these table. The time taken by the query to return, along with the size of these respective tables is given below: **** ****
Number of Rows | Time Taken (ms) | Size on Disk (MB) |
1k | 219.706 | 0.23 |
100k | 1302.135 | 24 |
1m | 7696.386 | 195 |
5m | 40691.603 | 951 |
10m | 60012.457 | 1923 |
100m | 801493.319 | 19456 |
**** **** These numbers are from a PostgreSQL 9.5 server running locally on my laptop. As any big data expert will tell you, 100 million rows isn’t exactly a huge amount of data by today’s standards. Even then, I had to wait more than 13 minutes before a simple count query returned with results to me. Just imagine what would happen if I wanted statistics and some mathematical calculations to be done as part of my data analytics routine. Enter tablesample. As Gulcin mentions in her very informative blog here, tablesample can be used to retrieve a random sample of data from a potentially very large table. In conjunction with the extension tsm_system_time, you could also bind the query by time … i.e. tell it to return with whatever number of rows it has gathered (random sampling, of course!) in an X amount of time. Now imagine using this feature to create visualizations on big data very quickly without having to write SQL code. Enter Orange. It is exactly this feature that the team at University of Ljubljana has been able to exploit while creating the 3rd version of their data visualization tool, Orange. What Orange essentially does is that it fetches a random sample of data that it can retrieve within a second from a large table and then create statistical visualizations from this random sample. This enables a data analyst to very quickly be able to visualize statistical patterns in data, no matter how big the table is. Now imagine bringing together Orange & PostgreSQL 9.5 in a crisp, easy-to-use installation package that laymen can easily install on Windows, OSX, or Linux. Enter 2UDA (pronounced tudor). 2ndQuadrant’s Unified Data Analytics tool combines PostgreSQL 9.5 with Orange3 & LibreOffice5 in a nice, easy to use, GUI based installer available for Windows, OSX, & Linux. It has Orange & PostgreSQL pre-configured to work seamlessly with each other and utilize the tablesample feature to its fullest to create useful visualization quickly and efficiently. 2ndQuadrant announced today the Beta release of 2UDA that you can download here. Go head, take a look … it’s free to download and unlimited use. I am sure you will enjoy it 🙂
Very informative Umair. Some useful insights.