2ndQuadrant is now part of EDB

Bringing together some of the world's top PostgreSQL experts.

2ndQuadrant | PostgreSQL
Mission Critical Databases
  • Contact us
  • EN
    • FR
    • IT
    • ES
    • DE
    • PT
  • Support & Services
  • Products
  • Downloads
    • Installers
      • Postgres Installer
      • 2UDA – Unified Data Analytics
    • Whitepapers
      • Business Case for PostgreSQL Support
      • Security Best Practices for PostgreSQL
    • Case Studies
      • Performance Tuning
        • BenchPrep
        • tastyworks
      • Distributed Clusters
        • ClickUp
        • European Space Agency (ESA)
        • Telefónica del Sur
        • Animal Logic
      • Database Administration
        • Agilis Systems
      • Professional Training
        • Met Office
        • London & Partners
      • Database Upgrades
        • Alfred Wegener Institute (AWI)
      • Database Migration
        • International Game Technology (IGT)
        • Healthcare Software Solutions (HSS)
        • Navionics
  • Postgres Learning Center
    • Webinars
      • Upcoming Webinars
      • Webinar Library
    • Whitepapers
      • Business Case for PostgreSQL Support
      • Security Best Practices for PostgreSQL
    • Blog
    • Training
      • Course Catalogue
    • Case Studies
      • Performance Tuning
        • BenchPrep
        • tastyworks
      • Distributed Clusters
        • ClickUp
        • European Space Agency (ESA)
        • Telefónica del Sur
        • Animal Logic
      • Database Administration
        • Agilis Systems
      • Professional Training
        • Met Office
        • London & Partners
      • Database Upgrades
        • Alfred Wegener Institute (AWI)
      • Database Migration
        • International Game Technology (IGT)
        • Healthcare Software Solutions (HSS)
        • Navionics
    • Books
      • PostgreSQL 11 Administration Cookbook
      • PostgreSQL 10 Administration Cookbook
      • PostgreSQL High Availability Cookbook – 2nd Edition
      • PostgreSQL 9 Administration Cookbook – 3rd Edition
      • PostgreSQL Server Programming Cookbook – 2nd Edition
      • PostgreSQL 9 Cookbook – Chinese Edition
    • Videos
    • Events
    • PostgreSQL
      • PostgreSQL – History
      • Who uses PostgreSQL?
      • PostgreSQL FAQ
      • PostgreSQL vs MySQL
      • The Business Case for PostgreSQL
      • Security Information
      • Documentation
  • About Us
    • About 2ndQuadrant
    • 2ndQuadrant’s Passion for PostgreSQL
    • News
    • Careers
    • Team Profile
  • Blog
  • Menu Menu
You are here: Home1 / Blog2 / Mark's PlanetPostgreSQL3 / Adventure in programming languages and simple statistics
Mark Wong

Adventure in programming languages and simple statistics

January 3, 2020/6 Comments/in Mark's PlanetPostgreSQL /by Mark Wong

This is a story about how I found myself trying programming languages.

I’ve been running an OLTP type database test (DBT-2, if you’ve heard of it), and noticed the post-processing reporting scripts for analyzing the database test results were running longer than I thought they would. I was running tests on 16 systems at the same time and one particular script was taking just over 60 seconds to run per system. Although I was running the tests simultaneously, I was processing the data from each system sequentially so that processing time added up. While the lack of parallelism is another aspect that could be improved, that got me thinking about whether the script itself could be improved or if this is going to be the norm for analyzing a 13 million row CSV file that is about 500 MB in size. Here is a sample of the data:


1572490811,d,C,0.009459,140665731475200
1572491011,START,,,
1572490813,n,C,0.004410,140677646186240
1572491601,TERMINATED,,,-99952896

The post-processing script is currently written in R. It’s less than 50 lines of code where half of that is processing the data and the other half is displaying it. The processing is basically identifying phases of the test (ramping up the test, test execution, etc.), aggregating counts of successful and failed transactions, and calculating the mean and quantile of the response times per transaction type. At the time, I thought it would be fun to learn how to do basic statistics with a language like R.

Previously the script was written in Python, but not taking advantage of any advanced data types. The Python script reads the data file line by line and counts transactions as it is reading the file. I dusted off the old script from the repository and found that it took about 25 seconds to process the same file.

That got me to thinking… Do I really need data frames to do this? The Python script has about 3 times as many lines of code that I attribute to being the difference between using data frames and not, as opposed to using Python vs R. 150 lines of code doesn’t like a lot, right? Should I have kept using the Python version of the script? Maybe I should try Pandas?

Ultimately, I didn’t try to answer any of those questions.

Recently Julia caught my eye because I saw something that mentioned how Julia has joined the “Petaflop Club”, which includes C, C++, and Fortran. For the moment I became more interested in seeing what Julia is like before revisiting Python or R.

After building a similar script to R using data frames, and another script similar in idea to the previous Python script, it turns out that using data frames in Julia took about 45 seconds while not using data frames ran in about 20 seconds.

After all of that, I think I created more questions than provided answers. Is reducing code size worth the difference in execution time? How does Pandas compare? Should I try R without data frames? Perhaps I need some help using each of those programming languages better to perform those simple statistics more efficiently? Maybe I should try more programming language?

Tags: julia, python, r
Share this entry
  • Share on Facebook
  • Share on Twitter
  • Share on WhatsApp
  • Share on LinkedIn
6 replies
  1. Pachot Franck
    Pachot Franck says:
    January 3, 2020 at 1:14 am

    Hi, did you try to load it in a database and run the analysis / report with simple SQL? I expect it to be faster.

    Reply
    • Adrian Klaver
      Adrian Klaver says:
      January 5, 2020 at 10:14 pm

      I was thinking along these lines also. Some testing:

      create table mix_log(fld_1 varchar, fld_2 varchar, fld_3 varchar, fld_4 varchar, fld_5 varchar);

      test=# copy mix_log from ‘/home/postgres/mix.log’ WITH (format ‘csv’);
      COPY 13118220
      Time: 28580.395 ms (00:28.580)
      test=# select sum(fld_4::numeric) from mix_log where fld_2 in(‘d’, ‘n’, ‘o’, ‘p’, ‘s’) group by fld_2;
      sum
      —————-
      105826.353223
      1030528.609145
      78114.213360
      909217.899358
      80922.426374
      (5 rows)
      Time: 1984.299 ms (00:01.984)

      So loading the data is the choke point. If that could be done as part of the original test instead of writing to a CSV file then it might pencil out.

      Reply
    • Henrietta
      Henrietta says:
      February 1, 2020 at 11:21 pm

      Like:)

      Reply
  2. Anders
    Anders says:
    January 3, 2020 at 1:38 am

    I would probably start by doing a “time wc -l mix.log” (if on a *nix system) and use that as a lower bound on the processing time on that system. It might be faster processing a compressed file (if you are io-bound).

    Reply
  3. Eli Flanagan
    Eli Flanagan says:
    January 3, 2020 at 5:54 pm

    Hi Mark,
    Thanks for posting; this was a fun read.

    I’ve been curious about the balance between analytic data processing within databases or frameworks compared to ad-hoc processing.
    I’ve been learning the Rust programming language for these kinds of tasks. I’m wondering how it would compare. I know of 2 libraries in Rust that could make it quite easy to implement:

    https://docs.rs/csv/1.1.1/csv/tutorial/index.html – CSV processing library with exemplary tutorials and documentation.
    https://bheisler.github.io/criterion.rs/book/criterion_rs.html – a “statistics-driven” benchmarking library.

    I’m with you. More questions than time to answer!

    Reply
  4. Henrietta
    Henrietta says:
    February 1, 2020 at 11:23 pm

    I am absolutely sure that in general, the sie of the coded does not correlate with the execution time!

    Reply

Leave a Reply

Want to join the discussion?
Feel free to contribute!

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Search

Get in touch with us!

Recent Posts

  • Random Data December 3, 2020
  • Webinar: COMMIT Without Fear – The Beauty of CAMO [Follow Up] November 13, 2020
  • Full-text search since PostgreSQL 8.3 November 5, 2020
  • Random numbers November 3, 2020
  • Webinar: Best Practices for Bulk Data Loading in PostgreSQL [Follow Up] November 2, 2020

Featured External Blogs

Tomas Vondra's Blog

Our Bloggers

  • Simon Riggs
  • Alvaro Herrera
  • Andrew Dunstan
  • Craig Ringer
  • Francesco Canovai
  • Gabriele Bartolini
  • Giulio Calacoci
  • Ian Barwick
  • Marco Nenciarini
  • Mark Wong
  • Pavan Deolasee
  • Petr Jelinek
  • Shaun Thomas
  • Tomas Vondra
  • Umair Shahid

PostgreSQL Cloud

2QLovesPG 2UDA 9.6 backup Barman BDR Business Continuity community conference database DBA development devops disaster recovery greenplum Hot Standby JSON JSONB logical replication monitoring OmniDB open source Orange performance PG12 pgbarman pglogical PG Phriday postgres Postgres-BDR postgres-xl PostgreSQL PostgreSQL 9.6 PostgreSQL10 PostgreSQL11 PostgreSQL 11 PostgreSQL 11 New Features postgresql repmgr Recovery replication security sql wal webinar webinars

Support & Services

24/7 Production Support

Developer Support

Remote DBA for PostgreSQL

PostgreSQL Database Monitoring

PostgreSQL Health Check

PostgreSQL Performance Tuning

Database Security Audit

Upgrade PostgreSQL

PostgreSQL Migration Assessment

Migrate from Oracle to PostgreSQL

Products

HA Postgres Clusters

Postgres-BDR®

2ndQPostgres

pglogical

repmgr

Barman

Postgres Cloud Manager

SQL Firewall

Postgres-XL

OmniDB

Postgres Installer

2UDA

Postgres Learning Center

Introducing Postgres

Blog

Webinars

Books

Videos

Training

Case Studies

Events

About Us

About 2ndQuadrant

What does 2ndQuadrant Mean?

News

Careers 

Team Profile

© 2ndQuadrant Ltd. All rights reserved. | Privacy Policy
  • Twitter
  • LinkedIn
  • Facebook
  • Youtube
  • Mail
Webinar: You forgot to put the WHERE in DELETE? [Follow Up] Support for PostgreSQL’s System identifier in Barman Support for PostgreSQL’s System identifier in Barman
Scroll to top
×