Adventure in programming languages and simple statistics
This is a story about how I found myself trying programming languages.
I’ve been running an OLTP type database test (DBT-2, if you’ve heard of it), and noticed the post-processing reporting scripts for analyzing the database test results were running longer than I thought they would. I was running tests on 16 systems at the same time and one particular script was taking just over 60 seconds to run per system. Although I was running the tests simultaneously, I was processing the data from each system sequentially so that processing time added up. While the lack of parallelism is another aspect that could be improved, that got me thinking about whether the script itself could be improved or if this is going to be the norm for analyzing a 13 million row CSV file that is about 500 MB in size. Here is a sample of the data:
1572490811,d,C,0.009459,140665731475200
1572491011,START,,,
1572490813,n,C,0.004410,140677646186240
1572491601,TERMINATED,,,-99952896
The post-processing script is currently written in R. It’s less than 50 lines of code where half of that is processing the data and the other half is displaying it. The processing is basically identifying phases of the test (ramping up the test, test execution, etc.), aggregating counts of successful and failed transactions, and calculating the mean and quantile of the response times per transaction type. At the time, I thought it would be fun to learn how to do basic statistics with a language like R.
Previously the script was written in Python, but not taking advantage of any advanced data types. The Python script reads the data file line by line and counts transactions as it is reading the file. I dusted off the old script from the repository and found that it took about 25 seconds to process the same file.
That got me to thinking… Do I really need data frames to do this? The Python script has about 3 times as many lines of code that I attribute to being the difference between using data frames and not, as opposed to using Python vs R. 150 lines of code doesn’t like a lot, right? Should I have kept using the Python version of the script? Maybe I should try Pandas?
Ultimately, I didn’t try to answer any of those questions.
Recently Julia caught my eye because I saw something that mentioned how Julia has joined the “Petaflop Club”, which includes C, C++, and Fortran. For the moment I became more interested in seeing what Julia is like before revisiting Python or R.
After building a similar script to R using data frames, and another script similar in idea to the previous Python script, it turns out that using data frames in Julia took about 45 seconds while not using data frames ran in about 20 seconds.
After all of that, I think I created more questions than provided answers. Is reducing code size worth the difference in execution time? How does Pandas compare? Should I try R without data frames? Perhaps I need some help using each of those programming languages better to perform those simple statistics more efficiently? Maybe I should try more programming language?
Hi, did you try to load it in a database and run the analysis / report with simple SQL? I expect it to be faster.
I was thinking along these lines also. Some testing:
create table mix_log(fld_1 varchar, fld_2 varchar, fld_3 varchar, fld_4 varchar, fld_5 varchar);
test=# copy mix_log from ‘/home/postgres/mix.log’ WITH (format ‘csv’);
COPY 13118220
Time: 28580.395 ms (00:28.580)
test=# select sum(fld_4::numeric) from mix_log where fld_2 in(‘d’, ‘n’, ‘o’, ‘p’, ‘s’) group by fld_2;
sum
—————-
105826.353223
1030528.609145
78114.213360
909217.899358
80922.426374
(5 rows)
Time: 1984.299 ms (00:01.984)
So loading the data is the choke point. If that could be done as part of the original test instead of writing to a CSV file then it might pencil out.
Like:)
I would probably start by doing a “time wc -l mix.log” (if on a *nix system) and use that as a lower bound on the processing time on that system. It might be faster processing a compressed file (if you are io-bound).
Hi Mark,
Thanks for posting; this was a fun read.
I’ve been curious about the balance between analytic data processing within databases or frameworks compared to ad-hoc processing.
I’ve been learning the Rust programming language for these kinds of tasks. I’m wondering how it would compare. I know of 2 libraries in Rust that could make it quite easy to implement:
https://docs.rs/csv/1.1.1/csv/tutorial/index.html – CSV processing library with exemplary tutorials and documentation.
https://bheisler.github.io/criterion.rs/book/criterion_rs.html – a “statistics-driven” benchmarking library.
I’m with you. More questions than time to answer!
I am absolutely sure that in general, the sie of the coded does not correlate with the execution time!