-
Configuring retention policies in Barman
In our previous article we went through describing what retention policies are and how they can be enforced on your PostgreSQL server backups with Barman 1.2. In this post, we will go through the configuration aspects.
For the sake of simplicity, we assume a typical scenario which involves taking full backups once a week through the “barman backup” command. Suppose you want to automatically keep the latest 4 backups and let Barman automatically delete the old ones (obsolete).
The main configuration option for retention policies in Barman is “retention_policy” which can be defined both at global or server level. If you want all your servers by default to keep the last 4 periodical backups, you need to add in the general section of Barman’s configuration file the following line:
[barman] ... // General settings retention_policy: REDUNDANCY 4
When the next “barman cron” command is executed (every minute if you installed Barman using RPMs or Debian/Ubuntu packages), Barman checks for the number of available full periodical backups for every server, order them in descending chronological order (from the most recent to the oldest one) and deletes backups from the 5th position onwards.
In case you have several servers backed up on the same Barman host and you want to differentiate the retention policy for a specific server, you can simply edit that server configuration section (or file, see “Managing the backup of several PostgreSQL servers with Barman“) and define a different setting:
[malcolm] description = Malcolm Rocks ssh_command = ssh malcolm conninfo = host=malcolm port=5432 user=postgres dbname=postgres retention_policy: REDUNDANCY 8
However, Barman allows systems administrators to manage retention policies based on time, in terms of recovery window and point of recoverability. For example, you can set another server to allow to recover at any point in time in the last 3 months:
[angus] description = Angus Rocks ssh_command = ssh angus conninfo = host=angus port=5432 user=postgres dbname=postgres retention_policy: RECOVERY WINDOW OF 3 MONTHS
Make sure you have enough space on the disk to store all the WAL files for every server you back up, and always monitor “barman check” through your alerting tools (such as Nagios/Icinga/Zabbix/etc.).
Current implementation of retention policies in Barman has some limitations: retention policies are managed only automatically (not manually – this would require to create a “barman delete –obsolete” command, for example) and there is no decoupling yet between full backups and WAL archive transactional logs (we have already thought of the “wal_retention_policy” option, but at the moment it is not handled).
More detailed information on retention policies can be found on Barman’s documentation website.
-
Nearest Big City
In this article, we want to find the town with the greatest number of inhabitants near a given location.

A very localized example
We first need to find and import some data, and I found at the following place a CSV listing of french cities with coordinates and population and some numbers of interest for the exercise here.
To import the data set, we first need a table, then a
COPYcommand:CREATE TABLE lion1906 ( insee text, nom text, altitude integer, code_postal text, longitude double precision, latitude double precision, pop99 bigint, surface double precision ); \copy lion1906 from 'villes.csv' with csv header delimiter ';' encoding 'latin1'
With that data in place, we can find the 10 nearest towns of a random choosing of us, let's pick Villeurbanne which is in the region of Lyon.
select code_postal, nom, pop99 from lion1906 order by point(longitude, latitude) <-> (select point(longitude, latitude) from lion1906 where nom = 'Villeurbanne') limit 10; code_postal | nom | pop99 -------------+------------------------+-------- 69100 | Villeurbanne | 124215 69300 | Caluire-et-Cuire | 41233 69120 | Vaulx-en-Velin | 39154 69580 | Sathonay-Camp | 4336 69140 | Rillieux-la-Pape | 28367 69000 | Lyon | 445452 69500 | Bron | 37369 69580 | Sathonay-Village | 1693 01700 | Neyron | 2157 69660 | Collonges-au-Mont-d'Or | 3420 (10 rows)
We find Lyon in our list in there, and we want the query now to return only that one as it has the greatest number of inhabitants in the list:
with neighbours as ( select code_postal, nom, pop99 from lion1906 order by point(longitude, latitude) <-> (select point(longitude, latitude) from lion1906 where nom = 'Villeurbanne') limit 10 ) select * from neighbours order by pop99 desc limit 1; code_postal | nom | pop99 -------------+------+-------- 69000 | Lyon | 445452 (1 row)
Well, thank you PostgreSQL, that was easy!
Note that you can actually index such queries, that's called a KNN index. PostgreSQL knows how to use some kind of indexes to fetch data matching an expression such as
ORDER BY a <-> b, which allow you to consider a KNN search in your application.Let's get worldwide
The real scope of our exercise is to associate every known town in the world with some big city around, so let's first fetch and import some worldwide data this time, from http://download.maxmind.com/download/worldcities/worldcitiespop.txt.gz.

CREATE TABLE maxmind_worldcities ( country_code text, city_lower text, city_normal text, region_code text DEFAULT '', population INT DEFAULT '0', latitude float8 DEFAULT '0', longitude float8 DEFAULT '0' ); \copy maxmind_worldcities FROM '/tmp/worldcitiespop.txt' WITH DELIMITER ',' QUOTE E'\f' CSV HEADER ENCODING 'LATIN1'; alter table maxmind_worldcities add column loc point; update maxmind_worldcities set loc = point(longitude, latitude);
This time you can see that I created an extra column with the location in there, so that I don't have to compute it each time I need it, like I did before.
Now is the time to test that data set and hopefully fetch the same result as before when we only had french cities loaded:
with neighbours as ( select country_code, city_lower, population from maxmind_worldcities where population is not null order by loc <-> (select loc from maxmind_worldcities where city_lower = 'villeurbanne') limit 10 ) select * from neighbours order by population desc limit 1; country_code | city_lower | population --------------+------------+------------ fr | lyon | 463700 (1 row)
Ok, looks like we're all set for the real problem. Now we want to pick for each of those cities it's nearest neighboor, so here's how to do that:
create index on maxmind_worldcities(country_code, region_code, city_lower); create index on maxmind_worldcities using gist(loc); create table maxmind_neighbours as select country_code, region_code, city_lower, (with neighbours as ( select country_code, city_lower, population from maxmind_worldcities where population is not null and country_code = wc.country_code and region_code = wc.region_code order by loc <-> wc.loc limit 10) select city_lower from neighbours order by population desc limit 1 ) as neighbour from maxmind_worldcities wc ;
To be fair, I have to tell you that this query took almost 2 hours to complete on my laptop here, but as I'm doing that for friend and a blog article, I've been lazy and didn't try to optimise it. It could be using
LATERALfor sure, I don't know if that would help very much with performances: I didn't try.With that in hands we can now check some cities and their biggest neighbours, as in the following query:
select * from maxmind_neighbours where city_lower = 'villeurbanne'; country_code | region_code | city_lower | neighbour --------------+-------------+--------------+----------- fr | B9 | villeurbanne | lyon (1 row)
And looking for New-York City suburbs I did find a chinatown, which is a pretty common smaller town name apparently:
select * from maxmind_neighbours where city_lower = 'chinatown'; country_code | region_code | city_lower | neighbour --------------+-------------+------------+--------------- sb | 08 | chinatown | honiara us | CA | chinatown | san francisco us | DC | chinatown | washington us | HI | chinatown | honolulu us | IL | chinatown | chicago us | MT | chinatown | missoula us | NV | chinatown | reno us | NY | chinatown | new york (8 rows)
Big Cities in the big world

We might need to change some of our views
So, let's see how many smaller towns each of those random big cities have:
select country_code, region_code, neighbour, count(*) from maxmind_neighbours where neighbour in ('london', 'new york', 'moscow', 'paris', 'tokyo', 'sao polo', 'chicago') group by country_code, region_code, neighbour; country_code | region_code | neighbour | count --------------+-------------+-----------+------- gb | H9 | london | 2 jp | 40 | tokyo | 414 us | NY | new york | 131 ca | 08 | london | 16 ru | 48 | moscow | 245 fr | A8 | paris | 16 us | IL | chicago | 13 (7 rows)
And now let's be fair and see where are the cities with the greatest number of towns nearby them, with the following query:
select country_code, region_code, neighbour, count(*) from maxmind_neighbours where neighbour is not null group by country_code, region_code, neighbour order by 4 desc limit 25; country_code | region_code | neighbour | count --------------+-------------+------------+------- cn | 03 | nanchang | 16759 cn | 26 | xian | 12864 id | 18 | kupang | 10715 cn | 24 | taiyuan | 10550 mm | 11 | taunggyi | 10253 id | 38 | makasar | 9471 ir | 15 | ahvaz | 9461 id | 01 | banda aceh | 9161 cn | 14 | lasa | 8841 cn | 15 | lanzhou | 8618 ir | 29 | kerman | 8579 id | 26 | medan | 7787 ir | 04 | iranshahr | 7249 ir | 07 | shiraz | 7219 ma | 55 | agadir | 7121 ir | 42 | mashhad | 7107 af | 08 | gazni | 7011 ir | 33 | tabriz | 6586 cn | 01 | hefei | 6521 bd | 81 | dhaka | 6480 ir | 08 | rasht | 6471 id | 17 | mataram | 6467 id | 33 | cilegon | 6287 af | 23 | qandahar | 6213 cn | 07 | fuzhou | 6089 (25 rows)
-
Retention of backups with Barman
Defining a disaster recovery plan involves defining backup policies. A key aspect of backup policies is to define how long backup data is retained for disaster recovery purposes. This applies to all digital content, including PostgreSQL databases.
Barman 1.2.0 introduces automated management of backup retention policies of PostgreSQL servers.
Retention policies were one of the midterm goals that we had given ourselves when the whole Barman idea began to take shape. I am glad that, thanks to the vision of a French company (which explicitly requested to remain anonymous), we were able to undertake their open-source development.
What are retention policies?
A retention policy is a user-defined set of guidelines and principles that determines how long backups and their related archive logs need to be retained for recovery procedures. In a PostgreSQL database scenario:
- with “backup” we refer to a full physical backup, performed periodically, when the database is online (hence the term “hot” very often used in this context);
- with “archive logs” we refer to Write Ahead Log (WAL) files, responsible for implementing differential backup by continuously archiving every change made to the data files of the database server.
Base backups and archive logs form the so-called “backup catalogue” and allow database administrators to perform Point-In-Time recovery operations.
Further information on PostgreSQL physical backup and continuous archiving can be found in the Postgres documentation.
Why are retention policies so important?
In some countries and environments, it is the law that requires ICT departments to enforce them for data security and protection reasons. In Italy, for instance, the “Codice dell’Amministrazione Digitale” (CAD) requires that public organisations report retention periods in official documents such as their Disaster Recovery Plan.
From a more practical point of view, retention policies improve automation of a backup solution, while reducing the management and configuration complexity.
How are they implemented in Barman?
Barman at any time retains:
- the periodical backups required to satisfy the current retention policy for a given PostgreSQL server;
- the archived WAL files required for the complete recovery of those backups.
On the same topic, you may wish to read one of my previous blog posts about the backup catalogue and the WAL archive in Barman.
Barman users can define a retention policy in terms of:
- backup redundancy (how many periodical backups), or
- a recovery window (how long).
- In case of a retention policy based on redundancy, the administrator decides how many periodical backups to keep.
- On the other hand, a retention policy based on recovery window allows the DBA to specify a period of time (recovery window). Barman ensures retention of backups and/or archived WAL files required for point-in-time recovery to any time during that recovery window.
- Retention policies are managed in Barman by the configuration option ‘retention_policy’ and the ‘barman cron’ command (responsible for maintenance operations). In the next article we will go through the configuration of retention policies with some simple examples.
- Where can I find Barman?
- Barman is an open source application for disaster recovery developed and maintained by 2ndQuadrant. It is written in Python and it can be installed through sources or PyPI. However, RPM packages and Debian packages are available for installation on RHEL/CentOS 5/6, Debian or Ubuntu 12.04 LTS.
- More information on Barman can be found on the website (www.pgbarman.org) and the available documentation.
-
Bulk Replication
In the previous article here we talked about how to properly update more than one row at a time, under the title Batch Update. We did consider performances, including network round trips, and did look at the behavior of our results when used concurrently.

A case where we want to apply the previous article approach is when replicating data with a trigger based solution, such as SkyTools and londiste. Well, maybe not in all cases, we need to have a amount of
UPDATEtrafic worthy of setting up the solution. As soon as we know we're getting to replay important enough batches of events, though, certainly using the batch update tricks makes sense.It so happens that
londiste 3includes the capability to use handlers. Those are plugins written in python (like all the client side code from SkyTools) whose job is to handle the processing of the event batches. Several of them are included in the londiste sources, and one of them is namedbulk.py.Bulk loading data with londiste
To use set in
londiste.ini:handler_modules = londiste.handlers.bulkthen add table with one of those commands:
londiste3 add-table xx --handler="bulk" londiste3 add-table xx --handler="bulk(method=X)"
The default method is
0, and the available methods are the following:correct (
0)- inserts as
COPYinto table - update as
COPYinto temp table and singleUPDATEfrom there - delete as
COPYinto temp table and singleDELETEfrom there
delete (
1)- as correct, but update are done as
DELETEthenCOPY
merged
(2)- as delete, but merge insert rows with update rows
Conclusion

Yes, by using that handler which is provided by default in londiste, you will apply the previous article tricks in your replication solution. And you can even choose to use that for only some of the tables you are replicating.
- inserts as
-
Batch Update
Performance consulting involves some tricks that you have to teach over and over again. One of them is that SQL tends to be so much better at dealing with plenty of rows in a single statement when compared to running as many statements, each one against a single row.

Another kind of Batch to update
So when you need to
UPDATEa bunch of rows from a given source, remember that you can actually use aJOINin the update statement. Either the source of data is already in the database, in which case it's as simple as using theFROMclause in the update statement, or it's not, and we're getting back to that in a minute.UPDATE FROMIt's all about using that
FROMclause in an update statement, right?UPDATE target t SET counter = t.counter + s.counter, FROM source s WHERE t.id = s.idUsing that, you can actually update thousands of rows in our target table in a single statement, and you can't really get faster than that.
Preparing the Batch
Now, if you happen to have the source data in your application process' memory, the previous bits is not doing you any good, you think. Well, the trick is that pushing your in-memory data into the database and then joining against the now local source of data is generally faster than looping in the application and having to do a whole network round trip per row.

What about that round trip?
Let's see how it goes:
CREATE TEMP TABLE source(LIKE target INCLUDING ALL) ON COMMIT DROP; COPY source FROM STDIN; UPDATE target t SET counter = t.counter + s.counter, FROM source s WHERE t.id = s.id
As we're talking about performances, the trick here is to use the COPY protocol to fill in the temporary table we just create to hold our data. So we're now sending the whole data set in a temporary location in the database, then using that as the
UPDATEsource. And that's way faster than doing a separateUPDATEstatement per row in your batch, even for small batches.Also, rather than using the SQL
COPYcommand, you might want to look up the docs of the PostgreSQL driver you are currently using in your application, it certainly includes some higher level facilities to deal with pushing the data into the streaming protocol.Insert or Update
And now sometime some of the rows in the batch have to be updated while some others are new and must be inserted. How do you do that? Well, PostgreSQL 9.1 brings on the table
WITHsupport for all DML queries, which means that you can do the following just fine:WITH upd AS ( UPDATE target t SET counter = t.counter + s.counter, FROM source s WHERE t.id = s.id RETURNING s.id ) INSERT INTO target(id, counter) SELECT id, sum(counter) FROM source s LEFT JOIN upd USING(id) WHERE t.id IS NULL GROUP BY s.id RETURNING t.id
That query here is updating all the rows that are known in both the target and the source and returns what we took from the source in the operation, so that we can do an anti-join in the next step of the query, where we're inserting any row that was not taken care of in the update part of the statement.
Note that when the batch gets to bigger size it's usually better to join against the target table in the
INSERTstatement, because that will have an index on the join key.Concurrency patterns
Now, you will tell me that we just solved the
UPSERTproblem. Well what happens if more than one transaction is trying to do theWITH (UPDATE) INSERTdance at the same time? It's a single statement, so it's a single snapshot. What can go wrong?
Concurrent processing
What happens is that as soon as the concurrent sources contain some data for the same primary key, you get a duplicate key error on the insert. As both the transactions are concurrent, they are seeing the same target table where the new data does not exists, and both will conclude that they need to
INSERTthe new data into the target table.There are two things that you can do to avoid the problem. The first thing is to make it so that you're doing only one batch update at any time, by architecting your application around that constraint. That's the most effective way around the problem, but not the most practical.
The other thing you can do, is force the concurrent transactions to serialize one after the other, using an explicit locking statement:
LOCK TABLE target IN SHARE ROW EXCLUSIVE MODE;
That lock level is not automatically acquired by any PostgreSQL command, so the only way it helps you is when you're doing that for every transaction you want to serialize. When you know you're not at risk (that is, when not playing the insert or update dance), you can omit taking that lock.
Conclusion
The SQL language has its quirks, that's true. It's been made for efficient data processing, and with recent enough PostgreSQL releases you even have some advanced pipelining facilities included in the language. Properly learning how to make the most out of that old component of your programming stack still makes a lot of sense today!
-
Emacs Conference
The Emacs Conference is happening, it's real, and it will take place at the end of this month in London. Check it out, and register at Emacs Conference Event Brite. It's free and there's still some availability.

It's all about Emacs, and it rocks!
We have a great line-up for this conference, which makes me proud to be able to be there. If you've ever been paying attention when using Emacs then you've already heard those names: Sacha Chua is frequently blogging about how she manages to improve her workflow thanks to Emacs Lisp, John Wiegley is a proficient Emacs contributor maybe best known for his ledger Emacs Mode, then we have Luke Gorrie who hacked up SLIME among other things, we also have Nic Ferrier who is starting a revolution in how to use Emacs Lisp with elnode. And more! Including Steve Yegge!
See you there in London.
-
HyperLogLog Unions
In the article from yesterday we talked about PostgreSQL HyperLogLog with some details. The real magic of that extension has been skimmed over though, and needs another very small article all by itself, in case you missed it.

Which Set Operation do you want for counting unique values?
The first query here has the default level of magic in it, really. What happens is that each time we do an update of the HyperLogLog hash value, we update some data which are allowing us to compute its cardinality.
=> select date, #users as daily, pg_column_size(users) as bytes from daily_uniques order by date; date | daily | bytes ------------+------------------+------- 2013-02-22 | 401676.779509985 | 1287 2013-02-23 | 660187.271908359 | 1287 2013-02-24 | 869980.029947449 | 1287 2013-02-25 | 580865.296677817 | 1287 2013-02-26 | 240569.492722719 | 1287 (5 rows)
And has advertized the data is kept in a static sized data structure. The magic here all happens at
hll_add()time, the function you have to call to update the data.Now on to something way more magic!

Are those the aggregates you're looking for?
=> select to_char(date, 'YYYY/MM') as month, round(#hll_union_agg(users)) as monthly from daily_uniques group by 1; month | monthly ---------+--------- 2013/02 | 1960380 (1 row)
The HyperLogLog data structure is allowing the implementation of an union algorithm that will be able to compute how many unique values you happen to have registered in both one day and the next. Extended in its general form, and doing SQL, what you get is an aggregate that you can use in
GROUP BYconstructs and window functions. Did you read about them yet? -
PostgreSQL HyperLogLog
If you've been following along at home the newer statistics developments, you might have heard about this new State of The Art Cardinality Estimation Algorithm called HyperLogLog. This technique is now available for PostgreSQL in the extension postgresql-hll available at https://github.com/aggregateknowledge/postgresql-hll and soon to be in
debian.
How to Compute Cardinality?
Installing postgresql-hll
It's as simple as
CREATE EXTENSION hll;really, even if to get there you must have installed the package on your system. We did some packaging work fordebianand the result should appear soon in a distro near you.Then you also need to keep your data in some table, straight from the documentation we can use that schema:
-- Create the destination table CREATE TABLE daily_uniques ( DATE DATE UNIQUE, users hll );
Then to add some data for which you want to know the cardinality of, it's as simple as in the following
UPDATEstatement:UPDATE daily_uniques SET users = hll_add(users, hll_hash_text('123.123.123.123')) WHERE date = current_date;
So in our example what you see is that we want to decipher how many unique IP addresses we saw, and we do that by first creating a hash of that source data then calling
hll_add()with the current value and the hash result.The current value must be initialized using
hll_empty().Concurrency
The most awake readers among you have already spotted that: using an
UPDATEon the same row over and over again is a good recipe to kill any form of concurrency, so you don't want to do that on your production setup unless you don't care about thoseUPDATE waitingpiling up in your system.The idea is then to fill-in a queue of updates and asynchronously update the
daily_uniquestable from that queue, possibly using thehll_add_aggaggregate that the extension provides, so that you do only oneupdateper batch of values to process.∅: Empty Set and NULL

Yes there's a unicode entry for that, ∅
Now, what happens when the batch of new unique values you want to update from is itself empty? Well I would have expected
hll_add_aggover an empty set to return an emptyhllvalue, the same as returned byhll_empty(), but it turns out it's returningNULLinstead.And then
hll_add(users, NULL)will happily returnNULL. So the nextUPDATEis cancelling all the previous work, which is not nice. We had to cater for that case explicitely in theUPDATEquery that's working from the batch of new values to add to our current HyperLogLog hash entry, and I can't resist to show off one of the most awesome PostgreSQL features here: writable CTE.WITH hll(agg) AS ( SELECT hll_add_agg(hll_hash_text(value)) FROM new_batch ) UPDATE daily_uniques SET users = CASE WHEN hll.agg IS NULL THEN users ELSE hll_union(users, hll.agg) END FROM hll WHERE date = current_date;
That's how you protect against an empty set being turned into a
NULL. I think the real fix would need to be included inpostgresql-hllitself, in making it so that thehll_add_aggaggregate returnshll_empty()on an empty set, and I will report that bug (with that very article as the detailed explanation of it).Using postgresql-hll
When using
postgresql-hllon the production system, we were able to get some good looking numbers from ourdaily_uniquestable:with stats as ( select date, #users as daily, #hll_union_agg(users) over() as total from daily_uniques ) select date, round(daily) as daily, round((daily/total*100)::numeric, 2) as percent from stats order by date; date | daily | percent ------------+--------+--------- 2013-02-22 | 401677 | 25.19 2013-02-23 | 660187 | 41.41 2013-02-24 | 869980 | 54.56 2013-02-25 | 154996 | 9.72 (4 rows)
I coulnd't resist to show off two of my favorite SQL constructs in that example query here, which are the Common Table Expressions (or CTE) and window functions. If that
over()clause reads strange to you, take a minute now and go read about it. Yes, do that now, we're waiting.The data here is showing that we did setup the facility in the middle of the first day, and that the morning's activity is quite low.
Conclusion

When using
postgresql-hllyou need to be careful not to kill your application concurrency abilities, and you need to protect yourself against the ∅ killer too. The other thing to keep in mind is that the numbers you get out of thehlltechnique are estimates within a given precision, and you might want to read some more about what it means for your intended usage of the feature. -
Playing with pgloader
While making progress with both Event Triggers and Extension Templates, I needed to make a little break. My current keeping sane mental exercise seems to mainly involve using Common Lisp, a programming language that ships with about all the building blocks you need.

Yes, that old language brings so much on the table
When using Common Lisp, you have an awesome interactive development environment where you can redefine function and objects while testing them. That means you don't have to quit the interpreter, reload the new version of the code and put the interactive test case together all over again after a change. Just evaluate the change in the interactive environement: functions are compiled incrementally over their previous definition, objects whose classes have changed are migrated live.
See, I just said objects and classes. Common Lisp comes with some advanced Object Oriented Programming facilities named CLOS and MOP where the Java and Python and C++ object models are just a subset of what you're being offered. Hint, those don't have Multiple Dispatch.
And you have a very sophisticated Condition System where Exceptions are just a subset of what you can do (hint: have a look a restarts and tell me you didn't wish your programming language of choice had them). And it continues that way for about any basic building bloc you might want to be using.
Loading data
Back to pgloader will you tell me. Right. I've been spending a couple of evening on hacking on the new version of pgloader in Common Lisp, and wanted to share some preliminary results.

Playing with the loader
The current status of the new pgloader still is pretty rough, if you're not used to develop in Common Lisp you might not find it ready for use yet. I'm still working on the internal APIs and trying to make something clean and easy to use for a developer, and then I will provide some external ways to play with it, user oriented. I missed that step once with the Python based version of the tool, I don't want to do the same errors again this time.
So here's a test run with the current pgloader, on a small enough data set of
226 MBofCSVfiles.time python pgloader.py -R.. --summary -Tc ../pgloader.dbname.conf Table name | duration | size | copy rows | errors ==================================================================== aaaaaaaaaa_aaaa | 2.148s | - | 24595 | 0 bbbbbbbbbb_bbbb...| 0.609s | - | 326 | 0 cccccccccc_cccc...| 2.868s | - | 25126 | 0 dddddddddd_dddd...| 0.638s | - | 8 | 0 eeeeeeeeee_eeee...| 2.874s | - | 36825 | 0 ffffffffff_ffffff | 0.667s | - | 624 | 0 gggggggggg_gggg...| 0.847s | - | 5638 | 0 hhh_hhhhhhh | 9.907s | - | 120159 | 0 iii_iiiiiiiiiiiii | 0.574s | - | 661 | 0 jjjjjjj | 6.647s | - | 30027 | 0 kkk_kkkkkkkkk | 0.439s | - | 12 | 0 lll_llllll | 0.308s | - | 4 | 0 mmmm_mmm | 2.139s | - | 29669 | 0 nnnn_nnnnnn | 8.555s | - | 100197 | 0 oooo_ooooo | 13.781s | - | 93555 | 0 pppp_ppppppp | 8.275s | - | 76457 | 0 qqqq_qqqqqqqqqqqq | 8.568s | - | 126159 | 0 ==================================================================== Total | 01m09.902s | - | 670042 | 0
Streaming data
With the new code in Common Lisp, I could benefit from real multi threading and higher level abstraction to make it easy to use: lparallel is a lib providing exactly what I need here, with workers and queues to communicate data in between them.
What I'm doing is that two threads are separated, one is reading the data from either a
CSVfile or a MySQL database directly, and pushing that data in the queue; while the other thread is pulling data from the queue and writing it into our PostgreSQL database.CL-USER> (pgloader.csv:import-database "dbname" :csv-path-root "/path/to/csv/" :separator #\Tab :quote #\" :escape "\"\"" :null-as ":null:") table name read imported errors time ------------------------------ --------- --------- --------- --------- aaaaaaaaaa_aaaa 24595 24595 0 0.995s bbbbbbbbbb_bbbbbbbbb 326 326 0 0.570s cccccccccc_cccccccccccc 25126 25126 0 1.461s dddddddddd_dddddddddd_dd 8 8 0 0.650s eeeeeeeeee_eeeeeeeeee_eeeeeeee 36825 36825 0 1.664s ffffffffff_ffffff 624 624 0 0.707s gggggggggg_ggggg_gggggggg 5638 5638 0 0.655s hhh_hhhhhhh 120159 120159 0 3.415s iii_iiiiiiiiiiiii 661 661 0 0.420s jjjjjjj 30027 30027 0 2.743s kkk_kkkkkkkkk 12 12 0 0.327s lll_llllll 4 4 0 0.315s mmmm_mmm 29669 29669 0 1.182s nnnn_nnnnnn 100197 100197 0 2.206s oooo_ooooo 93555 93555 0 9.683s pppp_ppppppp 76457 76457 0 5.349s qqqq_qqqqqqqqqqqq 126159 126159 0 2.495s ------------------------------ --------- --------- --------- --------- Total import time 670042 670042 0 34.836s NIL
As you can see the control is still made for interactive developer usage, which is fine for now but will have to change down the road, when the APIs stabilize.
Now, let's compare to reading directly from MySQL:
CL-USER> (pgloader.mysql:stream-database "dbname") table name read imported errors time ------------------------------ --------- --------- --------- --------- aaaaaaaaaa_aaaa 24595 24595 0 0.887s bbbbbbbbbb_bbbbbbbbb 326 326 0 0.617s cccccccccc_cccccccccccc 25126 25126 0 1.497s dddddddddd_dddddddddd_dd 8 8 0 0.582s eeeeeeeeee_eeeeeeeeee_eeeeeeee 36825 36825 0 1.697s ffffffffff_ffffff 624 624 0 0.748s gggggggggg_ggggg_gggggggg 5638 5638 0 0.923s hhh_hhhhhhh 120159 120159 0 3.525s iii_iiiiiiiiiiiii 661 661 0 0.449s jjjjjjj 30027 30027 0 2.546s kkk_kkkkkkkkk 12 12 0 0.330s lll_llllll 4 4 0 0.323s mmmm_mmm 29669 29669 0 1.227s nnnn_nnnnnn 100197 100197 0 2.489s oooo_ooooo 93555 93555 0 9.148s pppp_ppppppp 76457 76457 0 6.713s qqqq_qqqqqqqqqqqq 126159 126159 0 4.571s ------------------------------ --------- --------- --------- --------- Total streaming time 670042 670042 0 38.272s NILThe streaming here is a tad slower than the importing from files. Now if you want to be fair when comparing those, you would have to take into account the time it takes to export the data out from its source. When doing that export/import dance, a quick test shows a timing of
1m4.745s. Now, if we do an export only test, it runs in31.822s. So yes streaming is a good thing to have here.Conclusion
We just got twice as fast as the python version.
Some will say that I'm not comparing fairly to the Python version of pgloader here, because I could have implemented the streaming facility in Python too. Well actually I did, the option are called section_threads and split_file_reading, that you can set so that a reader is pushing data into a set of queues and several workers are feeding each from its own queue. It didn't help with performances at all. Once again, read about the infamous Global Interpreter Lock to understand why not.

So actually it's a fair comparison here where the new code is twice as fast as the previous one, with only some hours of hacking and before spending any time on optimisation. Well, apart from using a producer, a consumer and a queue, which I almost had to have for streaming in between two database connections anyways.
-
Live Upgrading PGQ
Some skytools related new today, it's been a while. For those who where at my FOSDEM's talk about Implementing High Availability you might have heard that I really like working with PGQ. A new version has been released a while ago, and the most recent verion is now
3.1.3, as announced in the Skytools 3.1.3 email.
Upgrade time!
Skytools 3.1.3 enters debian
First news is that Skytools 3.1.3 has been entering debian today (I hope that by the time you reach that URL, it's been updated to show information according to the news here, but I might be early). As there's current a debian freeze to release wheezy (and you can help squash some bugs), this version is only getting uploaded to experimental for now. Thanks to the tireless work of Christoph Berg though, this version is already available from apt.postgresql.org.
Upgrading to PGQ 3
The other news is that I've been testing live upgrade scenario where we want to upgrade from
PGQtoPGQ3, and it works pretty well, and it's quite simple to achieve too. Here's how.So the first thing is to shut down the current ticker process. Then we install the new packages, assuming that you did follow the step in the wiki pointed above, please go read apt.postgresql.org again now if needs be.
pgqadm.py ticker.ini -s sudo apt-get install postgresql-9.1-pgq3 skytools3-ticker skytools3
The ticker is not running anymore, we have the right version of the software installed. Next step is to upgrade the database parts of PGQ:
psql -f /usr/share/skytools3/pgq.upgrade_2.1_to_3.0.sql ... psql -1 -f /usr/share/postgresql/9.1/contrib/pgq.upgrade.sql ...
Of course replace those
...with options such as your actual connection string. I tend to always add-vON_ERROR_STOP=1to all these commands, so that I don't depend on having the right.psqlrcon the particular server I'm connected to. Also remember that if you want to do that for more than one database, you need to actually run that pair of commands for each of them.Now it's time to restart the new ticker. The main changes from the previous one is that it is now a
Cprogram calledpgqdthat knows how to tick for any number of databases, so that you only have to have one instance around per cluster now.sudo /etc/init.d/skytools3 start tail -f /var/log/skytools/pgqd.log
Those two commands are taking for granted that you did prepare the
pgqdsetup the debian and skytools way, by adding your config in/etc/skytools3/pgqd.iniand editing/etc/skytools.iniaccordingly, so that it's automatically taken into account at machine boot.Note that I did actually exercised the procedure above while running a pgbench test replicated with
londiste. Of course the replication has been lagging a little while no ticker was running, and then it catched-up as fast as it could, in that case:INFO {count: 245673, ignored: 0, duration: 422.104366064}Happy Hacking!
So if you have any batch processing needs, remember to consider what PGQ has to offer. And yes if you're running some cron job to compute things out of the database for you, you are doing some batch processing.

Yes, I did search for Transactional Batch Processing
- Page 1 of 33 ( 330 articles )
- >>
