PostgreSQL - The story so far
PostgreSQL (or Postgres) began its life in 1986 as POSTGRES, a research project of the University of California at Berkeley, led by highly influential database researcher Michael Stonebraker. In the design of POSTGRES, Stonebraker’s team sought to improve upon INGRES, an earlier prototype project also led by Stonebraker, principally through the support of user defined types (or “domains”) with arbitrarily complex business rules and other object-relational concepts.
Stonebraker’s team actively developed POSTGRES for eight years, and developed features including rules, procedures and extensible types with indexes. POSTGRES was later commercialised as Illustra, which was later acquired by Informix and integrated into its Universal Server. In 2001, IBM acquired Informix at a cost of $1 Billion.
POSTGRES used its own query language, POSTQUEL. While theoretically superior to the dominant SQL, with a greater depth of expression due to its more advanced theoretical underpinnings, in practice POSTQUEL was not aligned with the needs of industry, which had already standardised on SQL. For that reason, in 1995 two Ph.D. students from Stonebraker’s lab, Andrew Yu and Jolly Chen, replaced POSTQUEL with an extended subset of SQL. POSTGRES was renamed Postgres95.
In 1996, the project garnered significant interest from outside of academia. It became clear that the name Postgres95 would not age well, and the project was renamed PostgreSQL. The PostgreSQL global development group, an international affiliation of database developers principally working within industry was born, and assumed control of the Postgres codebase. PostgreSQL started at version 6, for consistency with the Berkeley versioning, in a nod to the significant contribution made by Stonebraker’s team.
During this initial 6.* era of open source development, many of the features that have come to define PostgreSQL were developed, including:
- Multiversion Concurrency Control. Table level locking was replaced by MVCC, a sophisticated system that prevents readers from blocking writers and writers from blocking readers. MVCC was popularized by Oracle in the early 1980's, and its use in the free PostgreSQL helped improve adoption of the technique in the many other database systems that now support it.
- Major speed enhancements were made. While the project has historically prioritised data integrity and reliability, there were significant gains in performance.
- Improved built-in types, including sophisticated date/time types and support for advanced geometric types.
The approximately 4 or so years and 5 corresponding major versions (7.0 - 7.4) that marked the 7.* era brought many more watersheds. These included:
- Most notably, an initial implementation of WAL (write-ahead logging). WAL is a family of techniques for providing atomicity and durability in database systems. WAL segments are written to disk that describe all changes to the database, in advance of actually having those changes applied.
- OUTER JOINs
- TOAST, a technique for storing larger datums compressed and out-of-line, so that a database table could, for example, be used to store large passages of text efficiently.
- Many procedural languages, including PL/PGSQL, based on Oracle database’s PL/SQL
If the 7.* line was marked by usability enhancements and advanced developer-focused features that by many measures outstripped all of the proprietary database vendors, it was the 8.* line, lasting from 2004 - 2009, that brought features that were previously assumed to be the exclusive purview of two or three megacorporations. WAL - the write-ahead log - was an architecturally integral part of many of these features, particularly clustering and high-availability features. It was and is a continual focus of 2ndQuadrant’s open source contributions to PostgreSQL. The 8.* line is also notable for marking the founding of 2ndQuadrant, which, from the outset and beginning in the 8.0 release, was heavily involved in the PostgreSQL development process.
A talk at PgCon 2011 (Picture: L. Sietinsone)
Initially, the 8.0 release brought an important feature: Savepoints, or sub-transactions, whereby an atomic piece of work (that is, a transaction) can be subdivided into smaller pieces of work that can individually abort atomically, without necessarily aborting the entire enclosing transaction. Savepoints can even be nested arbitrarily. Perhaps an even more important feature introduced in that release was Point-in-Time Recovery, a technique through which it is possible to provide continuous backup of the server, or to recover to some point in the past immediately prior to a failure. It also brought the long-awaited native Windows port.
The 8.1 release in 2005 brought another enterprise-class feature from 2ndQuadrant: Table partitioning through the use of “constraint exclusion”, which is a feature that allows the planner to elide a scan on child tables of a parent table in an inheritance hierarchy, when it can determine that such a scan could not return rows for the query in question.
The 8.2 release in 2006 consolidated the gains of the prior two releases, with much-improved instrumentation of WAL information and various crash recovery features. 8.2 was also notable for being the release that was forked by Greenplum as a basis for their proprietary database product, to serve the niche data warehousing market. Just 4 years later, EMC acquired Greenplum and their eponymous data warehousing solution. While the value of the all-cash acquisition was not made public, Greenplum had reportedly raised about $61 million in funding prior to the acquisition, based on which the technology press independently estimated the value of the deal to be between $100 million and $150 million.
2008’s 8.3 release was a performance watershed. It brought a number of major performance features, including:
- HOT (Heap-only Tuples), an optimisation that allows for the re-claiming of dead tuples not only by the usual VACUUM process, but also at the time of an INSERT or UPDATE provided no changes occur to indexed columns. Apart from considerably improving performance, this optimisation also makes performance more consistent.
- Asynchronous commit, a feature introduced by 2ndQuadrant CTO Simon Riggs to allow transactions to commit asynchronously for performance reasons, which can be used by applications where that trade-off makes sense.
- Re-write of the background writer by PostgreSQL Unites States principal consultant and Postgres performance expert Gregory Smith, to introduce a new just-in-time strategy that improved disk-write efficiency considerably.
- A 2ndQuadrant-led Improvement of the cache eviction strategy, so that large sequential scans no longer unduly force frequently used cached pages out of the buffer cache to the same extent.
- pg_standby module, written by 2ndQuadrant, allows for easy management of warm standbys.
Version 8.4, the final from the 8.* line was released in 2009. It brought further improvements in usability, developer-focused features and performance including:
- More advanced SQL features, such as Windowing functions and Common Table Expressions. Prior to this release, these features were exclusively available in a small number of proprietary database systems.
- Parallel restore: Restore the database from a logical backup in parallel.
- The visibility map, which greatly reduces vacuum overhead for slowly-changing tables.
- The on-disk free space map, which simplified the management of the free space map to the point that it’s something that never has to be considered by end users.
The current 9.* line represents a turning point for the PostgreSQL community, for a number of reasons; principally, in the mind of most users, for 9.0’s introduction of simple, out of the box streaming binary replication. This was a feature principally brought about by the work of 2ndQuadrant, as a logical adjunct of the company’s prior community work on WAL related features, specifically, 2ndQuadrant CTO Simon Riggs’ introduction of the Hot Standby feature.
Simon attends the PostgreSQL developer's meeting 2011 (Picture: O. Bartunov)
Another reason that the 9.* line represents a turning point is because with it, PostgreSQL has gone from being often defined by comparisons with proprietary RDBMSs to being markedly on the cutting edge, surpassing all contenders in a number of important areas. Perhaps most strikingly, 2ndQuadrant’s introduction of synchronous replication in PostgreSQL 9.1 offers something of immense value: zero data-loss replication. In a world-first, users can even control the durability of each transaction, and all durability levels can co-exist in a single application.
The 9.* line also brings a number of world-firsts not directly related to WAL. These include the innovative Serializable Snapshot Isolation feature, based on a recent, well-received academic paper, through which it is possible to achieve the true serializability offered by predicate locking, without causing any blocking to transactions, and thus without any additional deadlocks. It also includes exclusion constraints, through which it is possible to solve “the double-booking problem” (to prevent double-booking of a conference room by having overlapping sessions in the same room), enforced in a similar fashion to a unique constraint without the use of performance killing explicit heavyweight locks. A further example of a world-first innovative feature, available as of PostgreSQL 9.1, is Writeable common table expressions, introduced by 2ndQuadrant consultant Marko Tiikkaja. This non-standard extension to standard SQL Common Table Expressions allows WITH clauses to contain subqueries that not only SELECT, but INSERT, UPDATE and DELETE. This facilitates various useful patterns, such as easily moving data (DELETEing it from one table and INSERTing it into another) within a single query/snapshot.