One of the most frustrating parts of an Oracle to PostgreSQL migration is the fact that both products use the same terminology, but with vastly different meanings. It is very easy to get caught up in a terminology misunderstanding and spend quite some time arguing across one another, only to find out that there wasn’t any disagreement at all. This rather strange argument becomes even more confusing as other add-on technologies are involved. Seeking to reproduce the same functionality across the two systems by adding components can make the transition easier, but strangely makes the dialogue harder.
This article seeks to describe the two architectures in somewhat comparative terms. Along with that comes the understanding that the comparison cannot be made exactly, as the architectures are fundamentally different at the implementation level. At least we will seek to alleviate the largest misunderstandings, and possibly give our new novice PostgreSQL DBA some basic images in their head to think with.
The confusion starts nice and early in the process. Let’s talk about “instances”. In Oracle terms, a system consists of a host (hardware/virtual machine) that has a set of database processes providing access to data. This one-to-one-to-one relationship of hardware to processes to databases encourages the Oracle administrator to speak about the instance at 3 levels fairly glibly. This could mean the hardware, the virtual machine, the processes serving the data, or the data itself as if this were all one big happy installation system. There is such a thing as a shared server configuration, but for the sake of this article, we are going to skip that additional complexity for the sake of brevity.
In the PostgreSQL environment, a host (hardware/virtual machine) may have any number of instances (processes serving databases), which may in turn have multiple databases (directories with data). The PostgreSQL DBA must differentiate between the hardware providing the platform, the instances providing service, and the data. This is a one-to-many-to-many-to-many relationship. In rough terms you may look at the relationship thusly:
|Instance or database
|Host or Virtual Machine
|Instance or database
|Instance or database
|Data Directory or Cluster
|Instance or database
And this brings us to our next disambiguation. The word “cluster”. Unfortunately, the word cluster in PostgreSQL is not unique to any given concept in the PostgreSQL architecture. It can mean “data ordered by an index”, “a collection of databases”, “the act of reorganizing the table data according to the index”, and “a collection of services participating in replication”. In this context, we are referring to a collection of databases. This collection of databases starts at a top level directory. In PostgreSQL terms, the “data directory”. Further directories under this top level are fully complete and separate databases. So, we find out that a single master process (commonly referred to as the “postmaster”) serves any number of user data directories “databases”.
Of course the diagram above is an oversimplification of the Oracle vernacular. It is presented in this form to make the point that the common language is used very loosely and that this habit causes PostgreSQL users to constantly interrupt Oracle users for disambiguation.
The next level of the architecture provides for some more confusion. When we look at the service layer (“instance” in both vernaculars), we see that the processes that make up the service delivery are again similar, as they solve similar conceptual problems. Unfortunately, again, they are not the same, as they solve these problems using different implementations.
|Flush to disk
|External to PG
|Fulfill User requests
|session or backend or planner
- Online Table Shrink and Online Table Redefinition turned on with automatic space management turned on.
- Statistics Refresh invoked after Table Redefinition on Oracle in the same process
- Health check and monitoring processes are left to external utilities like Nagios, Zabbix, Cacti, Munin, Data Dog, etc.
- “Recovery” is a mode of the Postmaster process on startup that replays journal files to catch up with source systems.
- “Session” is a software term that refers to a transaction envelope. “Backend” is a system term for a process, and “planner” is a euphemism for the PostgreSQL query engine. To be a bit more exacting, “Session” is the proper direct translation.
This list of caveats could be infinitely expanded. And these equivalents are not exact by any means. In some cases, the Oracle process may map to several PostgreSQL processes, and vice versa. This diagram should at least serve as a reference for where to get further information on the same general concepts.
The databases consist of file components that provide the persistence of the data. These file components map roughly like this:
|contained in tables
|subdirectories of data dir
|files/dirs in database directory
- The only process controlled by file semaphore for PostgreSQL 12+ is crash recovery. recovery.signal will process WAL, standby.signal will allow queries while in crash recovery mode, and promote.signal instructs PostgreSQL to exit recovery mode and proceed to accept write connections.
- Again, these are rough equivalents. The concept here is that you will begin to understand where to look for PostgreSQL data on disk.
- Within the database, the structure of the data is completely different from the Oracle implementation. Any equivalency at this level becomes misleading at best and is better studied as a separate concept, rather than attempt to cross train.
A quick word about “tablespaces” here. They do not have anywhere near the significance in PostgreSQL that they do in Oracle. In PostgreSQL, tablespaces are just a glorified symbolic link to a directory. That’s it. The only reason to maintain them in the database is so that PostgreSQL can move the data appropriately across containers. This could be useful to put critical data such as indexes or lookup tables on faster storage, for example.
PostgreSQL does not heavily rely on these tablespaces. The common Oracle technique of preemptively defining tablespaces for each object is not a best practice for the PostgreSQL community.
PostgreSQL does not move old rows out of the table data. The old version of the row is left in the pages and leaves for a background cleanup process to deal with. The cleanup strategy necessitates the “autovacuum” process, which kicks off occasionally and marks the old rows as recoverable space. The reclaimed space is then reused by subsequent write transactions such as INSERT and UPDATE.
This strategy for cleanup leaves PostgreSQL without the basic functionality of the REDO log. Current development is underway to remedy this situation. At the time of this writing, PostgreSQL 12.3 is current, and there is hope that this feature will appear in a coming major release.
PostgreSQL 13 will ship in the future with a glossary of terms. Hopefully, as this glossary is expanded, many of the misconceptions surrounding architectural misunderstandings can be alleviated.
There is also a basic description of some of the major processes in an article on the main PostgreSQL website.