Lexical Scanning in PostgreSQL

August 20, 2020 / by John Naylor

As in the parable of the Blind Men and an Elephant, Postgres code can look like many different applications of computer science depending on where you look: operating systems, transactions, distributed computing, linear programming, information storage and retrieval, and so on. In a number of places it also looks like a compiler front end.

One might think that lexical scanning (i.e. turning a text stream into a sequence of tokens to be fed into the parser) is a straightforward problem solved many years ago, but in Postgres it presents plenty of interesting challenges, from both an engineering and an organizational point of view.

For one, the Postgres codebase contains nine ".y" Bison files and eleven ".l" Flex files, covering a multitude of languages and formats, including replication commands, JSON Path expressions, the bootstrap commands used by initdb, and the postgresql.conf configuration file. That list doesn't even count features with hand-written parsers and scanners: JSON, pg_hba.conf, and others.

Even considering only the SQL language, there are no fewer than four areas of functionality that must scan SQL in different contexts:

  1. Ordinary SQL queries
  2. PL/pgSQL functions containing SQL
  3. Detecting the end of a SQL statement within the psql client
  4. Preprocessing C/C++ language files containing embedded SQL (called ECPG in Postgres)

SQL and PL/pgSQL have shared the same (so-called "core") scanner since PG 9.0, but psql and ECPG have their own scanners, which must match the core scanner in recognition ability, leading to duplicate code that must be maintained separately.

For another, SQL is a huge and still-evolving language with a galaxy of corner cases and historical baggage. The SQL standards committee doesn't provide a reference implementation, so it's anybody's guess whether some new syntax can be parsed without ugly hacks. It doesn't help that Postgres' dialect goes above and beyond the standard, with many extensions and relaxed restrictions that make the task even more difficult.

To pick an example relevant to the topic of scanning, this query must throw a syntax error:

select 'foo' 'bar';
ERROR:  syntax error at or near "'bar'"
LINE 1: select 'foo' 'bar';
                     ^

…but this one must concatenate the strings:

select 'foo'  -- comment here
    -- newline and more comments
'bar';
 ?column?
----------
 foobar
(1 row)

…but only SQL-style comments are allowed to split a string, so this must be a syntax error:

select 'foo'
/* C-style comment */
'bar';
ERROR:  syntax error at or near "'bar'"
LINE 3: 'bar';
        ^

Recent improvements

For PG 12, I authored a patch that changed the data structure the core scanner uses to recognize SQL keywords, making it more cache-friendly. This resulted in a 4% speed increase in a raw parsing microbenchmark (Flex + Bison, excluding further parse analysis). That is not terribly exciting in itself, but it paved the way for people smarter than me to replace the binary search with a perfect hash function, which upped the gain to a 20% speed increase over PG 11, and led to some interesting discussion on Hacker News.
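To give a flavor of the idea, here is a minimal sketch with invented names (not the actual Postgres code): rather than binary-searching an array of string pointers scattered through memory, the keywords are packed into one contiguous string with a compact offset array, so a lookup walks memory that stays cache-friendly:

#include <string.h>

/* All keywords concatenated in sorted order, NUL-separated. */
static const char kw_string[] =
    "abort\0" "add\0" "all\0" "alter\0" "and";

/* Offset of each keyword within kw_string. */
static const unsigned short kw_offsets[] = {0, 6, 10, 14, 20};

#define NUM_KEYWORDS 5

/* Binary search over the offsets; returns keyword index, or -1 if absent. */
static int
keyword_lookup(const char *word)
{
    int low = 0;
    int high = NUM_KEYWORDS - 1;

    while (low <= high)
    {
        int mid = (low + high) / 2;
        int cmp = strcmp(word, kw_string + kw_offsets[mid]);

        if (cmp == 0)
            return mid;
        else if (cmp < 0)
            high = mid - 1;
        else
            low = mid + 1;
    }
    return -1;
}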

During the PG 13 development cycle, I ran a series of experiments on the "-C" Flex option, which controls the degree of table compression and thereby allows speed-versus-size tradeoffs. According to the Flex manual, in the general case "-Cem" generates the smallest and slowest scanner, and "-CF" the largest and fastest. The core scanner has used "-CF" since version 7.3 based on this recommendation, but as far as I know that choice hadn't been revisited recently. CPU and memory architecture have changed quite a bit since 2002, so I was curious whether the recommendation was still valid. Perhaps unsurprisingly, "-CF" is still the fastest option.
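Repeating such an experiment is just a matter of regenerating the scanner with different flags (illustrative invocations; the real build runs Flex through the Postgres makefiles):

flex -CF  -o scan.c scan.l    # largest and fastest scanner, per the Flex manual
flex -Cem -o scan.c scan.l    # smallest and slowest scanner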

Back in version 8.1, a policy was added to Postgres requiring the core scanner to be free of backtracking; at the time it was noted that this made scanning one-third faster. I tested removing the Flex rules that are there simply to avoid backtracking (a sketch of what such rules look like appears below), and found two baffling facts:

First, the resulting binary was 164 kB smaller. The reason became apparent when looking at the declaration of the yy_transition array in the generated C file. Before:

struct yy_trans_info
{
    flex_int32_t yy_verify;
    flex_int32_t yy_nxt;
};
static yyconst struct yy_trans_info yy_transition[37045] = ...

…and after:

struct yy_trans_info
{
    flex_int16_t yy_verify;
    flex_int16_t yy_nxt;
};
static yyconst struct yy_trans_info yy_transition[31763] = ...

The data type used here depends on the size of the array: if the number of elements exceeds the maximum value of a 16-bit signed integer (32,767), Flex apparently must use 32-bit integers. The extra rules to avoid backtracking are enough to bump the array over that threshold, doubling the size of its elements.
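A back-of-the-envelope calculation agrees: 37,045 entries at 8 bytes each (two 32-bit fields) is about 290 kB, while 31,763 entries at 4 bytes each (two 16-bit fields) is about 124 kB, a difference of roughly 165 kB, accounting for nearly all of the observed savings.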

Second, when I measured raw parsing (remember, this includes both Flex and Bison), I saw either no difference or, in some cases, only a 3% regression when backtracking was allowed. Besides 15 years of changes in computer architecture, and possibly in Flex itself, one reason for the small difference is that the lion's share of raw parsing time on modern machines is spent in Bison, and profiles do show a lot of cache misses in the Bison code.
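For reference, here is a simplified sketch of what backup-avoiding rules look like, in Flex syntax (illustrative only, not the actual scan.l rules). The idea is to give Flex an explicit rule for every prefix on which a longer pattern can fail, handing the excess characters back with yyless():

[0-9]+[Ee][-+]?[0-9]+    { return FCONST; }                      /* 1e10, 1E+5 */
[0-9]+[Ee]               { yyless(yyleng - 1); return ICONST; }  /* bare "1e"  */
[0-9]+[Ee][-+]           { yyless(yyleng - 2); return ICONST; }  /* bare "1e+" */
[0-9]+                   { return ICONST; }

Without the two middle rules, input such as "1e+x" would partially match the first pattern and then force the scanner to back up to the plain-integer match; Flex's -b option reports such backing-up states.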

In any case, smaller data structures and a smaller binary are always good things to have, so with a bit of help from Tom Lane I authored a patch to reduce the number of state transitions in the core scanner. The point was to ensure the use of 16-bit integers as above while still remaining free of backtracking. This seemed to improve performance slightly, the binary shrank by over 200 kB, and the scanner code among core, psql, and ECPG now matches more closely, reducing the maintenance burden. This will be part of the upcoming PG 13 release.

Possible future work

A lot more could be done in the realm of parsing and scanning in Postgres. Some of the following are pie-in-the-sky projects with uncertain benefit, but still interesting to think about:

  • Use a token queue to enable scanning and parsing tokens in batches, making better use of the caches and branch predictor (a sketch of the idea follows this list).
  • Make it possible for Postgres extensions to add their own syntax. One could envision a GUC analogous to search_path, containing a list of shared libraries which implement parsers to attempt.
  • Replace Flex with scanners handwritten in C. This would reduce duplication of code and improve performance. This could be made easier by using a generated direct-coded scanner, such as that emitted by re2c, as a starting point. Handwritten scanners are actually quite common in other software projects (including databases), even those that use a parser generator such as Bison. Flex is picky, difficult to debug, and some Postgres constructs (e.g. user-defined operators) are clunky to express with it.
  • Rethink how to preprocess ECPG source files. Currently a perl script transforms the core grammar into the ECPG grammar as part of the build process. This requires a bespoke scanner that must recognize C plus special ECPG commands, as well as duplicate all the rules from the core scanner. It might be better if the two languages were treated separately, with two parsers that switched back and forth as needed. This is a research project.
  • Teach Bison to emit a direct-coded parser, as opposed to the current table-driven state machine. This would require a massive amount of work and expertise, but this would likely result in a large performance gain.
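As a sketch of the token-queue idea (hypothetical code with invented names, assuming a reentrant scanner whose yylex() takes a pointer to the semantic value):

#define TOKEN_BATCH 64

typedef struct
{
    int     type;   /* token code returned by the scanner */
    YYSTYPE value;  /* the token's semantic value */
} QueuedToken;

static QueuedToken queue[TOKEN_BATCH];
static int queue_len = 0;
static int queue_pos = 0;

/* Drop-in replacement for yylex() from the parser's point of view. */
static int
batched_lex(YYSTYPE *lvalp)
{
    if (queue_pos >= queue_len)
    {
        /* Refill the queue in one tight loop: staying inside the
         * scanner keeps its tables and code hot in cache. */
        queue_len = 0;
        queue_pos = 0;
        while (queue_len < TOKEN_BATCH)
        {
            int tok = yylex(&queue[queue_len].value);

            queue[queue_len].type = tok;
            queue_len++;
            if (tok == 0)       /* end of input */
                break;
        }
    }
    *lvalp = queue[queue_pos].value;
    return queue[queue_pos++].type;
}

The parser then drains tokens a batch at a time instead of bouncing between parser and scanner code on every token.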

As you can see, even in an area most users and even developers take for granted, there are plenty of avenues for those motivated to explore.

Takeaways

  • It can be fruitful to simply read source code and be curious about the current state of affairs.
  • It often pays to dig down to a lower abstraction layer. Even good tools and libraries shouldn’t be taken for granted as a black box. The more you learn about the details, the more effective you are as a user.
  • Performance measurements can change drastically across different hardware generations and contexts.