Unicode normalization in PostgreSQL 13

Peter Eisentraut
July 23, 2020

Unicode equivalence

Unicode is a complicated beast. One of its numerous peculiar features is that different sequences of codepoints can be equal. This is not the case in legacy encodings. In LATIN1, for instance, the only thing that is equal to ‘a’ is ‘a’ and the only thing that is equal to ‘ä’ is ‘ä’. In Unicode, however, characters with diacritic marks can often (depending on the particular character) be encoded in different ways: either as a precomposed character, as was done in legacy encodings such as LATIN1, or decomposed, consisting of the base character ‘a’ followed by the diacritic mark ◌̈ here. This is called canonical equivalence. The advantage of having both of these options is that you can, on the one hand, easily convert characters from legacy encodings and, on the other hand, don’t need to add every accent combination to Unicode as a separate character. But this scheme makes things more difficult for software using Unicode.
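
To make this concrete, here is a small illustration using PostgreSQL's Unicode escape string syntax (U&'...'). It assumes a database with UTF8 encoding; the column aliases are just for readability:

SELECT U&'\00E4' AS precomposed,           -- U+00E4: 'ä' as a single codepoint
       U&'\0061\0308' AS decomposed,       -- U+0061 'a' followed by U+0308 combining diaeresis
       length(U&'\00E4') AS len_pre,       -- 1
       length(U&'\0061\0308') AS len_dec;  -- 2

Both columns render as 'ä', but they are different codepoint sequences with different character lengths.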

As long as you are just looking at the resulting character, such as in a browser, you shouldn’t notice a difference and this doesn’t matter to you. However, in a database system where searching and sorting strings is fundamental and performance-critical functionality, things can get complicated.

First, the collation library in use needs to be aware of this. However, most system C libraries, including glibc, are not. So in glibc, when you look for ‘ä’, you won’t find ‘ä’. See what I did there? The second one is encoded differently but probably looks the same to you as you read this. (At least that is how I entered it; it might have been changed somewhere along the way to your browser.) Confusing. If you use ICU for collations, however, this works and is fully supported.

Second, when PostgreSQL compares strings for equality, it just compares the bytes, it does not take into consideration the possibility that the same string can be represented in different ways. This is technically wrong when using Unicode, but it’s a necessary performance optimization. To work around that, you can use nondeterministic collations, a feature introduced in PostgreSQL 12. A collation declared that way will not just compare the bytes but will do any necessary preprocessing to be able to compare or hash strings that might be encoded in different ways. Example:

CREATE COLLATION ndcoll (provider = icu, locale = 'und', deterministic = false);
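
As a sketch of the difference (assuming PostgreSQL 12 or later built with ICU support), using the two spellings of 'ä' from above:

SELECT U&'\00E4' = U&'\0061\0308';                 -- false: default bytewise comparison
SELECT U&'\00E4' = U&'\0061\0308' COLLATE ndcoll;  -- true: canonical equivalence respected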

Normalization forms

So while there are different valid ways to encode certain Unicode characters, it is sometimes useful to convert them all to a consistent form. This is called normalization. There are two normalization forms: fully composed, meaning we convert all codepoint sequences to precomposed characters as much as possible, and fully decomposed, meaning we convert all codepoints to their component pieces (letter plus accent) as much as possible. In Unicode terminology, these forms are known as NFC and NFD, respectively. There are some more details to this, such as putting all the combining characters into a canonical order, but that’s the general idea. The point is, when you convert a Unicode string into one of the normalization forms, then you can compare or hash them bytewise without having to worry about encoding variants. Which one you use doesn’t matter, as long as the whole system agrees on one.

In practice, most of the world uses NFC. And moreover, many systems are faulty in that they don’t handle non-NFC Unicode correctly, including most C libraries’ collation facilities, and even PostgreSQL by default, as mentioned above. So ensuring that all Unicode is converted to NFC is a good way to ensure better interoperability.

Normalization in PostgreSQL

PostgreSQL 13 now contains two new facilities to deal with Unicode normalization: a function to test for normalization, and one to convert to a normalization form. For example:

SELECT 'foo' IS NFC NORMALIZED;
SELECT 'foo' IS NFD NORMALIZED;
SELECT 'foo' IS NORMALIZED;  -- NFC is the default

SELECT NORMALIZE('foo', NFC);
SELECT NORMALIZE('foo', NFD);
SELECT NORMALIZE('foo');  -- NFC is the default

(The syntax is specified in the SQL standard.)
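
With a string that actually contains an accented character, the effect is easier to see. This sketch again uses the decomposed and precomposed spellings of 'ä':

SELECT U&'\0061\0308' IS NFC NORMALIZED;        -- false
SELECT U&'\0061\0308' IS NFD NORMALIZED;        -- true

SELECT length(normalize(U&'\0061\0308', NFC));  -- 1: composed into U+00E4
SELECT length(normalize(U&'\00E4', NFD));       -- 2: decomposed into 'a' + U+0308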

One option is to use this in a domain, for example:

CREATE DOMAIN norm_text AS text CHECK (VALUE IS NORMALIZED);
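
The domain then rejects any value that is not in NFC form. A minimal sketch of how that behaves, using a hypothetical messages table:

CREATE TABLE messages (body norm_text);

INSERT INTO messages VALUES (normalize(U&'\0061\0308'));  -- ok: normalized to NFC first
INSERT INTO messages VALUES (U&'\0061\0308');             -- error: violates the CHECK constraint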

Note that normalization of arbitrary text is not entirely cheap. So apply this sensibly and only where it really matters.

Note also that normalization is not closed under concatenation: appending two normalized strings does not always result in a normalized string. So even if you carefully apply these functions and otherwise check that your system only uses normalized strings, non-normalized ones can still creep in during legitimate operations. Simply assuming that non-normalized strings cannot occur will fail; this issue has to be dealt with properly.
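
To illustrate the non-closure (a lone combining mark is itself in NFC form, since there is no preceding base character it could compose with):

SELECT U&'\0061' IS NFC NORMALIZED;                  -- true: plain 'a'
SELECT U&'\0308' IS NFC NORMALIZED;                  -- true: lone combining diaeresis
SELECT (U&'\0061' || U&'\0308') IS NFC NORMALIZED;   -- false: the concatenation is decomposed 'ä'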

Compatibility characters

There is another use case for normalization. Unicode contains some alternative forms of letters and other characters, for various legacy and compatibility purposes. For example, you can write Fraktur:

SELECT '𝔰𝔬𝔪𝔢𝔫𝔞𝔪𝔢';

Now imagine your application assigns user names or other such identifiers, and there is a user named 'somename' and another one named '𝔰𝔬𝔪𝔢𝔫𝔞𝔪𝔢'. That would be confusing at best, and possibly a security risk. Exploiting such look-alikes is a common technique in phishing attacks, fake URLs, and similar scams. So Unicode contains two additional normalization forms that resolve these similarities and convert such alternative forms into a canonical base letter. These forms are called NFKC and NFKD. They are otherwise the same as NFC and NFD, respectively. For example:

=> select normalize('𝔰𝔬𝔪𝔢𝔫𝔞𝔪𝔢', nfkc);
 normalize
-----------
 somename

Again, using check constraints perhaps as part of a domain can be useful:

CREATE DOMAIN username AS text CHECK (VALUE IS NFKC NORMALIZED OR VALUE IS NFKD NORMALIZED);
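
A quick sketch of how such a domain behaves with the Fraktur example from above:

SELECT 'somename'::username;                           -- ok
SELECT '𝔰𝔬𝔪𝔢𝔫𝔞𝔪𝔢'::username;                           -- error: not NFKC/NFKD normalized
SELECT normalize('𝔰𝔬𝔪𝔢𝔫𝔞𝔪𝔢', NFKC)::username;          -- ok: normalizes to 'somename'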

(The actual normalization should probably be done in the user interface frontend.)

See also RFC 3454 for a treatment of strings to address such concerns.

Summary

Unicode equivalence issues are often ignored without consequence. In many contexts, most data is in NFC form, so no issues arise. However, ignoring these issues can lead to strange behavior, apparently missing data, and in some situations security risks. So awareness of these issues is important for database designers, and the tools described in this article can be used to deal with them.

Tags: PostgreSQL, postgresql 13, Unicode
3 replies

~flow says:
July 30, 2020 at 1:08 pm

> you can write faktur: SELECT ‘????????’;

Dude, that’s Fraktur, not faktur. Also, some part of your toolchain is not compatible with Unicode. Presumably allergic to anything beyond U+FFFF.

Peter Eisentraut says:
August 3, 2020 at 7:35 am

Fixed, thanks.

~flow says:
July 31, 2020 at 8:02 am

Thanks for the lively discussion. Your sample code is still just question marks.
