Beautiful things, strings.

Postgresql String Functions and Operators.  

This blog today is going to talk about strings: how they are stored, how they are input, and lots of examples of how to use string operators and functions in order to manipulate them. Strings, strings, strings.

What we are not going to cover is regular expressions, although we will use them. The Fine Manual has a very good description of the options on various types of regular expressions. Or you can google your own. Neither will we be covering text indexing–plain or pg_trgm, GiN or GiST or textsearch. These beg for another focused blog.

Character Types

  storage type                    | Description
 ---------------------------------+-------------------------------
 character varying(n), varchar(n) | variable-length with limit
 character(n), char(n)            | fixed-length, blank padded
 text                             | variable unlimited length

The strings, obviously, can be of several different types. Although equivalent to varchar, text is favored as the character type. It is the native PostgreSQL character type. If and only if you truly want one and only one character, use char(1). character(n) does not gain any speed in PostgreSQL and the others are longer to type.

Here is a little poem for us to experiment with, stored as one row per line of a poem with a foreign key reference to the title information.

    create table titles (titleid serial primary key, title text, author text);
insert into titles (title, author) values ('Jabberwocky', 'Lewis Carroll');
create table poems (titleid integer references titles(titleid), lineno serial,
line text, primary key (titleid, lineno) );
\copy poems (titleid, line) from './jabberwocky'

As you can see above in the insert statement, generally text is single quoted string with any doubled single quote reflecting a single quote in the data.

(Note that I am playing fast and loose on the serial.  If this table, poems were to contain many poems, the line numbering would be off.  Unless the serial were restarted when loading a new poem.  And then you could not qualify by only line number.)

If you want to load in data, including character strings, you can use copy or \copy. However, it is up to you to double embedded single quotes and ensure the data matches the table layout. With copy or \copy, use the default delimiter (tabs) or specify another one in the file to make it easier. You can also use csv, binary, text (default) as the format type of the file.

insert into orders (buyer, amt, purch, date)
values( 'Jerry O''Brian', 135, 'Hamster Wheels', now::timestamp );

\copy poems (titleid, line) from './jabberwocky' DELIMITER '    '

Pattern Matching — Boolean

Per   The Fine Manual :

“If you have pattern matching needs that go beyond this, consider writing a user-defined function in Perl or Tcl.”

As a matter of fact any stored procedure language can be used to write a string mangler function for your use.  And you can use the built-in examples from here to add to their strength.

The ESCAPE character defaults to backslash, but can be changed to any other character using the ESCAPE clause in any statement that supports it. The double single quote is still required.

LIKE, ILIKE

Like and ilike use NOT to negate the statement.  The operators, shown below, use ! to negate the statement.

 op   |          description
------+-------------------------------
like | matches characters, case sensitive
ilike| matches characters, case insensitive
~~ | is LIKE
~~* | is ILIKE
% | matches 0 or more characters.
_ | matches any one character

=> select line, 'like '' J_''' as qual
from poems where line like ' J_';
line | qual
------+------
(0 rows)

We were looking for lines where words had a leading ‘J’.  However, for LIKE the match has to be exact.  To make it work correctly, use the ‘%’ at the start and end of the pattern.

=> select line , 'like %bb%' as qual 
from poems where line like '%bb%';
line | qual
-------------------------------------+-----------
"Beware the Jabberwork, my son! | like %bb%
The Jabberwock, with eyes of flame, | like %bb%
And hast thou slain the Jabberwock? | like %bb%
"Beware the Jabberwock, my son! | like %bb%
The Jabberwock, with eyes of flame, | like %bb%
And hast thou slain the Jabberwock? | like %bb%
(6 rows)

-- expect any line containing j or J
=> select lineno, line, 'ilike %J%' as qual
from poems where line ilike '%J%' order by lineno;
lineno | line | qual
--------+-------------------------------------------+-----------
6 | "Beware the Jabberwock, my son! | ilike %J%
7 | The jaws that bite, the claws that catch! | ilike %J%
8 | Beware the Jubjub bird and shun | ilike %J%
17 | The Jabberwock, with eyes of flame, | ilike %J%
26 | And hast thou slain the Jabberwock? | ilike %J%
28 | O frabjous day! Calloh! Callay! | ilike %J%
29 | He chortled in his joy. | ilike %J%
42 | "Beware the Jabberwock, my son! | ilike %J%
43 | The jaws that bite, the claws that catch! | ilike %J%
44 | Beware the Jubjub bird and shun | ilike %J%
53 | The Jabberwock, with eyes of flame, | ilike %J%
62 | And hast thou slain the Jabberwock? | ilike %J%
64 | O frabjous day! Calloh! Callay! | ilike %J%
65 | He chortled in his joy. | ilike %J%
(14 rows)


Note that in this poem, the first stanza and the last
stanza are the same text values but have different line numbers.
You will notice this in other selections of the lines of the poem.

SIMILAR TO: similar to not similar to:

  op   |                          description
-------+----------------------------------------------------------------
| | alternation (or)
% | matches 0 or more characters
_ | matches any one character
* | repetition 0 or more times
+ | repetition 1 or more times
{m} | repetition m times
{m,} | repetition m times or more
{m,n} | repetition m times or more, but not more than n times.

Looking for lines containing double f, double l, double Tum or double Jub

=>select line from poems 
where line similar to '%(f|l|Tum|Jub){2}%';

line
------------------------------------------
''Twas brillig, and the slithy toves
All mimsy were the borgoves
So rested he by the TumTum tree,
And, as in uffish thought he stood,
The Jabberwock, with eyes of flame,
Came whiffling through the tulgey wood,
O frabjous day! Calloh! Callay!
''Twas brillig, and the slithy toves
All mimsy were the borgoves
Lewis Carroll, Through the Looking Glass
(10 rows)


SIMILAR TO and LIKE are pretty easy but rather bulky.  That brings us to the most useful pattern matching: POSIX regular expressions.

Regular Expressions

POSIX functions and operators use regular expressions. Yea! There are two types of regular expressions, EREs (extended RE) as in egrep and BRE (Basic RE) as in ed. The Fine Manual has detailed descriptions of the various types of regular expressions starting in Chapter 9.7.3 of the Fine Manual.  therefore, as written previously, I am going to leave regular expressions to the user.

The following table is lifted from The Fine Manual, 9.7.3 POSIX Regular Expresions Table 9.14.  This table shows the simplicity of the operators for regular expressions, nicely.

 opr | description                                         | example
-----+-----------------------------------------------------+----------------------------
~ | Matches regular expression, case sensitive | 'thomas' ~ '.*thomas.*'
!~* | Does not match regular expression, case insensitive | 'thomas' !~* '.*vadim.*'
!~ | Does not match regular expression, case sensitive | 'thomas' !~ '.*Thomas.*'
~* | Matches regular expression, case insensitive | 'thomas' ~* '.*Thomas.*'
(4 rows)


Given a choice between where column LIKE %value_% and where column ~ value there is no question, imho, that a simple tilde is simpler and clearer.
Here are some examples using various regular expression functions and operators.

=> select lineno, line 
from poems where line ~ 'vorpal' order by lineno;
lineno | line
--------+--------------------------------------
11 | He took his vorpal sword in hand:
22 | The vorpal blade went snicker-snack!
(2 rows)

Apparently no line starts with a lower case character.

=>select count(lineno) 
from poems where line !~ '^[a-z]';
count
-------
36
(1 row)

Let us check for lines that do not have a starting upper case letter. This includes blank lines.

=> select lineno, line 
from poems where line !~ '^[A-Z]' order by lineno;
lineno | line
--------+--------------------------------------
1 | ''Twas brillig, and the slithy toves
5 |
6 | "Beware the Jabberwock, my son!
10 |
15 |
20 |
25 |
30 |
31 | ''Twas brillig, and the slithy toves
35 |
(10 rows)

Here we are looking for the word and, in case insensitive mode. It will include all of the ‘ands’ that begin the line or start with a leading space. Words that include ‘and’ will not be selected.

=> select lineno, line from poems 
where line ~* '(\mand)' order by lineno;
lineno | line
--------+---------------------------------------------
1 | ''Twas brillig, and the slithy toves
2 | Did gyre and gimble in the wabe;
4 | And the mome raths outgrabe.
8 | Beware the Jubjub bird and shun
14 | And stood awhile in thought.
16 | And, as in uffish thought he stood,
19 | And burbled as it came!
21 | One, two! One, two! And through and through
23 | He left it dead, and with its head
26 | And hast thou slain the Jabberwock?
31 | ''Twas brillig, and the slithy toves
32 | Did gyre and gimble in the wabe;
34 | And the mome raths outgrabe.
(13 rows)

Functions

The function substring(string from pattern) returns NULL or the matched value from the leftmost parentheses. We are looking for the remaining words on the line that starts with quotequoteTwas.


=> select lineno, substring( line from '['']{2}Twas (.*$)')
from poems order by lineno;
lineno | substring
--------+-------------------------------
1 | brillig, and the slithy toves
31 | brillig, and the slithy toves
(2 rows)

The function regexp_replace(source, pattern, replacement[,flags]) can reference ‘\n’ as nth (subexpression) and the most common flags are i (case insensitive) and g (apply to all matches in the source) There are more flags available and can be referenced from Chapter 9.7.3 of the Fine Manual.
Here what we want to do is to change all of the ‘ands’ in every line to an & and show the resultant line. The pattern is set to find case insensitive And after a start line or a space and followed by a space or comma.  Using this pattern, we can put back the trailing space or comma and we can be sure that we are not changing ‘ands’ that reside in words.


=> select lineno,
regexp_replace(line, '([^ ]?[Aa]nd([\s,]))', '&\2', 'g') as modified
from poems where line ~* ''and order by lineno;
lineno | modified
--------+-----------------------------------------
1 | ''Twas brillig, & the slithy toves
2 | Did gyre & gimble in the wabe;
4 | & the mome raths outgrabe.
8 | Beware the Jubjub bird & shun
9 | The frumious Bandersnatch!"
11 | He took his vorpal sword in hand:
14 | & stood awhile in thought.
16 | &, as in uffish thought he stood,
19 | & burbled as it came!
21 | One, two! One, two! & through & through
23 | He left it dead, & with its head
26 | & hast thou slain the Jabberwock?
31 | ''Twas brillig, & the slithy toves
32 | Did gyre & gimble in the wabe;
34 | & the mome raths outgrabe.
(15 rows)

The only thing I don’t like about this poem is the killing of
the Jabberwock. So, lets be rogue and change ‘dead’ with ‘instead’.

=> select lineno, line from poems where line ~ 'dead';
lineno | modifier
--------+---------------------------------------
23 | He left it dead, and with its head

update poems set line = regexp_replace( line, (dead), instead, g) where line ~ dead;

=> select lineno,
regexp_replace( line, '(dead)', 'instead', g) as modifier
from poems where line ~ 'dead';
lineno | modifier
--------+---------------------------------------
23 | He left it instead, and with its head
(1 row)

Now that we are mangling the poem, let us swap some words.

=> select lineno, line, 
regexp_replace(line, '(gyre)(.*)(gimble)', '\3\2\1', 'gi') as modified
from poems where line ~ 'gyre';
-[ RECORD 1 ]------------------------------
lineno | 2
line | Did gyre and gimble in the wabe;
modified | Did gimble and gyre in the wabe;
-[ RECORD 2 ]------------------------------
lineno | 32
line | Did gyre and gimble in the wabe;
modified | Did gimble and gyre in the wabe;

Oh, that is not a very good word switch; it does not have the right rhythm. Good thing we weren’t updating but only selecting.

Functions: match and matches

The functions regexp_match(string, pattern [,flags]) and regexp_matches(string, pattern [,flags]) return text array(s) of the matching patterns. Match returns just one text array divided by the parenthetical expressions in the pattern. Matches returns rows of text arrays from the outermost parenthetical expressions that could be sibling expressions. It gives us the entire line.  But let us try it a second time and extract only the matching words using parentheses.

Here we are looking for the line where the vorpal sword snickers. I am eliminating all rows that do not contain vorpal. Notice in the first query, we get two resulting lines; the first doesn’t completely match the pattern so it is empty. The second vorpal does match the pattern and so we get the whole line as a text array.


=> select lineno, regexp_match( line, '.*vorpal.*snicker.*', 'i')
from poems where line ~ 'vorpal';
lineno | regexp_match
--------+------------------------------------------
11 |
22 | {"The vorpal blade went snicker-snack!"}
(2 rows)
=> select lineno, regexp_match( line, '.*(vorpal).*(snicker).*', 'i') 
from poems where line ~ 'vorpal';
lineno | regexp_match
--------+------------------
11 |
22 | {vorpal,snicker}
(2 rows)

So that is how regexp_match works. regexp_matches should bring rows of text array based on your parenthetical sibling patterns. However, I leave it as an exercise for the user to show multiple rows other than the manual example. That requires applying the same pattern more than once. This example picks out four words into one text array and uses two doubled patterns.

=> select lineno, line, 
regexp_matches( line, '(O..)[,\w ]+! (O..)[,\w ]+! And (thro...)[ \w]+ (thro...)', 'g')
from poems where line ~ 'One';
-[ RECORD 1 ]--+--------------------------------------------
lineno | 21
source line | One, two! One, two! And through and through
regexp_matches | {One,One,through,through}

Split Functions

Now we get to some fun functions. We change a text line into an array and then into rows in a table.  When you are writing your own wrangle strings functions, you will find both of these functions  very useful. It is sometimes easier to index into an array or to loop through a table than to try to extract the words that you want from the input text.

First we will do it “plain” with the ‘\s’ flag which tells the function to separate on spaces.  Then we will separate by character with the ‘\s*’ flag. And then we will do it all again with regexp_split_to_table().


=> select lineno, regexp_split_to_array( line, '\s')
from poems where line ~ '^So';
lineno | regexp_split_to_array
--------+--------------------------------------
13 | {So,rested,he,by,the,TumTum,"tree,"}
(1 row)

=> select lineno, regexp_split_to_array( line, '\s')
from poems where lineno = 14
lineno | regexp_split_to_array
--------+--------------------------------
14 | {And,stood,awhile,in,thought.}
(1 row)

=> select lineno, regexp_split_to_array( line, '\s*')
from poems where lineno = 29;
lineno | regexp_split_to_array
--------+-----------------------------------------
29 | {H,e,c,h,o,r,t,l,e,d,i,n,h,i,s,j,o,y,.}
(1 row)

=> select lineno, regexp_split_to_table( line, '\s')
from poems where line ~ '^So';
lineno | regexp_split_to_table
--------+-----------------------
13 | So
13 | rested
13 | he
13 | by
13 | the
13 | TumTum
13 | tree,
(7 rows)

=> select lineno, regexp_split_to_table( line, '\s')
from poems where lineno = 14;
lineno | regexp_split_to_table
--------+-----------------------
14 | And
14 | stood
14 | awhile
14 | in
14 | thought.
(5 rows)

=> select lineno, regexp_split_to_table( line, '\*')
from poems where lineno = 29;
lineno | regexp_split_to_table
--------+-----------------------
29 | H
29 | e
29 | c
29 | h
29 | o
29 | r
29 | t
29 | l
29 | e
29 | d
29 | i
29 | n
29 | h
29 | i
29 | s
29 | j
29 | o
29 | y
29 | .
(19 rows)

These functions and operators, some easy, some whacky, are what you will use when selecting  your data from the database either through SQL or via your own stored procedures.  Look forward to companions to this blog to show datetime functions and show output formatting for both strings and dates.

For those not familiar with this English/jibberish poem you can read the whole thing here Jabberwocky  It is from the book “Through the Looking Glass and What Alice Found There” by Lewis Carroll, 1871.

0 replies

Leave a Reply

Want to join the discussion?
Feel free to contribute!

Leave a Reply

Your email address will not be published. Required fields are marked *