Choosing a PostgreSQL text search method
(This article is written with reference to PostgreSQL 9.3. If you’re using a newer version please check to make sure any limitations described remain in place.)
PostgreSQL offers several tools for searching and pattern matching text. The challenge is choosing which to use for a job. There’s:
- LIKE and ILIKE SQL pattern matching;
- ~ and ~* operators for mostly-perl-compatible regular expressions;
- full text search with @@, to_tsvector and to_tsquery
- Use of an external search provider like Apache Lucene / Solr.
There’s also SIMILAR TO, but we don’t speak of that in polite company, and PostgreSQL turns it into a regular expression anyway.
Each of the built-in searching options comes with multiple choices of index:
- b-tree indexes with the text_pattern_ops opclass for left anchored (prefix) LIKE and regex searches;
- GiST or GIN pg_trgm indexes for infix and suffix pattern matching searches using KNNGIST;
- GiST and GIN full-text search indexes with options for language and stemming.
It’s not surprising that people get confused.
Rather than starting with “what method should I use to make my search fastest”, I suggest you narrow the field by determining what the semantics of your search requirements are.
In the following descriptions any examples will use the words table created with:
create table words ( word text not null ); \copy words from /usr/share/dict/linux.words
(Your system might have different dictionary files in /usr/share/dict, but the effect will be much the same.)
Simple prefix searches (“Starts with…”)
Are you doing only simple prefix searches that match terms exactly, including punctuation and whitespace, either case sensitive or insensitive? If so, a simple column LIKE 'pattern%' search on a text_pattern_ops index may be suitable:
regress=> CREATE INDEX words_btree_tpo ON words(word text_pattern_ops); regress=> # LIKE prefix search is fast: regress=> EXPLAIN (ANALYZE ON, COSTS OFF, TIMING OFF) SELECT word FROM words WHERE word LIKE 'freck%'; QUERY PLAN ------------------------------------------------------------------------- Index Only Scan using words_btree_tpo on words (actual rows=18 loops=1) Index Cond: ((word ~>=~ 'freck'::text) AND (word ~ # as is left-anchored regexp matching regress=> EXPLAIN (ANALYZE ON, COSTS OFF, TIMING OFF) SELECT word FROM words WHERE word ~ '^freck.*'; QUERY PLAN ------------------------------------------------------------------------- Index Only Scan using words_btree_tpo on words (actual rows=18 loops=1) Index Cond: ((word ~>=~ 'freck'::text) AND (word ~ # ILIKE doesn't use the index though: regress=> EXPLAIN (ANALYZE ON, COSTS OFF, TIMING OFF) SELECT word FROM words WHERE word ILIKE 'freck%'; QUERY PLAN -------------------------------------------- Seq Scan on words (actual rows=18 loops=1) Filter: (word ~~* 'freck%'::text) Rows Removed by Filter: 479810 Total runtime: 339.787 ms (4 rows) regress=> # and neither does a suffix search: regress=> EXPLAIN (ANALYZE ON, COSTS OFF, TIMING OFF) SELECT word FROM words WHERE word LIKE '%freck'; QUERY PLAN ------------------------------------------- Seq Scan on words (actual rows=1 loops=1) Filter: (word ~~ '%freck'::text) Rows Removed by Filter: 479827 Total runtime: 91.125 ms (4 rows)
In newer PostgreSQL versions this works even if you concatenate the wildcard onto a parameter:
regress=> PREPARE ps1(text) AS SELECT word FROM words WHERE word LIKE $1 || '%'; regress=> EXPLAIN (ANALYZE ON, COSTS OFF, TIMING OFF) EXECUTE ps1('freck'); QUERY PLAN ------------------------------------------------------------------------- Index Only Scan using words_btree_tpo on words (actual rows=18 loops=1) Index Cond: ((word ~>=~ 'freck'::text) AND (word ~<~ 'frecl'::text)) Filter: (word ~~ 'freck%'::text) Heap Fetches: 18 Total runtime: 0.060 ms (5 rows)
Variants on prefix search
For case insensitive prefix searches, you can use lower(column) LIKE lower(pattern):
regress=> EXPLAIN (ANALYZE ON, COSTS OFF, TIMING OFF) SELECT word FROM words WHERE lower(word) LIKE lower('freck%'); QUERY PLAN -------------------------------------------------------------------------------------------- Bitmap Heap Scan on words (actual rows=18 loops=1) Filter: (lower(word) ~~ 'freck%'::text) -> Bitmap Index Scan on words_lower_btree_tpo (actual rows=18 loops=1) Index Cond: ((lower(word) ~>=~ 'freck'::text) AND (lower(word) ~<~ 'frecl'::text)) Total runtime: 0.073 ms (5 rows)
citext won’t help you; ILIKE won’t use an index even with the citext data type:
regress=> CREATE TABLE wordsci ( word citext not null ); regress=> \copy wordsci from '/usr/share/dict/linux.words' regress=> create index wordsci_btree_tpo ON wordsci (word text_pattern_ops); regress=> explain SELECT word FROM wordsci WHERE word LIKE 'AIL%'; QUERY PLAN -------------------------------------------------------------- Seq Scan on wordsci (cost=0.00..8463.85 rows=2399 width=10) Filter: (word ~~ 'AIL%'::citext) (2 rows) regress=> explain SELECT word FROM wordsci WHERE word ILIKE 'AIL%'; QUERY PLAN -------------------------------------------------------------- Seq Scan on wordsci (cost=0.00..8463.85 rows=2399 width=10) Filter: (word ~~* 'AIL%'::citext) (2 rows)
If you’re doing only “ends with” searches you can actually index reverse(my_column) with text_pattern_ops and then search for LIKE reverse(my_pattern) in some cases, so you can use the same approach as prefix search. Again, this only works when you’re matching punctuation and spacing exactly.
regress=> CREATE INDEX words_rev_btree_tpo ON words(reverse(word) text_pattern_ops); regress=> # Find words that end with "ent": regress=> EXPLAIN (ANALYZE ON, COSTS OFF, TIMING OFF) SELECT word FROM words WHERE reverse(word) LIKE reverse('%ent'); QUERY PLAN -------------------------------------------------------------------------------------------- Bitmap Heap Scan on words (actual rows=4069 loops=1) Filter: (reverse(word) ~~ 'tne%'::text) -> Bitmap Index Scan on words_rev_btree_tpo (actual rows=4069 loops=1) Index Cond: ((reverse(word) ~>=~ 'tne'::text) AND (reverse(word) ~<~ 'tnf'::text)) Total runtime: 5.680 ms (5 rows)
You can do simple punctuation and spacing normalisation with a user-defined function that transforms the input string using replace or regexp_replace, so you search for my_normalize_func(col) LIKE my_normalize_func('pattern') … but it quickly gets inefficient and clumsy to work like this. It can be a good option if you have unusual or strict search requirements, though.
infix and suffix patterns
If you’re still reading, you probably need to match within the string, not just at the start, or you need to ignore punctuation and formatting differences. If you’ve been writing LIKE 'my%pattern' or LIKE '%word%' then this is you.
An infix wildcard like my%pattern will use a text_pattern_ops btree index, doing a search for my% then re-checking the matches. This is often good enough:
regress=> EXPLAIN (ANALYZE ON, COSTS OFF, TIMING OFF) SELECT word FROM words WHERE word LIKE 'ta%le'; QUERY PLAN ------------------------------------------------------------------------ Bitmap Heap Scan on words (actual rows=59 loops=1) Filter: (word ~~ 'ta%le'::text) Rows Removed by Filter: 2661 -> Bitmap Index Scan on words_btree_tpo (actual rows=2720 loops=1) Index Cond: ((word ~>=~ 'ta'::text) AND (word ~<~ 'tb'::text)) Total runtime: 1.386 ms (6 rows)
… but one with no left-anchored text at all, like %word%, cannot use a regular b-tree. In that case, the next question is whether you need to match partial words or not. If you’re given the search term ruff and need to find truffle then you’ve really only got one option for this kind of mid-word “contains” search: pg_trgm indexes on pattern matching LIKE or ~ searches.
As superuser:
regress=# CREATE EXTENSION pg_trgm;
then:
regress=> CREATE INDEX words_trgm_gin ON words USING GIN(word gin_trgm_ops); regress=> EXPLAIN (ANALYZE ON, COSTS OFF, TIMING OFF) SELECT word FROM words WHERE word LIKE '%ruff%'; QUERY PLAN --------------------------------------------------------------------- Bitmap Heap Scan on words (actual rows=99 loops=1) Recheck Cond: (word ~~ '%ruff%'::text) Rows Removed by Index Recheck: 8 -> Bitmap Index Scan on words_trgm_gin (actual rows=107 loops=1) Index Cond: (word ~~ '%ruff%'::text) Total runtime: 0.511 ms (6 rows)
Full-text search
If you’re looking for whole words or word prefixes you can consider full-text search. For your application:
- Is it OK to ignore the order of search terms? Should dog jumped match jumped dog?
- Do you want to offer a boolean / advanced search with and/or/not queries like jumped & !shark?
- Do you need to stem words to their roots, so that a search for “cats” finds “cat”, etc?
- Are you OK with inexact matches that ignore punctuation and case?
If all of that is “yes”, you probably want to use full-text search. A large variety of dictionaries are offered to customise behaviour with synonym matching, stemming, canonicalization of related terms, etc.
Full-text search supports word-by-word prefix matching, so you can match supercalifragilisticexpialidocious with super:*. It also offers boolean search with and, or, negation, and grouping, leading to such queries as:
SELECT to_tsvector('supercalafragalisticexpialidocious') @@ to_tsquery('(super:*) & !(superawful:* | supper)');
If you don’t want stemming and synonyms you can still use full text search with one of the thesaurus dictionaries or with the ‘simple’ dictionary. Prefix searches still work without stemming.
I won’t demonstrate all the options for full text search here are they’re already well covered elsewhere.
Mixed/multiple language search
Multi-language search is supported by full text search if you have a separate column that identifies the language the text is in. You can then index to_tsvector(language_column, text_column).
Mixed-language text or unknown-language text search is supported by full-text search, but only if you use the simple dictionary, in which case you don’t get stemming.
In conclusion
… it’s complicated. You need to know what you want before you can decide what to use.
You will notice that I barely even touched on performance in quantitative terms throughout this entire discussion. I looked into qualitative factors like which queries can use which index types, and into semantics, but not the details of timing and performance measurement. I didn’t get into GIN vs GiST choice for index types that offer both. That’s a whole separate topic, but one that’s already been discussed elsewhere. In any case the only really acceptable performance guidance is benchmarks you run on your hardware with a simulation of your data set.
At no point did I try to determine whether LIKE or full-text search is faster for a given query. That’s because it usually doesn’t matter; they have different semantics. Which goes faster, a car or a boat? In most cases it doesn’t matter because speed isn’t your main selection criteria, it’s “goes on water” or “goes on land”.
You wrote: “Mixed-language text or unknown-language text search is supported by full-text search, but only if you use the simple dictionary, in which case you don’t get stemming.” => Are you sure about this? Why does simple dictionary prevent stemming? (Side note: AFAIK dictionaries can be strung together with the most general dictionary at the end of the list).
Stemming requires language specific knowledge about words and suffixes. The simple dictionary doesn’t know about languages, so it can’t do much. Try it; compare SELECT to_tsvector(‘simple’, ‘cats’), to_tsvector(‘english’,’cats’);
You’re quite right that you can string multiple dictionaries together and do a lot of other very powerful things with tsearch. I didn’t want to try to cover all that in an already long article, though.
I notice you mention “Use of an external search provider like Apache Lucene / Solr,” but you never get into what they’re good at. Could you touch briefly on what kind of needs would mean you need to go to these external technologies for?
Good point. I’ll try to find the time to elaborate on that in an edit. In brief, they’re good when you need something more sophisticated than PostgreSQL’s tsearch can easily provide.
Multi-table-spanning searches are a pain with full text search in Pg, you need (expensive) triggers to maintain summary tables. Solr can handle that much better at the cost of an index that isn’t always perfectly up to date.
Solr also offers searching of XML, HTML and other structured content in the database and lots more. There’s way too much to discuss here. The short version is that it’s an option when you hit the limits of the built-in offerings.
Thanks, nice post