Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL
I had to write a simple query where I go looking for people's name that start with a B or a D :
SELECT s.name FROM spelers s WHERE s.name LIKE 'B%' OR s.name LIKE 'D%' ORDER BY 1
I was wondering if there is a way to rewrite this to become more performant. So I can avoid
orand / or
Your query is pretty much the optimum. Syntax won't get much shorter, query won't get much faster:
SELECT name FROM spelers WHERE name LIKE 'B%' OR name LIKE 'D%' ORDER BY 1;
If you really want to shorten the syntax, use a regular expression with branches:
... WHERE name ~ '^(B|D).*'
Or slightly faster, with a character class:
... WHERE name ~ '^[BD].*'
A quick test without index yields faster results than for
SIMILAR TOin either case for me.
With an appropriate B-Tree index in place,
LIKEwins this race by orders of magnitude.
Read the basics about pattern matching in the manual.
Index for superior performance
If you are concerned with performance, create an index like this for bigger tables:
CREATE INDEX spelers_name_special_idx ON spelers (name text_pattern_ops);
Makes this kind of query faster by orders of magnitude. Special considerations apply for locale-specific sort order. Read more about operator classes in the manual. If you are using the standard "C" locale (most people don't), a plain index (with default operator class) will do.
Such an index is only good for left-anchored patterns (matching from the start of the string).
SIMILAR TOor regular expressions with basic left-anchored expressions can use this index, too. But not with branches
(B|D)or character classes
[BD](at least in my tests on PostgreSQL 9.0).
Trigram matches or text search use special GIN or GiST indexes.
Overview of pattern matching operators
~(regular expression match) is powerful but more complex and may be slow for anything more than basic expressions.
SIMILAR TOis just pointless. A peculiar halfbreed of
LIKEand regular expressions. I never use it. See below.
% is the "similarity" operator, provided by the additional module
pg_trgm. See below.
@@is the text search operator. See below.
pg_trgm - trigram matching
Beginning with PostgreSQL 9.1 you can facilitate the extension
pg_trgmto provide index support for any
ILIKEpattern (and simple regexp patterns with
~) using a GIN or GiST index.
Details, example and links:
pg_trgmalso provides these operators:
%- the "similarity" operator
%>) - the "word_similarity" operator in Postgres 9.6 or later
%>>) - the "strict_word_similarity" operator in Postgres 11 or later
Is a special type of pattern matching with separate infrastructure and index types. It uses dictionaries and stemming and is a great tool to find words in documents, especially for natural languages.
Prefix matching is also supported:
As well as phrase search since Postgres 9.6:
Additional tools for fuzzy string matching
The additional module fuzzystrmatch offers some more options, but performance is generally inferior to all of the above.
In particular, various implementations of the
levenshtein()function may be instrumental.
Why are regular expressions (
~) always faster than
The answer is simple.
SIMILAR TOexpressions are rewritten into regular expressions internally. So, for every
SIMILAR TOexpression, there is at least one faster regular expression (that saves the overhead of rewriting the expression). There is no performance gain in using
And simple expressions that can be done with
~~) are faster with
SIMILAR TOis only supported in PostgreSQL because it ended up in early drafts of the SQL standard. They still haven't gotten rid of it. But there are plans to remove it and include regexp matches instead - or so I heard.
EXPLAIN ANALYZEreveals it. Just try with any table yourself!
EXPLAIN ANALYZE SELECT * FROM spelers WHERE name SIMILAR TO 'B%';
... Seq Scan on spelers (cost= ... Filter: (name ~ '^(?:B.*)$'::text)
SIMILAR TOhas been rewritten with a regular expression (
Ultimate performance for this particular case
EXPLAIN ANALYZEreveals more. Try, with the afore-mentioned index in place:
EXPLAIN ANALYZE SELECT * FROM spelers WHERE name ~ '^B.*;
... -> Bitmap Heap Scan on spelers (cost= ... Filter: (name ~ '^B.*'::text) -> Bitmap Index Scan on spelers_name_text_pattern_ops_idx (cost= ... Index Cond: ((prod ~>=~ 'B'::text) AND (prod ~<~ 'C'::text))
Internally, with an index that is not locale-aware (
text_pattern_opsor using locale
C) simple left-anchored expressions are rewritten with these text pattern operators:
~<~. This is the case for
The same is true for indexes on
So, applied to the original question, this is the fastest possible way:
SELECT name FROM spelers WHERE name ~>=~ 'B' AND name ~<~ 'C' OR name ~>=~ 'D' AND name ~<~ 'E' ORDER BY 1;
Of course, if you should happen to search for adjacent initials, you can simplify further:
WHERE name ~>=~ 'B' AND name ~<~ 'D' -- strings starting with B or C
The gain over plain use of
~~is tiny. If performance isn't your paramount requirement, you should just stick with the standard operators - arriving at what you already have in the question.
The OP doesn't have an index on name but do you happen to know, if they did, would their original query involve 2 range seeks and `similar` a scan?
@MartinSmith: A quick test with `EXPLAIN ANALYZE` shows 2 bitmap index scans. Multiple bitmap index scans can be combined rather quickly.
Thanks. So would there be any milage with replacing the `OR` with `UNION ALL` or replacing `name LIKE 'B%'` with `name >= 'B' AND name <'C'` in Postgres?
@LucasKauffman - But if you create the index and try again sounds like you may get a different result :-)
I don't have enough rights on the database to create one so unfortunately I can't test it :(
@MartinSmith: `UNION` won't but, yes, combining the ranges into one `WHERE` clause will speed up the query. I have added more to my answer. Of course, you have to take your locale into account. Locale-aware search is always slower.
@LucasKauffman: Try `WHERE name ~ '^[BD].*'`. That's faster than `SIMILAR TO` under all circumstances - as I have explained in my amended answer.
I wonder if the new trigram indexes in 9.1 would speed things up: http://www.depesz.com/index.php/2011/02/19/waiting-for-9-1-faster-likeilike/
@a_horse_with_no_name: I expect not. The new capabilities of pg_tgrm with GIN indexes are a treat for generic text search. A search anchored at the start is already faster than that.
Is there a way to combine regular expression constraints such as ^ or $ with a field name, when you compare two fields? Example: (...) WHERE field1 ~ ^field2;
@SébastienClément: Yes, concatenation. The regexp pattern is a plain string. So: `WHERE field1 ~ ('^' || field2)`. You may or may not want to escape characters with special meaning in `field2`. See: https://stackoverflow.com/a/45741630/939860
@Erwin Brandstetter: fantastic, I had just forgotten the parentheses, thanks!
How about adding a column to the table. Depending on your actual requirements:
person_name_start_with_B_or_D (Boolean) person_name_start_with_char CHAR(1) person_name_start_with VARCHAR(30)
PostgreSQL doesn't support computed columns in base tables a la SQL Server but the new column can be maintained via trigger. Obviously, this new column would be indexed.
Alternatively, an index on an expression would give you the same, cheaper. E.g.:
CREATE INDEX spelers_name_initial_idx ON spelers (left(name, 1));
Queries that match the expression in their conditions can utilize this index.
This way, the performance hit is taken when the data is created or amended, so may only be appropriate for a low activity environment (i.e. much fewer writes than reads).
You could try
SELECT s.name FROM spelers s WHERE s.name SIMILAR TO '(B|D)%' ORDER BY s.name
I've no idea whether or not either the above or your original expression are sargable in Postgres though.
If you create the suggested index would also be interested to hear how this compares with the other options.
SELECT name FROM spelers WHERE name >= 'B' AND name < 'C' UNION ALL SELECT name FROM spelers WHERE name >= 'D' AND name < 'E' ORDER BY name
What I have done in the past, faced with a similar performance issue, is to increment the ASCII character of the last letter, and do a BETWEEN. You then get the best performance, for a subset of the LIKE functionality. Of course, it only works in certain situations, but for ultra-large datasets where you're searching on a name for instance, it makes performance go from abysmal to acceptable.
For checking of initials, I often use casting to
"char"(with the double quotes). It's not portable, but very fast. Internally, it simply detoasts the text and returns the first character, and "char" comparison operations are very fast because the type is 1-byte fixed length:
SELECT s.name FROM spelers s WHERE s.name::"char" =ANY( ARRAY[ "char" 'B', 'D' ] ) ORDER BY 1
Note that casting to
"char"is faster than the
ascii()slution by @Sole021, but it is not UTF8 compatible (or any other encoding for that matter), returning simply the first byte, so should only be used in cases where the comparison is against plain old 7-bit ASCII characters.
There are two methods not mentioned yet for dealing with such cases:
partial (or partitioned - if created for full range manually) index - most useful when only a subset of data is required (for example during some maintenance or temporary for some reporting):
CREATE INDEX ON spelers WHERE name LIKE 'B%'
partitioning the table itself (using the first character as partitioning key) - this technique is especially worth considering in PostgreSQL 10+ (less painful partitioning) and 11+ (partition pruning during query execution).
Moreover, if the data in a table is sorted, one can benefit from using BRIN index (over the first character).