Text Search Dictionaries
Tokens produced by the Apache Cloudberry full text search parser are passed through a chain of dictionaries to produce a normalized term or "lexeme". Different kinds of dictionaries are available to filter and transform tokens in different ways and for different languages.
This section contains the following subtopics:
- About text search dictionaries
- Stop words
- Simple dictionary
- Synonym dictionary
- Thesaurus dictionary
- Ispell dictionary
- SnowBall dictionary
About text search dictionaries
Dictionaries are used to eliminate words that should not be considered in a search (stop words), and to normalize words so that different derived forms of the same word will match. A successfully normalized word is called a lexeme. Aside from improving search quality, normalization and removal of stop words reduces the size of the tsvector representation of a document, thereby improving performance. Normalization does not always have linguistic meaning and usually depends on application semantics.
Some examples of normalization:
- Linguistic - Ispell dictionaries try to reduce input words to a normalized form; stemmer dictionaries remove word endings
- URL locations can be canonicalized to make equivalent URLs match:
http://www.pgsql.ru/db/mw/index.htmlhttp://www.pgsql.ru/db/mw/http://www.pgsql.ru/db/../db/mw/index.html
- Color names can be replaced by their hexadecimal values, for example,
red,green,blue,magenta->FF0000,00FF00,0000FF,FF00FF - If indexing numbers, you can remove some fractional digits to reduce the range of possible numbers, so for example
3.14159265359,3.1415926,3.14will be the same after normalization if only two digits are kept after the decimal point.
A dictionary is a program that accepts a token as input and returns:
- an array of lexemes if the input token is known to the dictionary (notice that one token can produce more than one lexeme)
- a single lexeme with the
TSL_FILTERflag set, to replace the original token with a new token to be passed to subsequent dictionaries (a dictionary that does this is called a filtering dictionary) - an empty array if the dictionary knows the token, but it is a stop word
NULLif the dictionary does not recognize the input token
Apache Cloudberry provides predefined dictionaries for many languages. There are also several predefined templates that can be used to create new dictionaries with custom parameters. Each predefined dictionary template is described below. If no existing template is suitable, it is possible to create new ones; see the contrib/ area of the Apache Cloudberry distribution for examples.
A text search configuration binds a parser together with a set of dictionaries to process the parser's output tokens. For each token type that the parser can return, a separate list of dictionaries is specified by the configuration. When a token of that type is found by the parser, each dictionary in the list is consulted in turn, until some dictionary recognizes it as a known word. If it is identified as a stop word, or if no dictionary recognizes the token, it will be discarded and not indexed or searched for. Normally, the first dictionary that returns a non-NULL output determines the result, and any remaining dictionaries are not consulted; but a filtering dictionary can replace the given word with a modified word, which is then passed to subsequent dictionaries.
The general rule for configuring a list of dictionaries is to place first the most narrow, most specific dictionary, then the more general dictionaries, finishing with a very general dictionary, like a Snowball stemmer or simple, which recognizes everything. For example, for an astronomy-specific search (astro_en configuration) one could bind token type asciiword (ASCII word) to a synonym dictionary of astronomical terms, a general English dictionary and a Snowball English stemmer:
ALTER TEXT SEARCH CONFIGURATION astro_en
ADD MAPPING FOR asciiword WITH astrosyn, english_ispell, english_stem;
A filtering dictionary can be placed anywhere in the list, except at the end where it'd be useless. Filtering dictionaries are useful to partially normalize words to simplify the task of later dictionaries. For example, a filtering dictionary could be used to remove accents from accented letters, as is done by the unaccent module.
Stop words
Stop words are words that are very common, appear in almost every document, and have no discrimination value. Therefore, they can be ignored in the context of full text searching. For example, every English text contains words like a and the, so it is useless to store them in an index. However, stop words do affect the positions in tsvector, which in turn affect ranking:
SELECT to_tsvector('english', 'in the list of stop words');
to_tsvector
----------------------------
'list':3 'stop':5 'word':6
The missing positions 1,2,4 are because of stop words. Ranks calculated for documents with and without stop words are quite different:
SELECT ts_rank_cd (to_tsvector('english', 'in the list of stop words'), to_tsquery('list & stop'));
ts_rank_cd
------------
0.05
SELECT ts_rank_cd (to_tsvector('english', 'list stop words'), to_tsquery('list & stop'));
ts_rank_cd
------------
0.1
It is up to the specific dictionary how it treats stop words. For example, ispell dictionaries first normalize words and then look at the list of stop words, while Snowball stemmers first check the list of stop words. The reason for the different behavior is an attempt to decrease noise.