Full text search in postgresql

5 min readJul 18, 2024

Full text searching or just text search is a process of identifying natural-language documents that satisfy a query, and optionally to sort them by relevance to the query. Although postgresql has ~, ~*, LIKE and ILIKE operator for textual data types but they lack many essential propereties like:

There is no linguistic support .
they provide no ordering of search results.
they tend to be slow because there is no index support.

Full text search allows us to indexing of documents so we can search text rapidly. this process includes :

Parsing Documents into tokens.
converting token into lexemes.
Storing preprocessed documents optimized for searching.

Document

A document is the unit like magazine article, message or other text based information. the text search engine must be able to parse documents and store association of key words with their parent document so that they are used to search for documents that contain query words. the document can be stored as simple text file system. since document can be stored outside a database but accessing it mught requires super user permission or special function support.

In Postgresql a document is normally a text field within a row of database table or combination of multiple fields or a document can be constructed form different parts of database systems.

SELECT COALESC(title,author,body,' ') as document from messages;

Basic Text Matching

In postgresql full text searching is based on the match operator @@ which returns true if a tsvector matches a tsquery.

SELECT 'The quick brown fox jumps over the lazy dog'::tsvector @@ 'fox'::tsquery as state

in above query, a tsquery is not just raw text it contains search terms, which must be already normalized lexemes and may combine multiple terms using AND, OR, NOT and FOLLOWED BY operators. there are also functions like to_tsquery, plainto_tsquery and phraseto_tsquery that are helpful in converting user written text into a proper tsquery. similarly to_tsvector is used to parse and normalize a document string. the above query can also be written as ,

SELECT to_tsvector('The quick brown fox jumps over the lazy dog') @@ to_tsquery('fox') as state

also if we write

SELECT 'The quick brown fox jumps over the lazy dog'::tsvector @@ to_tsquery('jump') as state

we get false, since here is no normalization of word jumps will occur. the elements of tsvector are lexemes, which are assumed already normalized so jumps does not match jump.

Tables and indexes

so far we have only illustrated full text matching using simple constant string. now we use how to search table and using indexes.

Searching a table : We can do a full text search without an index. a simple query to print title and description of contents table that contains ‘music’ .

select title, description from contents where to_tsvector('english',description) @@ to_tsquery('english','music')

so what happen if we change music keyword to musically ?, lets see..

select title, description from contents where to_tsvector('english',description) @@ to_tsquery('english','musically')

it finds same data so whys that ? it is because all words like music, musically reduced to same normalized lexeme.

The query above specifies that the english configuration is to be used to parse and normalize the strings. Alternatively we could omit the configuration parameters

SELECT title, description FROM contents WHERE to_tsvector(description) @@ to_tsquery('musically');

This query will use the configuration set by default_text_search_config.
Although these query work without an index, but this approach is too slow for big databases. so practical use of text searching requires creating indexing.

Creating a Index : we can create index using GIN index to speed up text searches.

CREATE INDEX content_idx on contents USING GIN (to_ts_vector('english',description));

Here, 2 argument version of to_tsvector is used because the index contents must be unaffected by default_text_search_config. if they were affected , the index contents might be inconsistent because different entries could contain tsvectors that were created with different text search configurations, and there would be no way to guess which was which.

Ranking Search Result : It measures how relevant documents are to a particular query. when there are many matching to the document we can get most relevant one at first. postgresql provides two predifined ranking functions whivh are ,

ts_rank([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4
ts_rank_cd([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4


SELECT *
from (
    SELECT
        title,description,
        ts_rank_cd(to_tsvector('english',description), 
        to_tsquery('est')) AS score
    FROM contents
) s
WHERE score > 0
ORDER BY score DESC LIMIT 10

Limitations

The length of each lexeme must be less than 2 kilobytes
The length of a tsvector (lexemes + positions) must be less than 1 megabyte 64
The number of lexemes must be less than 2
Position values in tsvector must be greater than 0 and no more than 16,383
No more than 256 positions per lexeme
The number of nodes (lexemes + operators) in a tsquery must be less than 32,768

Conclusion

PostgreSQL’s full-text search capabilities are robust, scalable, and highly effective for applications requiring advanced text search functionalities. By leveraging these features, developers can implement powerful search solutions that deliver fast and relevant results, enhancing the user experience and meeting the demands of text-heavy applications. Whether for document management systems, content-rich websites, or any application requiring detailed search capabilities, PostgreSQL’s full-text search is a valuable tool in the database management toolkit.