Use pgvector for Vector Similarity Search
pgvector is an open-source Postgres extension for vector similarity search. It supports both exact and approximate nearest neighbor search, as well as L2 distance, inner product, and cosine distance. For more details, see pgvector/pgvector: Open-source vector similarity search for Postgres.
Apache Cloudberry allows you to use pgvector for data storage, querying, indexing, hybrid searches, and more through SQL statements. This document explains how to use pgvector in Apache Cloudberry.
While the pgvector extension is maintained by the dedicated pgvector community, Cloudberry is committed to ensuring its compatibility and ease of use within our platform. A forked version of pgvector, contributed by community members and customized for Cloudberry, is also available; note that it is not maintained as an official Cloudberry project.
Quick start
Enable the extension (do this once in each database where you want to use it):
CREATE EXTENSION vector;
Create a vector column with 3 dimensions:
CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));
Insert vector data:
INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');
Get the nearest neighbors by L2 distance:
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
Note: Use <#> for inner product and <=> for cosine distance.
Store data
Create a table with a vector column:
CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));
Or add a vector column to an existing table:
ALTER TABLE items ADD COLUMN embedding vector(3);
Insert vectors:
INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');
Insert and update vectors:
INSERT INTO items (id, embedding) VALUES (1, '[1,2,3]'), (2, '[4,5,6]')
ON CONFLICT (id) DO UPDATE SET embedding = EXCLUDED.embedding;
Update vectors:
UPDATE items SET embedding = '[1,2,3]' WHERE id = 1;
Delete vectors:
DELETE FROM items WHERE id = 1;
Query data
Get the nearest neighbors to a vector:
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
The supported distance operators are:
- <-> : L2 distance
- <#> : negative inner product
- <=> : cosine distance
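As a sketch, the three operators are interchangeable in the ORDER BY clause; only the distance metric changes (assuming the items table created above):

```sql
-- L2 distance (Euclidean)
SELECT id FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;

-- Negative inner product (smallest value = largest inner product)
SELECT id FROM items ORDER BY embedding <#> '[3,1,2]' LIMIT 5;

-- Cosine distance
SELECT id FROM items ORDER BY embedding <=> '[3,1,2]' LIMIT 5;
```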
Get the nearest neighbors of a row:
SELECT * FROM items WHERE id != 1 ORDER BY embedding <-> (SELECT embedding FROM items WHERE id = 1) LIMIT 5;
Get rows within a specific distance range:
SELECT * FROM items WHERE embedding <-> '[3,1,2]' < 5;
Note: Using ORDER BY and LIMIT together can take advantage of an index.
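For example, a range query that also orders and limits its results (a sketch based on the items table above) can make use of an index:

```sql
SELECT * FROM items
WHERE embedding <-> '[3,1,2]' < 5
ORDER BY embedding <-> '[3,1,2]'
LIMIT 5;
```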
Get the distance:
SELECT embedding <-> '[3,1,2]' AS distance FROM items;
For inner product, multiply the result by -1 (because <#> returns the negative inner product):
SELECT (embedding <#> '[3,1,2]') * -1 AS inner_product FROM items;
For cosine similarity, use 1 minus the cosine distance:
SELECT 1 - (embedding <=> '[3,1,2]') AS cosine_similarity FROM items;
Calculate the average of vectors:
SELECT AVG(embedding) FROM items;
Calculate the average of a group of vectors:
SELECT category_id, AVG(embedding) FROM items GROUP BY category_id;
Index data
By default, pgvector performs exact nearest neighbor search, which provides perfect recall.
If you need faster queries, you can use approximate nearest neighbor search by adding an index, which trades some recall for speed. Unlike a regular index, an approximate index can cause queries to return different results.
pgvector supports the following index types:
- HNSW
- IVFFlat
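One way to check whether a query actually uses an approximate index is EXPLAIN. This is a sketch assuming an HNSW or IVFFlat index has already been created on the embedding column:

```sql
EXPLAIN SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
-- With an index in place, the plan shows an index scan on the
-- approximate index instead of a sequential scan followed by a sort.
```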
HNSW index
About HNSW HNSW (Hierarchical Navigable Small World) is an efficient algorithm for approximate nearest neighbor search, designed for processing large-scale and high-dimensional datasets.
The basic principles of HNSW are as follows:
- Multi-level graph structure: HNSW organizes data by building a multi-level graph. In this graph, each node represents a data point (or vector), and the edges between nodes reflect their relative proximity in space.
- Search optimization: This multi-level structure allows the search process to skip over many irrelevant data points quickly, narrowing down the neighboring area of the query vector. This greatly improves query efficiency.
HNSW is particularly useful in the following scenarios:
- High-dimensional data: HNSW index is very effective for high-dimensional datasets, because it is good at handling complex proximity relationships in high-dimensional space.
- Large datasets: HNSW index is suitable for large datasets because it balances query speed and recall rate better than many other types of indexes.
Creating an HNSW index takes longer and uses more memory, but it offers better query performance (speed-recall tradeoff). Unlike IVFFlat, HNSW does not require a training step, so you can create the index even when the table has no data.
Add an index for each distance function you plan to use.
Create an HNSW index
Each distance metric has its specific use cases. The choice of which method to use for creating an index depends on the type of search you want to optimize. For example, if your application focuses on getting vectors with similar direction but possibly different magnitudes, an index created with cosine distance might be more suitable. On the other hand, if you are concerned with the straight-line distance between vectors, you should choose an index based on L2 distance.
L2 distance:
CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);
Inner product:
CREATE INDEX ON items USING hnsw (embedding vector_ip_ops);
Cosine distance:
CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops);
The maximum dimension for indexable vectors is 2000.
HNSW index options
- m: The maximum number of connections per layer (16 by default).
- ef_construction: The size of the dynamic candidate list used to build the graph (64 by default).
CREATE INDEX ON items USING hnsw (embedding vector_l2_ops) WITH (m = 16, ef_construction = 64);
A larger ef_construction value provides higher recall, but at the cost of index build time and insert speed.
HNSW index query options
Specify the size of the dynamic candidate list for searches (40 by default). A larger value improves recall but reduces speed.
SET hnsw.ef_search = 100;
Use SET LOCAL within a transaction to apply it to a single query:
BEGIN;
SET LOCAL hnsw.ef_search = 100;
SELECT ...
COMMIT;
HNSW index build time
Index build speed increases greatly when the internal graph structure of the HNSW index fits into maintenance_work_mem
.
SET maintenance_work_mem = '8GB';
If the graph no longer fits, you will receive a notice:
NOTICE: hnsw graph no longer fits into maintenance_work_mem after 100000 tuples
DETAIL: Building will take significantly more time.
HINT: Increase maintenance_work_mem to speed up builds.