Use pgvector for Vector Similarity Search
pgvector is an open-source plugin for vector similarity search. It supports both exact and approximate nearest neighbor searches, as well as L2 distance, inner product, and cosine distance. For more details, see pgvector/pgvector: Open-source vector similarity search for Postgres.
Apache Cloudberry allows you to use pgvector for data storage, querying, indexing, hybrid searches, and more through SQL statements. This document explains how to use pgvector in Apache Cloudberry.
While the pgvector extension is maintained by the dedicated pgvector community, Cloudberry is committed to ensuring its compatibility and ease of use within our platform. Here is one forked pgvector version, which is contributed by the community members and customized for Cloudberry, but please note that it is not maintained as one official Cloudberry project.
Quick start
Enable the extension (do this once in each database where you want to use it):
CREATE EXTENSION vector;
Create a vector column with 3 dimensions:
CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));
Insert vector data:
INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');
Get the nearest neighbors by L2 distance:
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
Note: Use <#>
for inner product and <=>
for cosine distance.
Store data
Create a table with a vector column:
CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));
Or add a vector column to an existing table:
ALTER TABLE items ADD COLUMN embedding vector(3);
Insert vectors:
INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');
Insert and update vectors:
INSERT INTO items (id, embedding) VALUES (1, '[1,2,3]'), (2, '[4,5,6]')
ON CONFLICT (id) DO UPDATE SET embedding = EXCLUDED.embedding;
Update vectors:
UPDATE items SET embedding = '[1,2,3]' WHERE id = 1;
Delete vectors:
DELETE FROM items WHERE id = 1;
Query data
Get the nearest neighbors to a vector:
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
The supported distance functions are:
<->
: L2 distance<#>
: negative inner product<=>
: cosine distance
Get the nearest neighbors of a row:
SELECT * FROM items WHERE id != 1 ORDER BY embedding <-> (SELECT embedding FROM items WHERE id = 1) LIMIT 5;
Get rows within a specific distance range:
SELECT * FROM items WHERE embedding <-> '[3,1,2]' < 5;
Note: Using ORDER BY
and LIMIT
together can take advantage of indexing.
Get the distance:
SELECT embedding <-> '[3,1,2]' AS distance FROM items;
For inner product, multiply by -1
(because <#>
returns the negative inner product).
SELECT (embedding <#> '[3,1,2]') * -1 AS inner_product FROM items;
For cosine similarity, use 1
minus the cosine distance.
SELECT 1 - (embedding <=> '[3,1,2]') AS cosine_similarity FROM items;
Calculate the average of vectors:
SELECT AVG(embedding) FROM items;
Calculate the average of a group of vectors:
SELECT category_id, AVG(embedding) FROM items GROUP BY category_id;
Index data
By default, pgvector performs exact nearest neighbor searches, which provides a high recall rate.
If you need a higher recall rate, you can use approximate nearest neighbor search by adding an index, although this might reduce performance. Unlike adding a regular index, after adding an approximate index, queries will return different results.
pgvector supports the following index types:
- HNSW
- IVFFlat