<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[VectorChord | Vector Search in Postgres]]></title><description><![CDATA[Scalable, Low-latency and Hybrid-enabled Vector Search in Postgres. successor of pgvecto.rs]]></description><link>https://blog.vectorchord.ai</link><generator>RSS for Node</generator><lastBuildDate>Thu, 16 Apr 2026 11:00:23 GMT</lastBuildDate><atom:link href="https://blog.vectorchord.ai/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[VectorChord 1.1: Native 4/8 Bit Vector Types and Per-Index Query Defaults]]></title><description><![CDATA[We’re excited to announce the release of VectorChord 1.1 as we kick off the new year of horse.
VectorChord 1.0 was a milestone for Postgres-native vector search: it made large-scale indexing fast enough to feel like iteration, not an outage. In 1.1, ...]]></description><link>https://blog.vectorchord.ai/vectorchord-11-native-48-bit-vector-types-and-per-index-query-defaults</link><guid isPermaLink="true">https://blog.vectorchord.ai/vectorchord-11-native-48-bit-vector-types-and-per-index-query-defaults</guid><category><![CDATA[vector database]]></category><category><![CDATA[Benchmark]]></category><category><![CDATA[VectorSearch]]></category><category><![CDATA[PostgreSQL]]></category><category><![CDATA[vector embeddings]]></category><category><![CDATA[vectorchord]]></category><dc:creator><![CDATA[Junyu Chen]]></dc:creator><pubDate>Sat, 14 Feb 2026 10:08:22 GMT</pubDate><content:encoded><![CDATA[<p>We’re excited to announce the release of VectorChord 1.1 as we kick off the Year of the Horse.</p>
<p><a target="_blank" href="https://blog.vectorchord.ai/vectorchord-10-developer-first-vector-search-on-postgres-100x-faster-indexing-than-pgvector">VectorChord 1.0</a> was a milestone for Postgres-native vector search: it made large-scale indexing fast enough to feel like iteration, not an outage. In 1.1, we’re building on that foundation with RaBitQ8: A new vector type that delivers 4x storage savings and faster search speeds compared to float32, all with less than 1% recall loss. Paired with RaBitQ4 and more composable query control, VectorChord 1.1 is ready for your most demanding production workloads.</p>
<p>This release focuses on two pain points we’ve observed through community feedback:</p>
<ul>
<li><p><strong>Capacity constraints.</strong> When vector indexes grow massive, scaling isn't just about throughput. It is also about SSD footprint and buffer cache pressure.</p>
</li>
<li><p><strong>Operational overhead.</strong> As the number of indexes grows, manually setting search parameters (like nprobe) for every query clutters your application logic. This becomes a major headache in scenarios like partitioned tables, where individual indexes need distinct configurations. Relying on a global session variable (GUC) makes it impossible to tune these indexes independently.</p>
</li>
</ul>
<p>VectorChord 1.1 addresses both. We’re introducing <code>rabitq8</code> and <code>rabitq4</code>, native quantized vector types that are 4x and 8x smaller than standard float32. These types shrink total index size by 4x–7x, drastically reducing storage overhead. We’ve also added per-index query defaults, allowing multi-index queries to rely on specific configurations rather than a brittle global PostgreSQL GUC.</p>
<h2 id="heading-rabitq8-amp-rabitq4-low-bit-vector-type-for-storage-efficiency">RaBitQ8 &amp; RaBitQ4: Low bit vector type for storage efficiency</h2>
<p>Vector data footprint is the hidden limiter of large-scale search on Postgres. Storing high-dimensional vectors as standard float32 (4 bytes per dimension) consumes massive amounts of SSD space and memory. As this footprint grows, it increases <strong>buffer cache pressure</strong> and forces queries to read from disk, causing P99 latency to spike.</p>
<p>VectorChord 1.1 solves this with two new quantized data types: <code>rabitq8</code> and <code>rabitq4</code>, built on extended RaBitQ (<a target="_blank" href="https://arxiv.org/abs/2409.09913">paper</a>). By using 8-bit and 4-bit representations instead of 32-bit floats, these types reduce per-vector storage by 4x and 8x respectively, while keeping accuracy loss minimal.</p>
<p>Using these types is almost identical to using <code>vector</code>. The only difference is that you quantize values with <code>quantize_to_rabitq8</code> or <code>quantize_to_rabitq4</code> during <code>INSERT</code> and <code>SELECT</code>.</p>
<pre><code class="lang-pgsql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> items (id <span class="hljs-type">bigserial</span> <span class="hljs-keyword">PRIMARY KEY</span>, embedding rabitq8(<span class="hljs-number">3</span>));
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">INDEX</span> <span class="hljs-keyword">ON</span> items <span class="hljs-keyword">USING</span> vchordrq (embedding rabitq8_l2_ops);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> items (embedding) <span class="hljs-keyword">VALUES</span> (quantize_to_rabitq8(<span class="hljs-string">'[0,0,0]'</span>));
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> items (embedding) <span class="hljs-keyword">VALUES</span> (quantize_to_rabitq8(<span class="hljs-string">'[1,1,1]'</span>));
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> items (embedding) <span class="hljs-keyword">VALUES</span> (quantize_to_rabitq8(<span class="hljs-string">'[2,2,2]'</span>));
<span class="hljs-comment">--- ...</span>
<span class="hljs-keyword">SELECT</span> id <span class="hljs-keyword">FROM</span> items <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> embedding &lt;-&gt; quantize_to_rabitq8(<span class="hljs-string">'[1,2,3]'</span>) <span class="hljs-keyword">LIMIT</span> <span class="hljs-number">100</span>;
</code></pre>
<p>On LAION with 100M 768-dimensional vectors, a typical <code>vector</code> (float32) index occupies about <strong>400GB</strong>. Using the <code>halfvec</code> (float16) type reduces that to roughly <strong>171GB</strong>. Under the same index options, switching to <code>rabitq8</code> further reduces it to about <strong>95GB</strong>, and <code>rabitq4</code> to around <strong>58GB</strong>, delivering a <strong>4x to 7x</strong> footprint reduction compared with the baseline.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770862442344/781ef6d4-aa1e-4281-b306-7de000051672.png" alt class="image--center mx-auto" /></p>
<p>The next question is the recall and throughput trade-off. <code>rabitq8</code> delivers the index size reduction while keeping recall and QPS essentially on par with the baseline (<code>float32</code> vectors). <code>rabitq4</code> is a more aggressive option. It trades a larger recall loss for the smallest possible index size, and is best suited for deployments where storage is the primary constraint.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771052144440/61043c76-0e82-441a-bd3e-c0eeea17747c.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-per-index-query-defaults-composable-multi-index-queries">Per-Index Query Defaults: Composable Multi-Index Queries</h2>
<p>In VectorChord, the <code>vchordrq.probes</code> GUC controls how much of the vector space is searched at query time, directly trading off throughput against recall. For hierarchical <code>vchordrq</code> indexes, <code>vchordrq.probes</code> must have the same length as the <code>build.internal.lists</code> configuration used at build time. Typical configurations look like:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td></td><td><code>build.internal.lists</code></td><td><code>vchordrq.probes</code></td></tr>
</thead>
<tbody>
<tr>
<td>No partition</td><td><code>[]</code></td><td><code>''</code></td></tr>
<tr>
<td>1-layer partition</td><td><code>[2000]</code></td><td><code>'40'</code></td></tr>
<tr>
<td>2-layer partition</td><td><code>[800, 640000]</code></td><td><code>'40,200'</code></td></tr>
</tbody>
</table>
</div><p>For simple vector search on a single table with a single index, setting the GUC once before running a query is a natural choice.</p>
<pre><code class="lang-pgsql"><span class="hljs-keyword">SET</span> vchordrq.probes = <span class="hljs-string">'40'</span>;
</code></pre>
<p>However, the situation changes when a single query touches more than one vector index. Indexes built with different <code>build.internal.lists</code> configurations should logically use different probes settings. Worse, when the number of list levels differs, no single global probes value can simultaneously match both indexes, making the global setting inherently unsuitable for such queries.</p>
<pre><code class="lang-pgsql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">INDEX</span> idx_1 <span class="hljs-keyword">ON</span> table_1 <span class="hljs-keyword">USING</span> vchordrq (emb vector_cosine_ops);

<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">INDEX</span> idx_2 <span class="hljs-keyword">ON</span> table_2 <span class="hljs-keyword">USING</span> vchordrq (emb vector_cosine_ops) <span class="hljs-keyword">WITH</span> (<span class="hljs-keyword">options</span> = $$<span class="pgsql">
[build.internal]
lists = [<span class="hljs-number">2000</span>]
$$</span>);

<span class="hljs-comment">-- Bad: matches idx_2 configuration but not idx_1</span>
<span class="hljs-keyword">SET</span> vchordrq.probes = <span class="hljs-string">'40'</span>;
<span class="hljs-comment">-- Bad: matches idx_1 configuration but not idx_2</span>
<span class="hljs-keyword">SET</span> vchordrq.probes = <span class="hljs-string">''</span>;

<span class="hljs-comment">-- Error: single global probes cannot satisfy both index configurations</span>
<span class="hljs-keyword">SELECT</span> <span class="hljs-string">'table_1'</span> <span class="hljs-keyword">AS</span> src, id <span class="hljs-keyword">FROM</span> table_1 <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> emb &lt;=&gt; <span class="hljs-string">'[1,2,3]'</span> <span class="hljs-keyword">LIMIT</span> <span class="hljs-number">5</span> 
<span class="hljs-keyword">UNION</span> <span class="hljs-keyword">ALL</span>
<span class="hljs-keyword">SELECT</span> <span class="hljs-string">'table_2'</span> <span class="hljs-keyword">AS</span> src, id <span class="hljs-keyword">FROM</span> table_2 <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> emb &lt;=&gt; <span class="hljs-string">'[4,5,6]'</span> <span class="hljs-keyword">LIMIT</span> <span class="hljs-number">5</span>;
</code></pre>
<p>To solve this, VectorChord 1.1 introduces per-index query defaults as index options, following PostgreSQL’s per-index configuration model. Instead of forcing every query to share a single global configuration, each index can now define its own defaults, which take effect when a global configuration is not explicitly set.</p>
<p>You can set per-index defaults at index creation time, and modify them at any time without rebuilding the index.</p>
<pre><code class="lang-pgsql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">INDEX</span> idx_1 <span class="hljs-keyword">ON</span> table_1 <span class="hljs-keyword">USING</span> vchordrq (emb vector_cosine_ops) <span class="hljs-keyword">WITH</span> (<span class="hljs-keyword">options</span> = $$<span class="pgsql">
[build.internal]
lists = []
$$</span>, probes = <span class="hljs-string">''</span>);

<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">INDEX</span> idx_2 <span class="hljs-keyword">ON</span> table_2 <span class="hljs-keyword">USING</span> vchordrq (emb vector_cosine_ops) <span class="hljs-keyword">WITH</span> (<span class="hljs-keyword">options</span> = $$<span class="pgsql">
[build.internal]
lists = [<span class="hljs-number">2000</span>]
$$</span>, probes = <span class="hljs-string">'20'</span>);

<span class="hljs-comment">-- Modify per-index probes default online</span>
<span class="hljs-keyword">ALTER</span> <span class="hljs-keyword">INDEX</span> idx_2 <span class="hljs-keyword">SET</span> (probes = <span class="hljs-string">'40'</span>);

<span class="hljs-comment">-- Success: the statement runs correctly with per-index defaults</span>
<span class="hljs-keyword">SELECT</span> <span class="hljs-string">'table_1'</span> <span class="hljs-keyword">AS</span> src, id <span class="hljs-keyword">FROM</span> table_1 <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> emb &lt;=&gt; <span class="hljs-string">'[1,2,3]'</span> <span class="hljs-keyword">LIMIT</span> <span class="hljs-number">5</span> 
<span class="hljs-keyword">UNION</span> <span class="hljs-keyword">ALL</span>
<span class="hljs-keyword">SELECT</span> <span class="hljs-string">'table_2'</span> <span class="hljs-keyword">AS</span> src, id <span class="hljs-keyword">FROM</span> table_2 <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> emb &lt;=&gt; <span class="hljs-string">'[4,5,6]'</span> <span class="hljs-keyword">LIMIT</span> <span class="hljs-number">5</span>;
</code></pre>
<p>In addition to <code>vchordrq.probes</code>, other query-time parameters can also be given per-index defaults at index creation time, so multi-index queries can rely on independent, index-specific settings. For the full list of configurable parameters, see <a target="_blank" href="https://docs.vectorchord.ai/vectorchord/usage/fallback-parameters.html">Fallback Parameters</a>.</p>
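<p>For example, assuming <code>epsilon</code> is exposed as a per-index option in the same way as <code>probes</code> (see the Fallback Parameters page for the exact option names), both defaults could be adjusted online in one statement:</p>
<pre><code class="lang-pgsql">-- hypothetical sketch: update two fallback defaults without rebuilding
ALTER INDEX idx_2 SET (probes = '40', epsilon = '1.9');
</code></pre>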
<h2 id="heading-summary">Summary</h2>
<p>VectorChord 1.1 makes large-scale deployments easier to run. <code>rabitq8</code> and <code>rabitq4</code> significantly reduce storage size, lowering storage and memory pressure while preserving a familiar SQL usage pattern. Per-index fallback query defaults make multi-index vector queries possible within a single statement, especially when those indexes are built with different options.</p>
<p>Whether you are hitting storage and memory limits in real-world workloads or running complex queries with multiple vector indexes, we invite you to upgrade to VectorChord 1.1, try it on your own workloads, and share your feedback.</p>
<ul>
<li><p><strong>Download/Star on GitHub:</strong> <a target="_blank" href="https://github.com/tensorchord/VectorChord">https://github.com/tensorchord/VectorChord</a></p>
</li>
<li><p><strong>Join the Community:</strong> <a target="_blank" href="https://discord.gg/KqswhpVgdU">https://discord.gg/KqswhpVgdU</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Scaling Vector Search to 1 Billion on PostgreSQL]]></title><description><![CDATA[For teams trying to run vector search at billion scale themselves, the challenge is often not raw performance, but practicality. Many solutions designed for billion-scale, low-latency vector search come with practical constraints, requiring tradeoffs...]]></description><link>https://blog.vectorchord.ai/scaling-vector-search-to-1-billion-on-postgresql</link><guid isPermaLink="true">https://blog.vectorchord.ai/scaling-vector-search-to-1-billion-on-postgresql</guid><category><![CDATA[vector database]]></category><category><![CDATA[Benchmark]]></category><category><![CDATA[VectorSearch]]></category><category><![CDATA[vector embeddings]]></category><category><![CDATA[PostgreSQL]]></category><category><![CDATA[Databases]]></category><category><![CDATA[big data]]></category><category><![CDATA[vectorchord]]></category><dc:creator><![CDATA[Junyu Chen]]></dc:creator><pubDate>Thu, 15 Jan 2026 09:18:09 GMT</pubDate><content:encoded><![CDATA[<p>For teams trying to run vector search at billion scale themselves, the challenge is often not raw performance, but practicality. Many solutions designed for billion-scale, low-latency vector search come with practical constraints, requiring tradeoffs that affect how easily they can be adopted.</p>
<p>Most existing approaches fall into one of a few categories:</p>
<ul>
<li><p><strong>Operationally complex</strong>: Powered by multi-node distributed systems that are difficult to manage, operate, and maintain over time.</p>
</li>
<li><p><strong>Build-time prohibitive</strong>: Requiring long index build times, which makes re-indexing costly and can easily impact production workloads.</p>
</li>
<li><p><strong>Memory heavy</strong>: Requiring up to 1 TB of memory on a single machine, making the hardware significantly less affordable for most teams.</p>
</li>
</ul>
<p>These tradeoffs are visible in existing public benchmarks. For example, <a target="_blank" href="https://www.scylladb.com/2025/12/01/scylladb-vector-search-1b-benchmark/">ScyllaDB</a> reports up to 98% recall with a P99 latency of 12.3 ms on DEEP-1B, but this result depends on multiple large instances. <a target="_blank" href="https://www.yugabyte.com/blog/benchmarking-1-billion-vectors-in-yugabytedb/">YugabyteDB</a>, on the same dataset, reports significantly higher tail latency, with P99 reaching 0.319 seconds at the same scale.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td></td><td>Latency / Recall</td><td>Hardware</td><td>Build time</td></tr>
</thead>
<tbody>
<tr>
<td>ScyllaDB</td><td>13ms / 98%</td><td>3×<code>AWS r7i.48xlarge</code> + 3×<code>AWS i4i.16xlarge</code></td><td>24.4 h</td></tr>
<tr>
<td>YugabyteDB</td><td>319 ms / 96%</td><td>Not reported</td><td>Not reported</td></tr>
<tr>
<td>VectorChord</td><td>40 ms / 95%</td><td><code>AWS i7ie.6xlarge</code></td><td>1.8 h</td></tr>
</tbody>
</table>
</div><p>For single-node deployments, previous work from <a target="_blank" href="https://intel.github.io/ScalableVectorSearch/benchs/static/previous/large_scale_benchs.html#search-with-reduced-memory-footprint">Scalable Vector Search (SVS)</a> shows that indexing DEEP-1B with <a target="_blank" href="https://github.com/nmslib/hnswlib">HNSWlib</a> typically requires 800 GiB of memory, though SVS or <a target="_blank" href="https://github.com/facebookresearch/faiss">FAISS-IVFPQ</a> can reduce it to around 300 GiB, which is still a substantial hardware requirement.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768291667540/ec225f8e-015a-4832-9aef-c5227a545a67.png" alt class="image--center mx-auto" /></p>
<p>This makes smooth scaling difficult for teams that want to stay self-hosted. Moving from 1 million to 1 billion vectors often requires rethinking the architecture: new hardware assumptions or a different workflow.</p>
<p>With <a target="_blank" href="https://blog.vectorchord.ai/vectorchord-10-developer-first-vector-search-on-postgres-100x-faster-indexing-than-pgvector">VectorChord 1.0.0</a>, scaling is far more easy. By taking advantage of the new <a target="_blank" href="https://blog.vectorchord.ai/how-we-made-100m-vector-indexing-in-20-minutes-possible-on-postgresql">Hierarchical K-means and other optimizations</a>, indexing 1B vectors follows the same process as indexing 1M vectors. This is exactly how we use it as well: Simply move to a slightly larger machine, for example from an AWS i7i.xlarge to an i7i.4xlarge.</p>
<h2 id="heading-the-deep-1b-benchmark">The DEEP-1B Benchmark</h2>
<p>To validate VectorChord’s capability at the billion-vector scale, we use the <a target="_blank" href="https://research.yandex.com/datasets/biganns">Yandex DEEP-1B</a> dataset from <a target="_blank" href="https://big-ann-benchmarks.com/neurips21.html">BIGANN</a>. DEEP-1B is a widely adopted benchmark for large-scale vector search, consisting of <strong>1 billion 96-dimensional embeddings</strong> generated from deep learning models trained on natural images.</p>
<p>Its scale and broad adoption make results easy to reproduce and compare, particularly when assessing indexing performance, query latency, and resource efficiency at the billion-vector scale.</p>
<p>For such a huge dataset, building a VectorChord index requires:</p>
<ul>
<li><p><strong>Storage</strong>: Approximately <strong>900 GB</strong> of high-performance SSD</p>
</li>
<li><p><strong>Memory (</strong><a target="_blank" href="https://www.postgresql.org/docs/current/runtime-config-resource.html#GUC-SHARED-BUFFERS"><strong>shared_buffers</strong></a><strong>)</strong>: No less than <strong>60 GB</strong>, managed by PostgreSQL</p>
</li>
<li><p><strong>Memory (extra)</strong>: At least <strong>60 GB</strong> of extra memory required during index construction</p>
</li>
</ul>
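<p>As a minimal sketch, the buffer pool can be provisioned with a standard <code>ALTER SYSTEM</code> call (a server restart is required for <code>shared_buffers</code> to take effect):</p>
<pre><code class="lang-pgsql">-- allocate the 60 GB of shared_buffers listed above, then restart PostgreSQL
ALTER SYSTEM SET shared_buffers = '60GB';
</code></pre>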
<p>Based on these requirements, an <code>AWS i7i.4xlarge</code> is the minimum configuration for indexing at this scale. Our tests were run on an <code>AWS i7ie.6xlarge</code>, reflecting practical deployments where additional memory is provisioned to reduce disk access and maintain stable query latency.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Instance</td><td><code>AWS i7i.4xlarge</code></td><td><code>AWS i7ie.6xlarge</code></td></tr>
</thead>
<tbody>
<tr>
<td>Physical Processor</td><td>Intel Xeon Scalable (Emerald Rapids)</td><td>Intel Xeon Scalable (Emerald Rapids)</td></tr>
<tr>
<td>vCPUs</td><td>16</td><td>24</td></tr>
<tr>
<td>Memory (GiB)</td><td>128</td><td>192</td></tr>
<tr>
<td>Disk Space</td><td>3750 GB NVMe SSD</td><td>2×7500 GB NVMe SSD</td></tr>
<tr>
<td>Price</td><td><strong>$1088</strong> monthly</td><td><strong>$2246</strong> monthly</td></tr>
</tbody>
</table>
</div><p>All experiments were run with VectorChord 1.0.0 on PostgreSQL 17. The SQL below is the exact command we used to build the index:</p>
<pre><code class="lang-pgsql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">INDEX</span> <span class="hljs-keyword">ON</span> deep <span class="hljs-keyword">USING</span> vchordrq (embedding vector_l2_ops) <span class="hljs-keyword">WITH</span> (<span class="hljs-keyword">options</span> = $$<span class="pgsql">
build.pin = <span class="hljs-number">2</span>
residual_quantization = <span class="hljs-keyword">true</span>
[build.internal]
build_threads = <span class="hljs-number">24</span>
lists = [<span class="hljs-number">800</span>, <span class="hljs-number">640000</span>]
kmeans_algorithm.hierarchical = {}
$$</span>);
</code></pre>
<p>Here’s what each option does in our build configuration:</p>
<ul>
<li><p><a target="_blank" href="https://docs.vectorchord.ai/vectorchord/usage/indexing.html#build-pin"><code>build.pin</code></a>: Enables build-time pinning, caching the hot portion of the index in shared memory to speed up indexing on large datasets.</p>
</li>
<li><p><a target="_blank" href="https://docs.vectorchord.ai/vectorchord/usage/indexing.html#residual-quantization"><code>residual_quantization</code></a>: On DEEP-1B, we find that enabling <strong>residual quantization</strong> improves query performance, so we keep it on for this benchmark.</p>
</li>
<li><p><a target="_blank" href="https://docs.vectorchord.ai/vectorchord/usage/indexing.html#build-internal-build-threads"><code>build.internal.build_threads</code></a>: Uses 24 threads for the K-means build stage, helping saturate available CPU resources on the instance.</p>
</li>
<li><p><a target="_blank" href="https://docs.vectorchord.ai/vectorchord/usage/indexing.html#build-internal-lists"><code>build.internal.lists</code></a>: Based on our experience, we choose the appropriate list according to the number of rows. Using a <strong>two-level list</strong> helps significantly at large scale, improving both index build efficiency and query performance.</p>
</li>
<li><p><a target="_blank" href="https://docs.vectorchord.ai/vectorchord/usage/indexing.html#build-internal-kmeans-algorithm"><code>build.internal.kmeans_algorithm.hierarchical</code></a>: Enables the <strong>Hierarchical K-means</strong> path introduced in VectorChord 1.0, which significantly accelerates index construction at scale.</p>
</li>
</ul>
<h2 id="heading-our-results">Our results</h2>
<p>Index construction completed in <strong>6,408 seconds (≈ 1.8 hours)</strong> when utilizing <strong>24 CPU cores</strong> on a single AWS i7ie.6xlarge machine, demonstrating that billion-scale indexing can be completed within a practical window.</p>
<p>The figure shows query throughput versus recall on the DEEP-1B dataset using a single search thread, evaluated for both Top-10 and Top-100 queries, with all queries run against a warm cache.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767855767038/b2a769c9-0d5d-48d6-ab2f-791d6bac7e29.png" alt class="image--center mx-auto" /></p>
<p>For Top-10, throughput ranges from over 117 QPS at ~0.91 recall to around 69 QPS at ~0.95 recall. Top-100 queries follow the same pattern, with throughput decreasing as recall increases.</p>
<p>The table below lists the exact search parameters behind each data point, varying the number of probes while keeping <code>epsilon = 1.9</code> fixed. Together, the figure and table show that even at the 1B-vector scale, VectorChord provides stable and tunable query performance on a single machine.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><a target="_blank" href="https://docs.vectorchord.ai/vectorchord/usage/indexing.html#vchordrq-probes">probes</a> / <a target="_blank" href="https://docs.vectorchord.ai/vectorchord/usage/indexing.html#vchordrq-epsilon">epsilon</a></td><td>Recall@Top 10</td><td>QPS</td><td>P99 latency / ms</td></tr>
</thead>
<tbody>
<tr>
<td>40,120 / 1.9</td><td>0.9132</td><td>117.45</td><td>12.33</td></tr>
<tr>
<td>40,160 / 1.9</td><td>0.9305</td><td>97.32</td><td>14.61</td></tr>
<tr>
<td>40,250 / 1.9</td><td>0.9511</td><td>68.87</td><td>20.53</td></tr>
</tbody>
</table>
</div><div class="hn-table">
<table>
<thead>
<tr>
<td><a target="_blank" href="https://docs.vectorchord.ai/vectorchord/usage/indexing.html#vchordrq-probes">probes</a> / <a target="_blank" href="https://docs.vectorchord.ai/vectorchord/usage/indexing.html#vchordrq-epsilon">epsilon</a></td><td>Recall@Top 100</td><td>QPS</td><td>P99 latency / ms</td></tr>
</thead>
<tbody>
<tr>
<td>40,180 / 1.9</td><td>0.9051</td><td>66.12</td><td>22.60</td></tr>
<tr>
<td>40,270 / 1.9</td><td>0.9318</td><td>49.55</td><td>30.40</td></tr>
<tr>
<td>40,390 / 1.9</td><td>0.9509</td><td>37.53</td><td>39.70</td></tr>
</tbody>
</table>
</div><p>These results demonstrate that vector query performance can be practical and predictable on a single machine, even at the billion scale.</p>
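<p>For reference, each data point corresponds to session-level settings like the following minimal sketch (the two-level <code>probes</code> value matches <code>lists = [800, 640000]</code>; the query vector literal is elided):</p>
<pre><code class="lang-pgsql">SET vchordrq.probes = '40,120';
SET vchordrq.epsilon = 1.9;
-- '[...]' stands for a 96-dimensional DEEP-1B query vector
SELECT id FROM deep ORDER BY embedding &lt;-&gt; '[...]'::vector LIMIT 10;
</code></pre>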
<h2 id="heading-summary"><strong>Summary</strong></h2>
<p>VectorChord 1.0.0 demonstrates that real-time vector search can scale cleanly from 1 million to 1 billion vectors without forcing users to change architectures, workflows, or usage patterns. Whether you’re building image search, AI-powered applications, or RAG pipelines, VectorChord is designed to be a reliable vector engine that runs on your own machine, scaling naturally as your data grows.</p>
<p>Ready to scale up? You can get started today, or reach out to us on <a target="_blank" href="https://github.com/tensorchord/VectorChord">GitHub</a> or <a target="_blank" href="https://discord.gg/KqswhpVgdU">Discord</a> to learn more and get support from the community.</p>
]]></content:encoded></item><item><title><![CDATA[VectorChord 1.0: Developer-First Vector Search on Postgres, 100x Faster Indexing than pgvector]]></title><description><![CDATA[Two years ago, when we published the very first pgvecto.rs blog post, we made a bet: Postgres is the best place to do vector search. Since then we’ve been iterating on that bet — from VBASE with filtered vector search, to longer vector support, to th...]]></description><link>https://blog.vectorchord.ai/vectorchord-10-developer-first-vector-search-on-postgres-100x-faster-indexing-than-pgvector</link><guid isPermaLink="true">https://blog.vectorchord.ai/vectorchord-10-developer-first-vector-search-on-postgres-100x-faster-indexing-than-pgvector</guid><dc:creator><![CDATA[Jinjing Zhou]]></dc:creator><pubDate>Thu, 04 Dec 2025 07:27:23 GMT</pubDate><content:encoded><![CDATA[<p>Two years ago, when we published the very first <a target="_blank" href="https://github.com/tensorchord/pgvecto.rs">pgvecto.rs</a> blog post, we made a bet: <strong>Postgres is the best place to do vector search</strong>. Since then we’ve been iterating on that bet — from VBASE with filtered vector search, to longer vector support, to the RaBitQ quantization scheme and disk‑friendly index layouts.</p>
<p>With VectorChord 1.0 we’re moving the needle again. On a 16 vCPU machine, we can now build an index over 100M vectors in under 20 minutes. On the same scale, pgvector needs more than 50 hours. That number sounds impressive, but the point of this release isn’t just to win a benchmark slide. It’s to make your <strong>actual</strong> development and iteration loop much faster.</p>
<p>This post is organized into three parts:</p>
<ol>
<li><p><a class="post-section-overview" href="#heading-1-simplicity-over-complexity-breaking-the-hnsw-is-always-better-myth">Why we chose a simpler IVF + RaBitQ index instead of HNSW, and how that plays much better with Postgres.</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-2-making-indexing-feel-like-iteration-not-an-outage">How we made index build time short enough that you can treat “rebuild” as part of normal iteration, not an overnight job.</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-3-features-built-for-developers-not-just-for-a-benchmark-chart">What we’ve added around developer experience so VectorChord feels like a developer tool, not just a performance demo.</a></p>
</li>
</ol>
<hr />
<h2 id="heading-1-simplicity-over-complexity-breaking-the-hnsw-is-always-better-myth">1. Simplicity over complexity: breaking the “HNSW is always better” myth</h2>
<p>When people first talk to us about VectorChord, one of the most common questions is very direct:</p>
<blockquote>
<p>“Are you using HNSW? I heard HNSW is always better than IVF.”</p>
</blockquote>
<p>On paper, HNSW is a beautiful algorithm. In many isolated benchmarks it wins. But we don’t run vector search inside a vacuum — we run it inside Postgres, with its own storage model, vacuuming, MVCC, and operational habits.</p>
<p>If you look at it from that angle, the trade‑off changes a lot.</p>
<h3 id="heading-the-postgres-reality-of-hnsw">The Postgres reality of HNSW</h3>
<p>HNSW uses a layered graph structure. Every new vector is a node that may connect to several other nodes across multiple layers. Every delete potentially removes a node that many other nodes depend on.</p>
<p>In a stand‑alone vector engine you can design the entire storage engine around this structure. Inside Postgres, you can’t. You have to live inside the existing table/index model, work with vacuum, and behave well under MVCC.</p>
<p>That’s where the pain starts:</p>
<ul>
<li><p><strong>Insertions are heavy.</strong> Inserting a single vector into an HNSW index can trigger cascades of changes across multiple nodes and layers. Under high write load this makes it harder to keep latency stable.</p>
</li>
<li><p><strong>Deletions are subtle.</strong> In practice, a delete often just marks a node as dead but leaves it in the graph so connectivity isn’t broken immediately. As more nodes are marked this way, search still walks through them, which adds extra latency when a large fraction of the graph is “dead but still present.”</p>
</li>
<li><p><strong>Vacuum pays the real price.</strong> You can’t simply drop those nodes, because removing them outright would break neighborhoods and disconnect parts of the graph. To fix that, pgvector has to reinsert all the neighbors of a dead node so the graph stays connected without it. That reinsertion work is what makes maintenance expensive; the core issue isn’t just reclaiming dead tuples, it’s re‑wiring the graph around every deleted node.</p>
</li>
</ul>
<p>None of this is impossible, but it pushes you toward high cost whenever you have frequent updates.</p>
<h3 id="heading-why-we-chose-ivf-rabitq-instead">Why we chose IVF + RaBitQ instead</h3>
<p>VectorChord’s core index is IVF + RaBitQ with simple posting lists. Vectors are routed into coarse clusters, and inside each cluster we store a tiny quantized code instead of a 768‑dimensional float vector. At query time almost all the work happens on these bit‑packed codes, using table lookups and cheap integer math.</p>
<p>Because the index mostly scans compressed codes, a posting‑list scan stays fast even when it touches many entries. In our benchmarks this makes VectorChord clearly faster than a naïve IVF that compares full‑precision vectors, and still faster than pgvector’s HNSW index, which walks its graph and scores neighbors with full‑precision arithmetic.</p>
<p>People sometimes ask “what about HNSW + quantized vectors?” You can do that, and it can speed up the first phase of a search, where you traverse the quantized vectors and pick candidates. But you still need a second phase that fetches and scores full‑precision vectors based on those candidates, and that part doesn’t care which index you used. In typical workloads the first phase is only about 40% of total latency, so even a 2x faster scan would only cut end‑to‑end time by roughly 20% — and because HNSW can’t lay out data in the tight, batch‑friendly fast‑scan format, even that theoretical gain may be hard to realize in practice.</p>
<p>With IVF + RaBitQ the postings are just arrays of compressed entries. Inserts append a new code; deletes clear a code. There is no global graph to repair, so frequent updates don’t trigger cascades of work. Operationally it behaves like a normal index, and in our tests this design handles around 10x the update throughput of pgvector’s HNSW index while keeping latency stable. For additional performance numbers — especially around query performance — please see our earlier blogs <a target="_blank" href="https://blog.vectorchord.ai/vector-search-over-postgresql-a-comparative-analysis-of-memory-and-disk-solutions"><strong>Vector Search Over PostgreSQL: A Comparative Analysis of Memory and Disk Solutions</strong></a>.</p>
<p>Most importantly, this simplicity is what enables the improvements in the next sections. If your index structure is already incredibly intricate, every extra optimization adds more moving parts. By keeping the core design simpler, we gave ourselves room to make <strong>index builds</strong> and <strong>developer workflows</strong> dramatically better.</p>
<hr />
<h2 id="heading-2-making-indexing-feel-like-iteration-not-an-outage">2. Making indexing feel like iteration, not an outage</h2>
<p>Let’s talk about the number in the title: <strong>100x faster indexing than pgvector</strong>.</p>
<p>On paper, the comparison looks like this on the LAION 100M‑vector dataset:</p>
<ul>
<li><p>pgvector: more than 50 hours of index build time on 16 vCPUs (and it may fail if memory is insufficient).</p>
</li>
<li><p>VectorChord 0.1: KMeans done externally in about 2 hours on a GPU; insertion in Postgres taking around 20 hours on 4 vCPUs.</p>
</li>
<li><p>VectorChord 1.0: KMeans and insertion both done inside Postgres, finishing in under 20 minutes on 16 vCPUs (about 8 minutes for KMeans + 12 minutes for insertion).</p>
</li>
</ul>
<p>But what changes for you is much simpler:</p>
<ul>
<li><p>In the pgvector world, indexing a large dataset is a <strong>multi‑day event</strong>. You plan around it, you babysit it, you worry about what happens if it fails.</p>
</li>
<li><p>In the VectorChord 1.0 world, a full rebuild is closer to <strong>“run it before lunch and check the results after coffee.”</strong></p>
</li>
</ul>
<p>We got there by attacking both phases of IVF building: the KMeans step that finds centroids, and the insertion step that assigns every vector to its nearest centroid.</p>
<p>To get there, we tackled the problem instead of just the code. First, instead of micro‑optimizing a naïve 768‑dimensional, 160,000‑centroid KMeans over 100M points, we project vectors down to a much smaller space using the Johnson–Lindenstrauss lemma, which cuts the KMeans compute and memory footprint by roughly 7x. Second, we run hierarchical KMeans in two stages so we only ever cluster smaller subsets of the data instead of 160,000 centroids at once, which in theory yields a roughly 400x speedup. For insertion, we reuse the same idea one level up by building an IVF over the centroids themselves, so each data vector only compares against a small set of candidate centroid buckets using quantized codes.</p>
<p>Together these changes move most of the work out of CPU‑bound distance math: for a 100M‑vector build the dominant cost becomes Postgres allocating and writing index pages at roughly the SSD limit. After tightening the allocation path and lock granularity so we can stream pages out in large chunks, the practical result is that a full rebuild fits comfortably into minutes instead of days. For more details, we've written a dedicated blog <a target="_blank" href="https://blog.vectorchord.ai/how-we-made-100m-vector-indexing-in-20-minutes-possible-on-postgresql"><strong>How We Made 100M Vector Indexing in 20 Minutes Possible on PostgreSQL</strong></a> explaining the technical details.</p>
<p>These optimizations can be done easily with the following SQL:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">INDEX</span> <span class="hljs-keyword">ON</span> laion <span class="hljs-keyword">USING</span> vchordrq (embedding vector_l2_ops) <span class="hljs-keyword">WITH</span> (options = $$
build.pin = <span class="hljs-number">2</span>
[build.internal]
lists = [<span class="hljs-number">400</span>, <span class="hljs-number">160000</span>]        <span class="hljs-comment">-- Hierarchical KMeans</span>
build_threads = <span class="hljs-number">16</span>
spherical_centroids = <span class="hljs-literal">true</span>
kmeans_algorithm.hierarchical = {}
kmeans_dimension = <span class="hljs-number">100</span> <span class="hljs-comment">-- Dimension Reduction</span>
sampling_factor = <span class="hljs-number">32</span>        
$$);
</code></pre>
<hr />
<h2 id="heading-3-features-built-for-developers-not-just-for-a-benchmark-chart">3. Features built for developers, not just for a benchmark chart</h2>
<p>VectorChord 1.0 also adds a set of features that barely show up in benchmarks, but matter a lot when you live with the system every day. They’re all about helping you understand how your index behaves, and about letting you use modern models and deployments without friction.</p>
<h3 id="heading-builtin-monitoring-of-index-quality">Built‑in monitoring of index quality</h3>
<p>All approximate nearest‑neighbor indexes drift over time. Data distributions change, and what was a great index at build time might quietly degrade.</p>
<p>Instead of leaving you to guess, VectorChord can continuously measure recall for you. It automatically samples real query vectors, re‑evaluates their neighbors with a more exact method, and tracks how often the index returns the “right” answers.</p>
<p>This turns a vague feeling — “search seems a bit off lately” — into a graph you can look at. If recall is steady, you know you can postpone a rebuild. If it’s trending down, you can plan a rebuild before users notice. It also gives you something concrete to feed into your existing observability stack, or into Prometheus so you can keep an eye on recall on the same dashboards as your other SLOs.</p>
<p>In practice, you can evaluate recall for a real query pattern with a single SQL call, and then export that number as a metric. For example:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> vchordrq_evaluate_query_recall(<span class="hljs-keyword">query</span> =&gt; $$
  <span class="hljs-keyword">SELECT</span> ctid <span class="hljs-keyword">FROM</span> items <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> embedding &lt;-&gt; <span class="hljs-string">'[3,1,2]'</span> <span class="hljs-keyword">LIMIT</span> <span class="hljs-number">10</span>
$$);
<span class="hljs-comment">-- With sampled vector recorded from query</span>
<span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">AVG</span>(recall_value) <span class="hljs-keyword">FROM</span> (
    <span class="hljs-keyword">SELECT</span> vchordrq_evaluate_query_recall(
            <span class="hljs-keyword">format</span>(
                <span class="hljs-string">'SELECT ctid FROM %I.%I ORDER BY %I OPERATOR(%s) %L LIMIT 10'</span>,
                lq.schema_name,
                lq.table_name,
                lq.column_name,
                lq.operator,
                lq.value
            )
    ) <span class="hljs-keyword">AS</span> recall_value
    <span class="hljs-keyword">FROM</span> vchordrq_sampled_queries(<span class="hljs-string">'items_embedding_idx'</span>) <span class="hljs-keyword">AS</span> lq
) <span class="hljs-keyword">AS</span> eval_results;
</code></pre>
<h3 id="heading-long-vector-support-so-you-dont-have-to-cripple-your-model">Long vector support so you don’t have to cripple your model</h3>
<p>We support vectors up to 16,000 dimensions. That sounds like a dry specification, but it has a clear practical effect: you can plug in newer models, including long‑context or multimodal ones, without immediately needing to compress or truncate their outputs just to make the index happy.</p>
<p>You can start with the representation your model naturally produces, get a feel for performance and quality, and then decide whether you want to apply more aggressive quantization or dimensionality reduction. The index shouldn’t be the thing forcing you to compromise on model choice.</p>
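<p>Nothing about the SQL changes as dimensions grow; a minimal sketch with a hypothetical 1536‑dimensional embedding column:</p>
<pre><code class="lang-sql">CREATE TABLE docs (id bigserial PRIMARY KEY, embedding vector(1536));
CREATE INDEX ON docs USING vchordrq (embedding vector_cosine_ops);
</code></pre>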
<h3 id="heading-multivector-retrieval-for-richer-rag">Multi‑vector retrieval for richer RAG</h3>
<p>Not every document fits into a single vector. Modern retrieval‑augmented generation (RAG) systems often represent a passage as a <em>set</em> of vectors — for example, one per token or one per sentence — and then compare that set to a set of query vectors. This “late interaction” style is usually called multi‑vector retrieval.</p>
<p>VectorChord supports this pattern natively via a MaxSim‑style operator over arrays of vectors. Conceptually, for each query vector you look for the best‑matching document vector, take their dot product, and then sum those best scores. In VectorChord this is exposed as the distance‑based <code>@#</code> operator on <code>vector[]</code> columns: the left‑hand side is the document’s vector array, the right‑hand side is the query’s vector array.</p>
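<p>Conceptually, for a query vector set \(Q\) and a document vector set \(D\), the score is \(\mathrm{MaxSim}(Q, D) = \sum_{q \in Q} \max_{d \in D} \langle q, d \rangle\); since <code>@#</code> is distance-based, smaller values correspond to better matches (effectively the negated score).</p>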
<p>Getting started looks very similar to single‑vector search. You store an array of vectors per row and build a dedicated index:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> items (
  <span class="hljs-keyword">id</span>         bigserial PRIMARY <span class="hljs-keyword">KEY</span>,
  embeddings vector(<span class="hljs-number">3</span>)[]
);

<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">INDEX</span> <span class="hljs-keyword">ON</span> items
<span class="hljs-keyword">USING</span> vchordrq (embeddings vector_maxsim_ops);
</code></pre>
<p>At query time you pass in an array of query vectors and order by the <code>@#</code> score:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> *
<span class="hljs-keyword">FROM</span> items
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> embeddings @<span class="hljs-comment"># ARRAY[</span>
  <span class="hljs-string">'[3,1,2]'</span>::vector,
  <span class="hljs-string">'[2,2,2]'</span>::vector
]
<span class="hljs-keyword">LIMIT</span> <span class="hljs-number">5</span>;
</code></pre>
<p>Under the hood VectorChord applies its ANN machinery to these vector arrays, so you get the expressiveness of multi‑vector models with the same kind of performance you expect from single‑vector IVF indexes — all inside Postgres and plain SQL.</p>
<h3 id="heading-multiplatform-simd">Multi‑platform SIMD</h3>
<p>VectorChord ships with SIMD acceleration for x86_64, ARM and IBM architectures. At runtime we detect the best available instruction set — AVX512 and friends where available — and use it without you having to tune build flags or keep separate binaries.</p>
<h3 id="heading-experimental-diskann-rabitq">Experimental DiskANN + RaBitQ</h3>
<p>We also have an experimental index type combining DiskANN with 2‑bit RaBitQ. On some datasets and recall targets it can deliver higher QPS than IVF + RaBitQ. The trade‑off is that indexing and updates are noticeably slower and more complex.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">INDEX</span> <span class="hljs-keyword">ON</span> items <span class="hljs-keyword">USING</span> vchordg (embedding vector_l2_ops);
</code></pre>
<p>We don’t recommend this as the default choice. It’s there for teams who have very specific workloads, know exactly what they are doing, and are willing to pay higher operational costs for more QPS in a narrow slice of the parameter space. For everyone else, IVF + RaBitQ remains the recommended workhorse.</p>
<h3 id="heading-similarity-filters-that-stop-scanning-early">Similarity filters that stop scanning early</h3>
<p>Sometimes you don’t just want the nearest neighbors; you want “everything within this radius, up to N rows.” The most natural way to write that in SQL is with a distance in the WHERE clause, plus ORDER BY and LIMIT:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> *
<span class="hljs-keyword">FROM</span> items
<span class="hljs-keyword">WHERE</span> embedding &lt;-&gt; <span class="hljs-string">'[0,0,0]'</span> &lt; <span class="hljs-number">0.1</span>
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> embedding &lt;-&gt; <span class="hljs-string">'[0,0,0]'</span>
<span class="hljs-keyword">LIMIT</span> <span class="hljs-number">10</span>;
</code></pre>
<p>This returns the right answers, but it performs poorly when only a few points fall inside the radius. Postgres still has to keep scanning the index until it finds ten rows or exhausts the search space, because the distance check is just a filter applied after the ANN scan.</p>
<p>VectorChord adds a “similarity filter” syntax that lets the distance threshold be pushed down into the index itself. You wrap the query vector and radius in a <code>sphere()</code> value, and use the <code>&lt;&lt;-&gt;&gt;</code> operator so the index knows it can stop as soon as the search region moves beyond that sphere:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> *
<span class="hljs-keyword">FROM</span> items
<span class="hljs-keyword">WHERE</span> embedding &lt;&lt;-&gt;&gt; sphere(<span class="hljs-string">'[0,0,0]'</span>::vector, <span class="hljs-number">0.1</span>)
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> embedding &lt;-&gt; <span class="hljs-string">'[0,0,0]'</span>
<span class="hljs-keyword">LIMIT</span> <span class="hljs-number">10</span>;
</code></pre>
<h3 id="heading-prefilter-and-postfilter-so-queries-match-your-data-model">Prefilter and postfilter so queries match your data model</h3>
<p>In real applications, vector search almost never happens in isolation. You filter by tenant, permissions, time ranges, or content type, then rank by vector similarity, or sometimes the other way around.</p>
<p>VectorChord lets you choose between prefiltering and postfiltering at the index level. In low‑selectivity scenarios, prefiltering can reduce the candidate set enough to get up to roughly five‑fold QPS improvements. In other cases you might want to search first and filter only the top results for better recall. The important part is that you can express both patterns naturally in Postgres.</p>
<p>Enabling prefiltering for a session is a single SQL statement:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SET</span> vchordrq.prefilter = <span class="hljs-keyword">on</span>;
</code></pre>
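<p>With that set, a filtered search is just ordinary SQL; a sketch assuming a hypothetical <code>tenant_id</code> column on <code>items</code>:</p>
<pre><code class="lang-sql">-- the tenant filter is applied to candidates during the index scan
-- rather than after ranking, shrinking the candidate set early
SELECT id
FROM items
WHERE tenant_id = 42
ORDER BY embedding &lt;-&gt; '[1,2,3]'
LIMIT 10;
</code></pre>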
<h3 id="heading-text-search-with-vectorchordbm25">Text search with VectorChord‑BM25</h3>
<p>Finally, we added a VectorChord‑BM25 extension that brings strong text search directly into your Postgres instance. It supports multiple languages and advanced tokenization, and aims to be competitive with Elasticsearch‑style relevance for many workloads.</p>
<p>The idea is not to replace every dedicated search engine, but to let you build systems where BM25 and vector search live side by side, inside the same database, using the same operational tools. For many teams that alone removes a lot of deployment and maintenance burden.</p>
<p>In practice you can call it from plain SQL, combine it with your embeddings, and order by a unified score. For example:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">id</span>,
       passage,
       embedding &lt;&amp;&gt; to_bm25query(<span class="hljs-string">'documents_embedding_bm25'</span>, tokenize(<span class="hljs-string">'PostgreSQL'</span>, <span class="hljs-string">'bert'</span>)) <span class="hljs-keyword">AS</span> <span class="hljs-keyword">rank</span>
<span class="hljs-keyword">FROM</span> documents
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> <span class="hljs-keyword">rank</span>
<span class="hljs-keyword">LIMIT</span> <span class="hljs-number">10</span>;
</code></pre>
<hr />
<h2 id="heading-closing-thoughts">Closing thoughts</h2>
<p>VectorChord 1.0 is easy to summarize as “100x faster indexing than pgvector on 100M vectors.” That headline is true, but it is not the main story.</p>
<p>Our goal is to make VectorChord one of the best ways to do retrieval on Postgres, from the first prototype to billion‑scale datasets. If you’re already using pgvector, we’d love you to try VectorChord 1.0 on your real workloads and tell us where it helps and where it can do better. We would also like to express our appreciation to the EnterpriseDB team for their valuable feedback throughout this work.</p>
]]></content:encoded></item><item><title><![CDATA[How We Made 100M Vector Indexing in 20 Minutes Possible on PostgreSQL]]></title><description><![CDATA[1. Introduction
In the past few months, we’ve heard consistent feedback from users and partners: while our goal of providing a scalable, high-performance alternative to pgvector is well-received, index build time and memory usage remain major concern...]]></description><link>https://blog.vectorchord.ai/how-we-made-100m-vector-indexing-in-20-minutes-possible-on-postgresql</link><guid isPermaLink="true">https://blog.vectorchord.ai/how-we-made-100m-vector-indexing-in-20-minutes-possible-on-postgresql</guid><category><![CDATA[VectorSearch]]></category><category><![CDATA[PostgreSQL]]></category><category><![CDATA[pgvector]]></category><category><![CDATA[vectorchord]]></category><category><![CDATA[Performance Optimization]]></category><category><![CDATA[vector similarity]]></category><category><![CDATA[vector database]]></category><category><![CDATA[K means Clustering ]]></category><category><![CDATA[infrastructure]]></category><dc:creator><![CDATA[Junyu Chen]]></dc:creator><pubDate>Wed, 03 Dec 2025 09:12:58 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-1-introduction">1. Introduction</h2>
<p>In the past few months, we’ve heard consistent feedback from users and partners: while our goal of providing a scalable, high-performance alternative to pgvector is well-received, index build time and memory usage remain major concerns at billion-scale.</p>
<p>Now VectorChord can index 100 million 768-dimensional vectors in 20 minutes on a 16 vCPU machine with just 12 GB of memory. By contrast, indexing the same data with pgvector requires around 200 GB of memory and about 40 hours on a 16-core instance. And pgvector with insufficient memory often suffers from page swapping, making builds even slower.</p>
<p>In short, memory usage and build time have become the key barriers to large-scale deployment of vectors. Through a series of targeted optimizations, we reduced build time to <strong>20 minutes</strong> and memory usage by <strong>7×</strong>, with only minor accuracy trade-offs.</p>
<p>With these improvements, we can now use far cheaper machines with much less memory, without a GPU, <strong>while still hosting 100 million 768-dimensional vectors</strong>:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td></td><td>Instance</td><td>Price</td><td>Memory used / total</td></tr>
</thead>
<tbody>
<tr>
<td>Previous minimum</td><td>Amazon i7i.8xlarge</td><td>🟨 $2174 monthly</td><td>135 GB / 256 GB</td></tr>
<tr>
<td>Recommended for faster indexing</td><td>Amazon i7i.4xlarge</td><td>✅ $<strong>1087</strong> monthly</td><td>12 GB / 128 GB</td></tr>
<tr>
<td>Minimum</td><td>Amazon i7i.xlarge + GPU for indexing</td><td>✅ $<strong>272</strong> monthly + GPU cost</td><td>6 GB / 32 GB</td></tr>
</tbody>
</table>
</div><p>In the following sections, we will introduce how we optimized these phases to make index building faster and more memory-efficient. Each optimization targets one phase of the build:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Optimization</td><td>Target phase</td><td>Result</td></tr>
</thead>
<tbody>
<tr>
<td><a class="post-section-overview" href="#heading-hierarchical-k-means">Hierarchical K-means</a> ➕ <a class="post-section-overview" href="#heading-dimensionality-reduction">Dimensionality Reduction</a></td><td>1️⃣ Initialization</td><td>🚀 30 min (GPU) → 8 min (CPU) 🧾 135 → 23 GB</td></tr>
<tr>
<td><a class="post-section-overview" href="#heading-4-reducing-contention">Reducing Contention</a></td><td>2️⃣ Insertion</td><td>🚀 420 min → 9 min</td></tr>
<tr>
<td><a class="post-section-overview" href="#heading-5-parallelize-compaction">Parallelize Compaction</a></td><td>3️⃣ Compaction</td><td>🚀 8 min → 1 min</td></tr>
</tbody>
</table>
</div><h2 id="heading-2-background">2. Background</h2>
<p>The index type used in VectorChord, <strong>vchordrq</strong>, is logically a tree of height \(n+1\). The first \(n\) levels of the tree are immutable and serve purely as the routing structure for search. The \((n+1)\)-th level stores all the data.</p>
<p>If \(n=1\), the index is a flat, non-partitioned structure (built with <code>lists = []</code>). If \(n=2\), it is an inverted file index (e.g. <code>lists = [2000]</code>). If \(n=3\), it has an additional routing layer (e.g. <code>lists = [800, 640000]</code>).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764057309344/ab2bc34f-0185-4513-bdf6-ff82c91db0c4.webp" alt class="image--center mx-auto" /></p>
<p>The index building can be divided into 3 phases: <strong>Initialization</strong>, <strong>insertion</strong> and <strong>compaction</strong>.</p>
<ul>
<li><p><strong>Initialization</strong> <strong>Phase</strong><br />  In this phase, the top \(n\) levels of the tree are created. First, the index samples vectors from the table. Then it builds the tree by clustering the samples, the centroids, the centroids of centroids, and so on for \(n\) levels. Finally, the tree is written to the index.</p>
</li>
<li><p><strong>Insertion Phase</strong><br />  The index inserts vectors from the table into the bottom level of the tree.</p>
</li>
<li><p><strong>Compaction Phase</strong></p>
<p>  The index converts all the inserted vectors from non-compact layout to compact layout.</p>
</li>
</ul>
<h2 id="heading-3-making-clustering-faster-and-more-memory-efficient">3. Making Clustering Faster and More Memory-efficient</h2>
<p>In the past, although we could build an index over 100 million vectors on small instances, it typically required a GPU to accelerate clustering.</p>
<p>The main bottleneck in the initialization phase is clustering, which is time-consuming and memory-intensive. In fact, it determines the minimum memory requirement of index building. If we can implement clustering on the CPU in a way that is both fast and memory-efficient, it becomes practical to build indexes on small instances without large memory or GPUs.</p>
<p>Let \(n\) be the number of vectors, \(c\) be the number of centroids, \(d\) be the dimension of vectors, and \(l\) be the number of iterations. The time complexity of K-means is \(O(ncdl)\), and the space complexity of it is \(O(nd + cd)\). Let \(f\) be the sampling factor, in other words, \(n=fc\). The time complexity of K-means is \(O(fc^2dl)\), and the space complexity of it is \(O(fcd)\).</p>
<p>In the following sections, we will explain how to reduce the complexity, as well as decrease \(d\) and \(f\) for better performance.</p>
<h3 id="heading-hierarchical-k-means">Hierarchical K-means</h3>
<p>No matter how the implementation is optimized, plain K-means can only be sped up by a constant factor; its time complexity remains \(O(fc^2dl)\). Even on a GPU, this would take 30 minutes. So we must reduce the time complexity itself.</p>
<p>A simple idea is to divide the samples into multiple disjoint subsets, run K-means on each subset, and then merge the centroids from all subsets. To balance the size of the subsets against their number, we choose \(\sqrt{c}\) as the number of subsets. To generate these \(\sqrt{c}\) subsets, we first perform a small-scale K-means to obtain \(\sqrt{c}\) centroids and then assign the \(n\) vectors to \(\sqrt{c}\) disjoint subsets based on those centroids.</p>
<p>Assuming the subsets are of uniform size, each one holds \(f\sqrt{c}\) vectors and is clustered into \(\sqrt{c}\) centroids, which costs \(O(f\sqrt{c} \cdot \sqrt{c} \cdot dl) = O(fcdl)\) per subset; across all \(\sqrt{c}\) subsets, the total is \(O(fc^{1.5}dl)\). Compared with \(O(fc^2dl)\), this is a \(\sqrt{c}\)-fold speedup: with <code>f = 64</code> and <code>c = 160,000</code>, the algorithm is roughly 400 times faster.</p>
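<p>To make the two-stage scheme concrete, here is a minimal sketch in Python, using NumPy and scikit-learn's <code>KMeans</code> as stand-ins for VectorChord's internal Rust implementation; the function and variable names are ours, not the extension's:</p>
<pre><code class="lang-python">import numpy as np
from sklearn.cluster import KMeans

def hierarchical_kmeans(samples: np.ndarray, c: int) -&gt; np.ndarray:
    """Cluster `samples` into roughly `c` centroids via sqrt(c) disjoint subsets."""
    m = max(1, int(c ** 0.5))  # number of subsets
    # Stage 1: a small-scale K-means partitions the samples into m subsets.
    coarse = KMeans(n_clusters=m).fit(samples)
    centroids = []
    for i in range(m):
        subset = samples[coarse.labels_ == i]
        # Allocate centroids proportionally to subset size; this naive rounding
        # is refined by the Sainte-Laguë method discussed below.
        k_i = min(len(subset), max(1, round(len(subset) / len(samples) * c)))
        # Stage 2: cluster each subset independently, then merge the results.
        centroids.append(KMeans(n_clusters=k_i).fit(subset).cluster_centers_)
    return np.vstack(centroids)
</code></pre>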
<p>There is still a small problem here: how many centroids should be computed for each subset \(s\)? Ignoring the constraint that it must be an integer, the answer is \(\frac{|s|}{n}c\). With the integer constraint, the problem resembles proportional representation in voting, where the <a target="_blank" href="https://en.wikipedia.org/wiki/Sainte-Lagu%C3%AB_method">Sainte-Laguë method</a> is an algorithm that minimizes the average deviation of the seats-to-votes ratio. It works as follows.</p>
<blockquote>
<p>After all the votes have been tallied, successive quotients are calculated for each party. The formula for the quotient is \(\frac{V}{s_i+0.5}\), where \(V\) is the total number of votes that party received, and \(s_i\) is the number of seats that have been allocated so far to that party, initially \(0\) for all parties.</p>
</blockquote>
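<p>In our setting, "votes" correspond to subset sizes and "seats" to centroid counts. Below is a minimal heap-based sketch of the allocation (our own illustration, not VectorChord's exact code):</p>
<pre><code class="lang-python">import heapq

def allocate_centroids(subset_sizes: list[int], c: int) -&gt; list[int]:
    """Distribute c centroids over subsets using the Sainte-Laguë method."""
    seats = [0] * len(subset_sizes)
    # heapq is a min-heap, so we store negated quotients V / (s + 0.5).
    heap = [(-size / 0.5, i) for i, size in enumerate(subset_sizes)]
    heapq.heapify(heap)
    for _ in range(c):
        _, i = heapq.heappop(heap)
        seats[i] += 1
        heapq.heappush(heap, (-subset_sizes[i] / (seats[i] + 0.5), i))
    return seats
</code></pre>
<p>For example, <code>allocate_centroids([900, 90, 10], 100)</code> returns <code>[90, 9, 1]</code>, matching the exact proportional shares.</p>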
<p>Clustering on the CPU is now practical. However, this algorithm does not reduce memory usage.</p>
<h3 id="heading-dimensionality-reduction">Dimensionality Reduction</h3>
<p>It’s time to revisit the 140 GB of memory used for the K-means samples, which would certainly cause an OOM on a machine with 128 GB of memory. Given the space complexity \(O(fcd)\), we have two ways to reduce memory usage: reduce \(f\) or reduce \(d\).</p>
<p>Let's reduce \(d\) first. Although it sounds surprising, we can reduce the dimension of the vectors and then perform clustering without compromising accuracy. <a target="_blank" href="https://arxiv.org/abs/1011.4632">The results of Boutsidis et al.</a> show that running K-means on low-dimensional projections can still maintain good accuracy.</p>
<blockquote>
<p><a target="_blank" href="https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma">Johnson–Lindenstrauss lemma</a> states that a set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved. In the classical proof of the lemma, the embedding is a random orthogonal projection.</p>
</blockquote>
<p>According to the lemma, \(n\) vectors can be reduced to \(O(\lg n)\) dimensions. Concretely, we only need to construct a random Gaussian matrix, which reduces high-dimensional vectors to low-dimensional ones via matrix multiplication; we then run K-means on the projected vectors.</p>
<p>Since we need to reduce memory usage, we apply the Johnson–Lindenstrauss transform directly during sampling, which yields low-dimensional centroids in the end. We do not attempt an inverse transform; instead, we sample from the table a second time, assign each sample to its nearest cluster in the low-dimensional space after applying the same transform, and use those assignments to recover the high-dimensional centroids.</p>
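<p>A condensed sketch of the idea (our own illustration; the reduced dimension of 100 matches the setting below, and for brevity it reuses the in-memory samples for the recovery step, whereas the extension samples from the table again):</p>
<pre><code class="lang-python">import numpy as np
from sklearn.cluster import KMeans

def cluster_with_jl(samples: np.ndarray, c: int, reduced_dim: int = 100,
                    seed: int = 0) -&gt; np.ndarray:
    rng = np.random.default_rng(seed)
    # Random Gaussian matrix: the Johnson-Lindenstrauss transform.
    proj = rng.normal(size=(samples.shape[1], reduced_dim)) / np.sqrt(reduced_dim)
    low = samples @ proj                # project during sampling
    km = KMeans(n_clusters=c).fit(low)  # cluster in the low-dimensional space
    # Recover high-dimensional centroids: assign each vector to its nearest
    # low-dimensional centroid, then average the assigned vectors in the
    # original space.
    return np.vstack([samples[km.labels_ == i].mean(axis=0) for i in range(c)])
</code></pre>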
<p>With dimensionality reduction from 768 to 100, the resident set size of the instance dropped to 23 GB, allowing us to build the index on an i4i.xlarge instance. In theory, it also makes clustering roughly 7 times faster. With hierarchical K-means and dimensionality reduction, the initialization phase fell to 24 minutes.</p>
<p>Reducing \(f\) is trivial: it is exposed as the <code>build.internal.sampling_factor</code> option, so we only need to change the configuration. Setting \(f\) to \(64\) dropped the resident set size of the instance to 6 GB and made clustering roughly 2 times faster.</p>
<h3 id="heading-sampling">Sampling</h3>
<p>To perform clustering, we need to sample vectors from the table. Our previous approach, <a target="_blank" href="https://en.wikipedia.org/wiki/Reservoir_sampling">reservoir sampling</a>, was reliable but slow. We used it because the number of rows in a table is unknown without a full table scan; the downside is that the sampling itself still performs a full table scan.</p>
<p>To avoid a full table scan, we take advantage of the sampling interface of PostgreSQL's table access method. The interface takes a function that, given the maximum block number, produces an iterator of block numbers, and returns an iterator over the tuples in those blocks. To generate such a random iterator, we could materialize an ordered sequence and apply a <a target="_blank" href="https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle">Fisher–Yates shuffle</a> to it, but that consumes memory proportional to the number of blocks. There is a cleverer approach: in cryptography, a <a target="_blank" href="https://en.wikipedia.org/wiki/Pseudorandom_permutation">pseudorandom permutation</a> is a function that cannot be distinguished from a random permutation.</p>
<p>A <a target="_blank" href="https://en.wikipedia.org/wiki/Feistel_cipher">Feistel network</a> can be used as a pseudorandom permutation. It is defined by \(L_{i + 1} = R_{i},\ R_{i + 1} = L_{i} \oplus F(R_i, K_i)\), where \(F\) is a hash function and \(K_i\) is a round key derived from a random seed. The input is \((L_0, R_0)\) and the output is \((L_n, R_n)\), so it is a function from \([0, 2^n) \times [0, 2^n)\) to \([0, 2^n) \times [0, 2^n)\). Cleverly, because \(R_{i} = L_{i + 1}\) and \(L_{i} = R_{i + 1} \oplus F(L_{i + 1}, K_i)\), the function is reversible, and a reversible function is bijective. Since \([0, 2^n) \times [0, 2^n)\) is equivalent to \([0, 4^n)\), the function is a permutation of \([0, 4^n)\). By filtering out all elements greater than the maximum block number from this permutation, we obtain the lazy random permutation we need.</p>
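<p>A minimal sketch of this trick in Python (our own illustration; VectorChord's actual implementation is in Rust, and its round function and key schedule differ):</p>
<pre><code class="lang-python">import hashlib

def _round_fn(r: int, key: int, bits: int) -&gt; int:
    # Any keyed hash works as the round function F.
    digest = hashlib.blake2b(f"{r}:{key}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % (1 &lt;&lt; bits)

def feistel(x: int, bits: int, keys: list[int]) -&gt; int:
    """A bijection on [0, 2**(2*bits)) built from a Feistel network."""
    left, right = x &gt;&gt; bits, x &amp; ((1 &lt;&lt; bits) - 1)
    for key in keys:
        left, right = right, left ^ _round_fn(right, key, bits)
    return (left &lt;&lt; bits) | right

def shuffled_blocks(max_block: int, keys=(11, 22, 33, 44)):
    """Lazily yield block numbers 0..max_block in pseudorandom order by
    filtering the Feistel permutation of the enclosing power-of-4 domain."""
    bits = max(1, (max_block.bit_length() + 1) // 2)
    for x in range(1 &lt;&lt; (2 * bits)):
        y = feistel(x, bits, list(keys))
        if y &lt;= max_block:
            yield y
</code></pre>
<p>Calling <code>list(shuffled_blocks(9))</code> visits each of the ten block numbers exactly once, in an order determined by the keys, while keeping only constant state.</p>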
<p>Based on this interface and this permutation, we implemented block sampling, which needs to access only the sampled blocks rather than the whole table.</p>
<p>With all these optimizations, the initialization phase takes only 8 minutes in total now.</p>
<h2 id="heading-4-reducing-contention">4. Reducing Contention</h2>
<p>In earlier experiments, building the index for the <code>LAION-100M</code> dataset on an Amazon i7i.16xlarge (64 vCPU) instance took approximately 420 minutes in the insertion phase with \(n=2\), and the process was entirely computation-bound.</p>
<p>Starting with version 0.1, VectorChord allows \(n\) to be set to any positive integer no greater than \(8\). From our perspective, this is necessary for billion-scale data. At the time, however, we didn't actually know how much faster it would be.</p>
<p>After trying \(n=3\) on a smaller instance, an i7i.4xlarge (16 vCPU), we observed that the insertion phase completed in just 40–60 minutes. CPU utilization stayed around 40%, and IO throughput fluctuated between 300 MB/s and 800 MB/s, suggesting considerable room for optimization.</p>
<h3 id="heading-reducing-linked-list-contention">Reducing Linked-List Contention</h3>
<p>The insertion phase took 40–60 minutes, and the range itself was suspicious: our tests showed that 8 workers took 40 minutes, while 16 workers took 55 minutes. More workers made it slower, which pointed to contention among workers during insertion.</p>
<p>In the implementation, the index maintains a single linked list to store full-precision vectors aside from the tree, while the tree only stores quantized vectors. This makes the tree nodes much smaller and allows the tree to fit in memory.</p>
<p>After changing \(n\) from \(2\) to \(3\), the amount of computation in the insertion phase decreased, so insertions into this linked list occur much more frequently. Parallel workers contend when appending to the list, so adding workers can actually slow down insertion and makes performance unpredictable.</p>
<p>To address this, we replaced the single linked list with \(1+k\) linked lists. The first linked list stores full-precision vectors for the top \(n\) levels of the tree, while the other \(k\) lists store vectors for the bottom level. During index build, the \(i\)-th worker inserts vectors into the \((i\ \text{mod}\ k)\)-th list. We set \(k=32\) as the default, and consider it sufficient for most cases.</p>
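<p>The scheme in miniature (a toy Python illustration of the striping idea, not VectorChord's Rust internals):</p>
<pre><code class="lang-python">import threading

K = 32  # number of bottom-level linked lists (the default)
stripes = [[] for _ in range(K)]
locks = [threading.Lock() for _ in range(K)]

def insert_bottom_level(worker_id: int, vector) -&gt; None:
    # Worker i always appends to stripe (i mod K), so two workers contend
    # only if their ids collide modulo K.
    i = worker_id % K
    with locks[i]:
        stripes[i].append(vector)
</code></pre>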
<p>With this change, CPU utilization stabilizes around 54%, and the insertion phase now completes in about 30 minutes.</p>
<h3 id="heading-reducing-page-extension-lock-contention">Reducing Page Extension Lock Contention</h3>
<p>The CPU utilization still suggested that further optimization was possible. But where exactly was the bottleneck? We started our investigation by checking the PostgreSQL worker processes with <code>htop</code>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764296302226/4ac9364b-09d9-4a13-81b7-dc2252b800b6.png" alt class="image--center mx-auto" /></p>
<p>Many processes showed <code>waiting</code> on their titles, indicating heavy internal contention inside PostgreSQL. Searching in the code, we traced the source that sets <code>waiting</code> to <a target="_blank" href="https://github.com/postgres/postgres/blob/REL_18_0/src/backend/storage/lmgr/lock.c#L1943">lock.c</a>. To measure off-CPU time, we turned to offcputime from <a target="_blank" href="https://github.com/iovisor/bcc">BCC</a>. Then, with <code>stackcollapse.pl</code> and <code>flamegraph.pl</code> from Brendan Gregg's <a target="_blank" href="https://github.com/brendangregg/FlameGraph">FlameGraph</a>, we generated a flame graph for the process's off-CPU time.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764294200682/6c64c880-2083-4c73-bf1e-65a10b61c77e.png" alt class="image--center mx-auto" /></p>
<p>The result was surprising: the culprit was <code>LockRelationForExtension</code>, which acquires the lock on the index in order to extend it.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">Here, the <code>ockRelationForExtension</code> should be <code>LockRelationForExtension</code>. This may be result from an unknown behavior from the <code>flamegraph.pl</code> script.</div>
</div>

<p>Why does acquiring this lock become a bottleneck? Searching through the PostgreSQL mailing list led us to this <a target="_blank" href="https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de">discussion</a>.</p>
<p>In short, PostgreSQL places a lock on each relation to prevent it from being extended concurrently, but the granularity of this lock is too coarse. Thanks to Andres Freund, a patch landed that narrows the critical section and fixes the issue; however, it relies on a new API available only since PostgreSQL 16.</p>
<p>Since VectorChord supports PostgreSQL 13 through 18, we had built on the old API early in development, which meant we overlooked this optimization.</p>
<p>After switching to the new API, the insertion phase dropped to 22 minutes.</p>
<h3 id="heading-bulk-page-extensions">Bulk Page Extensions</h3>
<p>However, another round of profiling revealed that the bottleneck remained in the same area.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764294272317/208f1bcb-ee51-46a0-98b4-9e18257e48a2.png" alt class="image--center mx-auto" /></p>
<p>With the critical section of the lock already narrowed, we needed to speed up the page extension itself to ease the bottleneck. Extending a file with <code>fallocate</code> is fast on most filesystems, so if <code>fallocate</code> were used to extend the index, the average time to extend a page would drop. The question becomes: can we use <code>fallocate</code> to extend the index?</p>
<p>The answer is yes. When an index is extended by more than 8 pages at a time, PostgreSQL automatically switches from <code>pwrite</code> to <code>fallocate</code>. By requesting 16 pages at once, we significantly increased the speed of page extension.</p>
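<p>The principle in a toy Python form (our own illustration using <code>os.posix_fallocate</code>, available on Unix; the 8 KB page size and 16-page batch mirror the numbers above, and none of this is PostgreSQL's actual code):</p>
<pre><code class="lang-python">import os

PAGE = 8192   # PostgreSQL's default page size in bytes
BATCH = 16    # pages requested per extension

def extend_by_pages(fd: int, current_pages: int, n: int = BATCH) -&gt; int:
    """Grow a file by n pages with a single allocation call instead of n
    writes, shrinking the per-page time spent under the extension lock."""
    os.posix_fallocate(fd, current_pages * PAGE, n * PAGE)
    return current_pages + n
</code></pre>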
<p>With this change, the insertion phase dropped to 9 minutes, CPU utilization stabilized at 90%, and write throughput stayed around 1.8 GB/s. <code>iostat</code> reported IO utilization of 0.75–0.85, indicating we were finally making better use of the resources.</p>
<p>There is still room for improvement. But for now, there is no longer any trivial bottleneck in the insertion phase.</p>
<h2 id="heading-5-parallelize-compaction">5. Parallelize Compaction</h2>
<p>On the bottom level of the tree, quantized vectors exist in two layouts:</p>
<ul>
<li><p>Non-compact layout (insert-oriented): Every quantized vector is stored as a tuple, so data can be appended directly without modifying existing tuples.</p>
</li>
<li><p>Compact layout (search-oriented): Every 32 quantized vectors are stored together in a single tuple. This layout is optimized for SIMD and makes search fast.</p>
</li>
</ul>
<p>All vectors are initially inserted in the non-compact layout and are converted to the compact layout in the final phase. It is worth noting that this phase was serial and took about 8 minutes during index build.</p>
<p>Since the other phases have become much faster, optimizing this phase matters more, so we parallelized it. With \(k\) workers and \(m\) nodes at level \(n\) of the tree, the children of the \(i\)-th node are compacted by the \((i\ \text{mod}\ k)\)-th worker. Thanks to this parallelism, the compaction phase now takes less than 1 minute.</p>
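<p>Schematically, the work assignment and the packing step look like this (our own Python illustration):</p>
<pre><code class="lang-python">def compaction_plan(m: int, k: int) -&gt; dict[int, list[int]]:
    """Assign the children of the i-th level-n node to worker (i mod k)."""
    plan: dict[int, list[int]] = {w: [] for w in range(k)}
    for i in range(m):
        plan[i % k].append(i)
    return plan

def pack_compact(vectors: list, group: int = 32) -&gt; list[list]:
    """Regroup quantized vectors into tuples of 32 for SIMD-friendly scans."""
    return [vectors[i:i + group] for i in range(0, len(vectors), group)]
</code></pre>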
<p>An effective index also needs this compaction from time to time to maintain search performance, and PostgreSQL's vacuum mechanism serves that purpose: the same compaction runs routinely for the indexes during vacuum. Unfortunately, we cannot parallelize it there, because PostgreSQL does not allow nested parallelism: if the vacuum itself is parallel, the index cannot start parallel workers of its own.</p>
<h2 id="heading-6-conclusion">6. Conclusion</h2>
<p>Previously, indexing the <code>LAION-100M</code> dataset with VectorChord <code>0.5.3</code> on an Amazon i7i.4xlarge instance was infeasible due to out-of-memory (OOM) failures. <a target="_blank" href="https://docs.vectorchord.ai/vectorchord/usage/external-index-precomputation.html">Offloading clustering to a GPU</a> made the build possible, yielding a recall of <strong>95.6%</strong> at <strong>120</strong> QPS for querying the top 10 results, with a build time of <strong>30 minutes</strong> on the GPU and <strong>420 minutes</strong> on the i7i.4xlarge.</p>
<p>With the optimizations introduced in VectorChord <code>1.0.0</code>, the index can now be built entirely on the i7i.4xlarge instance in only <strong>18 minutes</strong>, achieving a recall of <strong>94.9%</strong> under the same QPS setting.</p>
<pre><code class="lang-pgsql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">INDEX</span> <span class="hljs-keyword">ON</span> laion <span class="hljs-keyword">USING</span> vchordrq (embedding vector_ip_ops) <span class="hljs-keyword">WITH</span> (<span class="hljs-keyword">options</span> = $$<span class="pgsql">
build.pin = <span class="hljs-number">2</span>
[build.internal]
lists = [<span class="hljs-number">400</span>, <span class="hljs-number">160000</span>]
build_threads = <span class="hljs-number">16</span>
spherical_centroids = <span class="hljs-keyword">true</span>
kmeans_algorithm.hierarchical = {}
kmeans_dimension = <span class="hljs-number">100</span>
sampling_factor = <span class="hljs-number">64</span>
$$</span>);
</code></pre>
<p>Our goal is to make VectorChord one of the best ways to do retrieval on PostgreSQL, from the first prototype to billion‑scale datasets. If you’re already using pgvector, we’d love you to try VectorChord 1.0 on your real workloads and tell us where it helps and where it can do better.</p>
]]></content:encoded></item><item><title><![CDATA[VectorChord 0.5: New RaBitQ-empowered DiskANN Index and Continuous Recall Measurement]]></title><description><![CDATA[We're thrilled to announce the release of VectorChord 0.5! This release marks a significant step forward in our mission to provide more flexible, powerful, and controllable vector search capabilities. In the 0.5 release, we're introducing two major u...]]></description><link>https://blog.vectorchord.ai/vectorchord-05-new-rabitq-empowered-diskann-index-and-continuous-recall-measurement</link><guid isPermaLink="true">https://blog.vectorchord.ai/vectorchord-05-new-rabitq-empowered-diskann-index-and-continuous-recall-measurement</guid><category><![CDATA[PostgreSQL]]></category><category><![CDATA[vector database]]></category><category><![CDATA[VectorSearch]]></category><category><![CDATA[llm]]></category><category><![CDATA[RAG ]]></category><category><![CDATA[Rust]]></category><dc:creator><![CDATA[xieydd]]></dc:creator><pubDate>Fri, 26 Sep 2025 10:08:15 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1758879267587/b742b0af-a68e-4a14-b20a-435d6fa09f74.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We're thrilled to announce the release of <strong>VectorChord 0.5</strong>! This release marks a significant step forward in our mission to provide more flexible, powerful, and controllable vector search capabilities. In the 0.5 release, we're introducing two major updates: <strong>experimental support for the DiskANN graph index</strong> and a new <strong>recall measurement tool to monitor the health of your IVF+RaBitQ indexes proactively</strong>.</p>
<h2 id="heading-new-index-preview-rabitq-empowered-diskann-index">New Index Preview: RaBitQ-empowered DiskANN Index</h2>
<p>From day one, VectorChord's core indexing solution has relied on a powerful combination: the cluster-based IVF algorithm and RaBitQ quantization—what we ship as <strong>vchordrq (IVF+RaBitQ)</strong>. This approach consistently delivers excellent low-latency, high-recall performance across a wide range of use cases and remains <strong>our recommended default</strong> for most scenarios.</p>
<p>At the same time, we recognize that <strong>graph-based indexes</strong> can outperform IVF+RaBitQ <strong>on certain datasets and at specific (often higher) recall targets</strong>. To give you the best tool for every job—and because our goal is to be a <strong>one‑stop solution for vector search</strong>—VectorChord 0.5 introduces <strong>experimental support for the DiskANN algorithm</strong> as an additional option.</p>
<h3 id="heading-what-is-diskann-and-why-does-it-matter">What is DiskANN and Why Does It Matter?</h3>
<p>Traditional top-tier graph algorithms (like HNSW and NSG) achieve their speed by keeping the entire graph in memory, which drives up hardware costs and caps single-node scale. SSDs are cheaper, but historically their random I/O made on-disk graphs impractical without big latency penalties.</p>
<p><strong>DiskANN's mission</strong> is to break this memory barrier. It’s a graph ANN designed from the ground up to store and search billion-scale datasets on inexpensive SSDs while maintaining competitive recall and low latency.</p>
<p><img src="https://pic3.zhimg.com/80/v2-2d4c5810d8074f4faac3ed9f4ae7ad1e_720w.webp" alt="diskann" class="image--center mx-auto" /></p>
<blockquote>
<p>A visualization from the original DiskANN paper.</p>
</blockquote>
<h4 id="heading-how-our-variant-differs">How our variant differs</h4>
<p>We implement a <strong>variant of the original DiskANN design</strong>: instead of the paper’s default Product Quantization (PQ), our implementation uses <strong>RaBitQ</strong> as the quantization layer. <strong>RaBitQ is more accurate and comes with theoretical guarantees</strong>, which helps us deliver <strong>better recall–latency trade‑offs</strong> compared with vanilla PQ used in original implementations. In practice, this change enables stronger accuracy at the same footprint or faster queries at the same recall.</p>
<p>In VectorChord 0.5, we've integrated DiskANN into the existing ecosystem so you can evaluate it side-by-side with vchordrq.</p>
<p><strong>When DiskANN Can Shine</strong></p>
<ul>
<li><p><strong>Workload + target dependent gains.</strong> For some model families (e.g., OpenAI/Cohere‑style embeddings) and <strong>high-recall settings</strong>, DiskANN can deliver <strong>lower latency and/or higher recall</strong> than IVF+RaBitQ on the same hardware.</p>
</li>
<li><p><strong>Stable and predictable build memory.</strong> It avoids the K-means phase of IVF, reducing memory fluctuations during index build and simplifying capacity planning.</p>
</li>
<li><p><strong>Ecosystem compatibility.</strong> Keep using <strong>VBase</strong> for metadata filtering and <strong>RaBitQ quantization</strong> for smaller footprint and faster re-ranking.</p>
</li>
</ul>
<h3 id="heading-important-considerations-read-this-first">Important Considerations (Read This First)</h3>
<p>DiskANN is still <strong>experimental</strong> in VectorChord, and—like all graph ANN methods—it entails trade‑offs:</p>
<ul>
<li><p><strong>Updates &amp; build speed.</strong> <strong>Index builds and updates are much slower than with vchordrq (IVF+RaBitQ).</strong> If you have frequent inserts/deletes or need fast rebuilds, vchordrq is the better fit.</p>
</li>
<li><p><strong>Dynamic data performance.</strong> Modifying graph connectivity is computationally heavier than updating IVF lists; expect better operational ergonomics with vchordrq for mutable datasets.</p>
</li>
<li><p><strong>Operational simplicity.</strong> vchordrq tends to be simpler to tune and operate for most teams.</p>
</li>
</ul>
<blockquote>
<p><strong>Our recommendation:</strong> Start with <strong>vchordrq (IVF+RaBitQ)</strong> for the majority of use cases. Consider <strong>DiskANN</strong> when you’re chasing <strong>top‑tier recall on relatively static corpora</strong>, want <strong>tighter tail latency at high recall</strong>, or when <strong>build‑time memory predictability</strong> is a hard requirement. Your feedback will help us harden this option for production.</p>
</blockquote>
<h3 id="heading-how-to-get-started"><strong>How to Get Started</strong></h3>
<pre><code class="lang-bash">docker run \
  --name vectorchord-demo \
  -e POSTGRES_PASSWORD=mysecretpassword \
  -p 5432:5432 \
  -d tensorchord/vchord-postgres:pg17-v0.5.2
</code></pre>
<pre><code class="lang-pgsql"><span class="hljs-comment">-- 1. Enable the VectorChord extension</span>
postgres=# <span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">EXTENSION</span> <span class="hljs-keyword">IF</span> <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">EXISTS</span> vchord <span class="hljs-keyword">CASCADE</span>;
<span class="hljs-keyword">NOTICE</span>:  installing required <span class="hljs-keyword">extension</span> "vector"
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">EXTENSION</span>
postgres=# \dx
                                                 List <span class="hljs-keyword">of</span> installed extensions
  <span class="hljs-type">Name</span>   | <span class="hljs-keyword">Version</span> |   <span class="hljs-keyword">Schema</span>   |                                         Description
<span class="hljs-comment">---------+---------+------------+---------------------------------------------------------------------------------------------</span>
 plpgsql | <span class="hljs-number">1.0</span>     | pg_catalog | PL/pgSQL <span class="hljs-keyword">procedural</span> <span class="hljs-keyword">language</span>
 vchord  | <span class="hljs-number">0.5</span><span class="hljs-number">.0</span>   | <span class="hljs-built_in">public</span>     | vchord: Vector <span class="hljs-keyword">database</span> plugin <span class="hljs-keyword">for</span> Postgres, written <span class="hljs-keyword">in</span> Rust, specifically designed <span class="hljs-keyword">for</span> LLM
 vector  | <span class="hljs-number">0.8</span><span class="hljs-number">.0</span>   | <span class="hljs-built_in">public</span>     | vector data <span class="hljs-keyword">type</span> <span class="hljs-keyword">and</span> ivfflat <span class="hljs-keyword">and</span> hnsw <span class="hljs-keyword">access</span> methods
(<span class="hljs-number">3</span> <span class="hljs-keyword">rows</span>)

<span class="hljs-comment">-- 2. Create a table with a vector column</span>
postgres=# <span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> items (id <span class="hljs-type">bigserial</span> <span class="hljs-keyword">PRIMARY KEY</span>, embedding vector(<span class="hljs-number">3</span>));
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span>
<span class="hljs-comment">-- 3. Insert some sample data</span>
postgres=# <span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> items (embedding) <span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">ARRAY</span>[random(), random(), random()]::<span class="hljs-type">real</span>[] <span class="hljs-keyword">FROM</span> generate_series(<span class="hljs-number">1</span>, <span class="hljs-number">1000</span>);
<span class="hljs-keyword">INSERT</span> <span class="hljs-number">0</span> <span class="hljs-number">1000</span>
<span class="hljs-comment">-- 4. Build the DiskANN graph index using the 'vchordg' method</span>
postgres=# <span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">INDEX</span> <span class="hljs-keyword">ON</span> items <span class="hljs-keyword">USING</span> vchordg (embedding vector_l2_ops);
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">INDEX</span>
<span class="hljs-comment">-- 5. Run an approximate nearest neighbor search!</span>
postgres=# <span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> items <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> embedding &lt;-&gt; <span class="hljs-string">'[3,1,2]'</span> <span class="hljs-keyword">LIMIT</span> <span class="hljs-number">5</span>;
 id  |             embedding
<span class="hljs-comment">-----+-----------------------------------</span>
 <span class="hljs-number">370</span> | [<span class="hljs-number">0.9644321</span>,<span class="hljs-number">0.7480163</span>,<span class="hljs-number">0.95816356</span>]
 <span class="hljs-number">186</span> | [<span class="hljs-number">0.9583021</span>,<span class="hljs-number">0.6364455</span>,<span class="hljs-number">0.9882274</span>]
 <span class="hljs-number">303</span> | [<span class="hljs-number">0.9735422</span>,<span class="hljs-number">0.9101731</span>,<span class="hljs-number">0.92470014</span>]
 <span class="hljs-number">793</span> | [<span class="hljs-number">0.93719786</span>,<span class="hljs-number">0.8919436</span>,<span class="hljs-number">0.9729206</span>]
 <span class="hljs-number">432</span> | [<span class="hljs-number">0.97542423</span>,<span class="hljs-number">0.62692064</span>,<span class="hljs-number">0.9525037</span>]
(<span class="hljs-number">5</span> <span class="hljs-keyword">rows</span>)
</code></pre>
<p>Check out our official documentation for detailed instructions on configuring and using the new graph index: <a target="_blank" href="https://docs.vectorchord.ai/vectorchord/usage/graph-index.html"><strong>Guide to Using Graph Index in VectorChord</strong></a></p>
<h2 id="heading-proactive-monitoring-recall-measurement-tool-for-ivfrabitq">Proactive Monitoring: Recall Measurement Tool for IVF+RaBitQ</h2>
<p>A potential challenge with vector search indexes is "<strong>data distribution drift</strong>". As you continuously add new data—especially when the new data's vector distribution differs significantly from the initial dataset—the index's recall can degrade over time, impacting query quality.</p>
<p>Previously, quantifying this degradation was difficult, often relying on intuition or complex offline evaluations to decide when an index rebuild was necessary.</p>
<p>To empower you to proactively and easily monitor your index's health, version 0.5 introduces a powerful new function: <code>vchordrq_evaluate_query_recall</code>.</p>
<p>This function allows you to use a small, representative set of query vectors and their ground truth nearest neighbors to quickly and accurately assess the actual recall of your current IVF+RaBitQ index.</p>
<p><strong>What It Does For You:</strong></p>
<ul>
<li><p><strong>Quantify Performance</strong>: Turn the "feel" of your index's quality into a hard, measurable number.</p>
</li>
<li><p><strong>Inform Decisions</strong>: By running evaluations periodically, you can clearly visualize the recall trend and make data-driven decisions on when to rebuild your index for optimal performance.</p>
</li>
<li><p><strong>Enhance Service Quality</strong>: Ensure your live vector search service consistently meets its quality SLAs and prevent business impact from silent index degradation.</p>
</li>
</ul>
<h3 id="heading-how-to-use-it">How to Use It</h3>
<p>Here's how to put it into practice:</p>
<pre><code class="lang-pgsql"><span class="hljs-comment">-- Select ctid from your vector table with the target query vector to evaluate</span>
postgres=# <span class="hljs-keyword">SELECT</span> vchordrq_evaluate_query_recall(query =&gt; $$<span class="pgsql">
  <span class="hljs-keyword">SELECT</span> ctid <span class="hljs-keyword">FROM</span> items <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> embedding &lt;-&gt; <span class="hljs-string">'[3,1,2]'</span> <span class="hljs-keyword">LIMIT</span> <span class="hljs-number">10</span>$$</span>);
 vchordrq_evaluate_query_recall
<span class="hljs-comment">--------------------------------</span>
                              <span class="hljs-number">1</span>
(<span class="hljs-number">1</span> <span class="hljs-keyword">row</span>)
</code></pre>
<p>For more details and configuration, you can refer to <a target="_blank" href="https://docs.vectorchord.ai/vectorchord/usage/measure-recall.html"><strong>Measuring Recall for IVF+RaBitQ Indexes</strong></a></p>
<h2 id="heading-automatically-record-queries-for-recall-tracking">Automatically Record Queries for Recall Tracking</h2>
<p>Keeping a high-quality vector search service isn't just about measuring recall once—it’s about measuring it continually on <strong>real traffic</strong>. Starting in <strong>v0.5.2</strong>, VectorChord can <strong>automatically record (sample) your production queries</strong> so you can evaluate recall on the exact vectors users searched for.</p>
<h3 id="heading-why-it-matters">Why it matters</h3>
<ul>
<li><p><strong>Hands‑off data collection</strong>: No custom logging pipelines or app changes.</p>
</li>
<li><p><strong>Representative signals</strong>: Evaluate recall on the same distributions your system sees in production.</p>
</li>
<li><p><strong>Privacy‑aware &amp; lightweight</strong>: You control when sampling is enabled and how much to collect.</p>
</li>
</ul>
<h3 id="heading-enable-query-sampling">Enable query sampling</h3>
<pre><code class="lang-sql"><span class="hljs-comment">-- Turn on sampling and set sensible limits</span>
<span class="hljs-keyword">SET</span> vchordrq.query_sampling_enable = <span class="hljs-keyword">on</span>;
<span class="hljs-keyword">SET</span> vchordrq.query_sampling_max_records = <span class="hljs-number">1000</span>;  <span class="hljs-comment">-- cap the total stored samples</span>
<span class="hljs-keyword">SET</span> vchordrq.query_sampling_rate = <span class="hljs-number">0.01</span>;         <span class="hljs-comment">-- sample every 100 query (adjust as needed)</span>
</code></pre>
<p>After enabling, run your normal vector searches (e.g., <code>SELECT * FROM items ORDER BY embedding &lt;-&gt; '[3,1,2]' LIMIT 10;</code>). VectorChord will capture the essential components of each sampled query: <strong>schema, index, table, column, operator, and vector value</strong>.</p>
<h3 id="heading-inspect-what-was-recorded">Inspect what was recorded</h3>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> vchordrq_sampled_queries(<span class="hljs-string">'items_embedding_idx'</span>);
<span class="hljs-comment">-- schema_name | index_name          | table_name | column_name | operator |    value</span>
<span class="hljs-comment">-- ------------+---------------------+------------+-------------+----------+--------------</span>
<span class="hljs-comment">-- public      | items_embedding_idx | items      | embedding   | &lt;-&gt;      | [0.5,0.25,1]</span>
</code></pre>
<h3 id="heading-evaluate-recall-on-recorded-queries">Evaluate recall on recorded queries</h3>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">AVG</span>(recall_value) <span class="hljs-keyword">FROM</span> (
    <span class="hljs-keyword">SELECT</span> vchordrq_evaluate_query_recall(
            <span class="hljs-keyword">format</span>(
                <span class="hljs-string">'SELECT ctid FROM %I.%I ORDER BY %I OPERATOR(%s) %L LIMIT 10'</span>,
                lq.schema_name,
                lq.table_name,
                lq.column_name,
                lq.operator,
                lq.value
            )
    ) <span class="hljs-keyword">AS</span> recall_value
    <span class="hljs-keyword">FROM</span> vchordrq_sampled_queries(<span class="hljs-string">'items_embedding_idx'</span>) <span class="hljs-keyword">AS</span> lq
) <span class="hljs-keyword">AS</span> eval_results;
</code></pre>
<blockquote>
<p><strong>Note:</strong> If you run PostgreSQL <strong>replication</strong> with primary/standby servers, sampled queries are <strong>not</strong> synchronized via replication or backup. Each server records its <strong>own</strong> queries based on the traffic it serves.</p>
</blockquote>
<h2 id="heading-summary">Summary</h2>
<p>VectorChord 0.5 brings unprecedented flexibility to our users. The experimental DiskANN support offers a new high-performance option for specific workloads, while the recall measurement tool gives you fine-grained monitoring of the operational health of your existing IVF indexes.</p>
<p>We believe that empowering users with more choices and greater control is the driving force behind VectorChord's evolution. We sincerely invite you to try out the new version and share your feedback, especially on how DiskANN performs in your real-world scenarios.</p>
<ul>
<li><p><strong>Download/Star on GitHub:</strong> https://github.com/tensorchord/VectorChord</p>
</li>
<li><p><strong>Join the Community:</strong> https://discord.gg/KqswhpVgdU</p>
</li>
</ul>
<p>Ready to take it for a spin? Head over to our GitHub repository, download the latest release, and see what VectorChord 0.5 can do for you.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://github.com/tensorchord/VectorChord/">https://github.com/tensorchord/VectorChord/</a></div>
]]></content:encoded></item><item><title><![CDATA[Bringing Search‑Engine Ranking to PostgreSQL with VectorChord‑BM25]]></title><description><![CDATA[Modern applications rely on PostgreSQL for its fully ACID‑compliant, expressive SQL, and rich ecosystem of extensions. The database handles relational workloads exceptionally well, but many projects also need to search for large text collections—prod...]]></description><link>https://blog.vectorchord.ai/bringing-searchengine-ranking-to-postgresql-with-vectorchordbm25</link><guid isPermaLink="true">https://blog.vectorchord.ai/bringing-searchengine-ranking-to-postgresql-with-vectorchordbm25</guid><category><![CDATA[PostgreSQL]]></category><category><![CDATA[search]]></category><category><![CDATA[vector database]]></category><category><![CDATA[RAG ]]></category><dc:creator><![CDATA[Zoie Wang]]></dc:creator><pubDate>Tue, 12 Aug 2025 16:02:47 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1755014528217/0d9a5e5c-f525-442d-bfa1-569ddfec146e.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Modern applications rely on PostgreSQL for its fully ACID‑compliant, expressive SQL, and rich ecosystem of extensions. The database handles relational workloads exceptionally well, but many projects also need to search for large text collections—product descriptions, support tickets, documentation—and present the most relevant rows first. PostgreSQL’s native tools offer a foundation for this, yet their default ranking logic can leave ideal matches buried under less useful results. VectorChord‑BM25 changes that equation by introducing the BM25 relevance‑scoring algorithm directly into PostgreSQL.</p>
<h2 id="heading-what-is-postgresql">What is PostgreSQL?</h2>
<p>PostgreSQL is an open‑source, enterprise‑class relational database management system (RDBMS) renowned for reliability, data integrity, and standards compliance. It supports advanced features such as window functions, materialized views, JSONB storage, and a robust extension mechanism that allows developers to add capabilities. Learn more about PostgreSQL <a target="_blank" href="https://www.postgresql.org/">here</a>.</p>
<h2 id="heading-fulltext-search-with-tsvector">Full‑Text Search with <code>tsvector</code></h2>
<p>PostgreSQL has its own tool for basic text search called <code>tsvector</code>. Think of it as a special column that stores the important words (lexemes) from each document in a way the database can search quickly. When you add a GIN or GiST index on that column, PostgreSQL can find rows that match a keyword almost instantly. You write searches with the <code>@@</code> operator, and the full details are in the PostgreSQL manual <a target="_blank" href="https://www.postgresql.org/docs/current/datatype-textsearch.html">here</a>.</p>
<h3 id="heading-where-native-ranking-falls-short">Where native ranking falls short</h3>
<p>While the <code>ts_rank</code> function can sort matches, its scoring method is basic. Long documents that mention a term once may outrank short documents focused entirely on the query, and rare but important words carry little extra weight. In large datasets, the perceived relevance of results often suffers.</p>
<h2 id="heading-whats-new-vectorchord-adds-bm25">What’s New: VectorChord Adds BM25</h2>
<p>VectorChord‑BM25 is a lightweight PostgreSQL extension that augments the existing text‑search stack with the industry‑standard BM25 ranking formula. Instead of exporting data to an external search engine, you can keep everything inside a single database, gaining:</p>
<ul>
<li><p>Ranking that rewards distinctive terms and penalises irrelevant verbosity</p>
</li>
<li><p>In‑index scoring, reducing query latency</p>
</li>
<li><p>Seamless SQL workflow—documents remain rows, searches remain queries</p>
</li>
</ul>
<h2 id="heading-what-is-bm25">What is BM25?</h2>
<p>BM25 (Best Matching 25) is a probabilistic retrieval model widely adopted by search engines and academic literature. It evaluates how often a query term appears in a document, how rare that term is across the entire corpus, and how long each document is. The algorithm is discussed in detail on its <a target="_blank" href="https://en.wikipedia.org/wiki/Okapi_BM25#:~:text=BM25%20is%20a%20bag%2Dof,slightly%20different%20components%20and%20parameters.">Wikipedia page</a>, but the core idea is straightforward:</p>
<ul>
<li><p><strong>Term Frequency (TF)</strong> – More occurrences of a word in a document increase its relevance but with diminishing returns.</p>
</li>
<li><p><strong>Inverse Document Frequency (IDF)</strong> – Rare words are more informative than common ones.</p>
</li>
<li><p><strong>Document Length Normalisation</strong> – Shorter documents aren’t unfairly penalised when they focus on the query topic.</p>
</li>
</ul>
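<p>Putting these three factors together, the standard BM25 score of a document \(D\) for a query \(Q = \{q_1, \ldots, q_n\}\) is</p>
<p>\(\text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}\)</p>
<p>where \(f(q_i, D)\) is the term frequency of \(q_i\) in \(D\), \(|D|\) is the document length, \(\text{avgdl}\) is the average document length in the corpus, and \(k_1\) and \(b\) are tuning parameters, commonly around \(k_1 \approx 1.2\) and \(b = 0.75\) (VectorChord‑BM25's own defaults may differ).</p>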
<p>VectorChord implements BM25 through a new index type and operator (<code>&lt;&amp;&gt;</code>), letting PostgreSQL compute scores while scanning the index.</p>
<hr />
<h2 id="heading-two-practical-examples">Two Practical Examples</h2>
<p>Below are concise demonstrations that illustrate how to adopt VectorChord‑BM25 with a pre‑trained tokenizer and, for specialised domains, with a custom model.</p>
<h3 id="heading-example-1-quick-start-with-a-pretrained-model">Example 1: Quick Start with a Pre‑Trained Model</h3>
<pre><code class="lang-Bash"><span class="hljs-comment"># Spin up a pre‑configured PostgreSQL instance</span>
docker run --name vchord-suite \
  -e POSTGRES_PASSWORD=postgres \
  -p 5432:5432 \
  -d tensorchord/vchord-suite:pg17-latest
</code></pre>
<pre><code class="lang-SQL"><span class="hljs-comment">-- Inside psql</span>
<span class="hljs-keyword">CREATE</span> EXTENSION pg_tokenizer <span class="hljs-keyword">CASCADE</span>;
<span class="hljs-keyword">CREATE</span> EXTENSION vchord_bm25  <span class="hljs-keyword">CASCADE</span>;

<span class="hljs-comment">-- Register the LLMLingua‑2 tokenizer</span>
<span class="hljs-keyword">SELECT</span> create_tokenizer(<span class="hljs-string">'llm_tok'</span>, $$ <span class="hljs-keyword">model</span> = <span class="hljs-string">"llmlingua2"</span> $$);

<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> articles (
    <span class="hljs-keyword">id</span>   <span class="hljs-built_in">SERIAL</span> PRIMARY <span class="hljs-keyword">KEY</span>,
    <span class="hljs-keyword">body</span> <span class="hljs-built_in">TEXT</span>,
    emb  bm25vector
);

<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> articles(<span class="hljs-keyword">body</span>) <span class="hljs-keyword">VALUES</span>
(<span class="hljs-string">'PostgreSQL is a powerful open‑source database system.'</span>),
(<span class="hljs-string">'BM25 is a ranking function used by search engines.'</span>);

<span class="hljs-comment">-- Tokenise each row</span>
<span class="hljs-keyword">UPDATE</span> articles <span class="hljs-keyword">SET</span> emb = tokenize(<span class="hljs-keyword">body</span>, <span class="hljs-string">'llm_tok'</span>);

<span class="hljs-comment">-- Build the BM25 index</span>
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">INDEX</span> articles_emb_bm25 <span class="hljs-keyword">ON</span> articles <span class="hljs-keyword">USING</span> bm25 (emb bm25_ops);

<span class="hljs-comment">-- Query with relevance ordering</span>
<span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">id</span>,
       <span class="hljs-keyword">body</span>,
       emb &lt;&amp;&gt; to_bm25query(<span class="hljs-string">'articles_emb_bm25'</span>,
                            tokenize(<span class="hljs-string">'open source database'</span>, <span class="hljs-string">'llm_tok'</span>)) <span class="hljs-keyword">AS</span> score
<span class="hljs-keyword">FROM</span>   articles
<span class="hljs-keyword">ORDER</span>  <span class="hljs-keyword">BY</span> score         <span class="hljs-comment">-- lower (more negative) = higher relevance</span>
<span class="hljs-keyword">LIMIT</span>  <span class="hljs-number">10</span>;
</code></pre>
<p>The query returns rows already ranked by BM25, producing more intuitive results than the default <code>ts_rank</code>.</p>
<h3 id="heading-example-2-custom-model-for-domainspecific-vocabulary">Example 2: Custom Model for Domain‑Specific Vocabulary</h3>
<p>Domain‑specific text—medical notes, legal briefs, technical logs—often includes jargon absent from general models. VectorChord lets you train a custom tokenizer directly in SQL.</p>
<pre><code class="lang-SQL"><span class="hljs-comment">-- 1. Create a text analyzer with Unicode segmentation, lowercasing,</span>
<span class="hljs-comment">--    stop‑word removal, and stemming.</span>
<span class="hljs-keyword">SELECT</span> create_text_analyzer(<span class="hljs-string">'tech_analyzer'</span>, $$
pre_tokenizer = <span class="hljs-string">"unicode_segmentation"</span>
[[character_filters]]
to_lowercase = {}
[[token_filters]]
stopwords = <span class="hljs-string">"nltk_english"</span>
[[token_filters]]
stemmer = <span class="hljs-string">"english_porter2"</span>
$$);

<span class="hljs-comment">-- 2. Train a model on your own corpus and set up automatic embedding</span>
<span class="hljs-keyword">SELECT</span> create_custom_model_tokenizer_and_trigger(
    tokenizer_name     =&gt; <span class="hljs-string">'tech_tok'</span>,
    model_name         =&gt; <span class="hljs-string">'tech_model'</span>,
    text_analyzer_name =&gt; <span class="hljs-string">'tech_analyzer'</span>,
    table_name         =&gt; <span class="hljs-string">'tickets'</span>,
    source_column      =&gt; <span class="hljs-string">'issue_text'</span>,
    target_column      =&gt; <span class="hljs-string">'embedding'</span>);

<span class="hljs-comment">-- 3. Insert support tickets; embeddings are generated via trigger</span>
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> tickets(issue_text)
<span class="hljs-keyword">VALUES</span> (<span class="hljs-string">'Kubernetes pod fails with ExitCode 137 after OOM kill.'</span>),
       (<span class="hljs-string">'Network latency spikes to 250ms during peak hours.'</span>);

<span class="hljs-comment">-- 4. Build an index and query as before</span>
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">INDEX</span> tickets_emb_bm25 <span class="hljs-keyword">ON</span> tickets <span class="hljs-keyword">USING</span> bm25 (embedding bm25_ops);

<span class="hljs-keyword">SELECT</span> issue_text,
       embedding &lt;&amp;&gt; to_bm25query(<span class="hljs-string">'tickets_emb_bm25'</span>,
                                  tokenize(<span class="hljs-string">'OOM kill ExitCode 137'</span>, <span class="hljs-string">'tech_tok'</span>)) <span class="hljs-keyword">AS</span> score
<span class="hljs-keyword">FROM</span>   tickets
<span class="hljs-keyword">ORDER</span>  <span class="hljs-keyword">BY</span> score
<span class="hljs-keyword">LIMIT</span>  <span class="hljs-number">5</span>;
</code></pre>
<p>With a tailor‑made vocabulary, the database recognises abbreviations such as “OOM” and “Kubernetes,” yielding more precise rankings for technical support scenarios.</p>
<h2 id="heading-final-thoughts">Final Thoughts</h2>
<p>VectorChord‑BM25 brings search‑engine quality relevance to the PostgreSQL ecosystem without introducing an additional service layer. By combining flexible tokenization, an efficient BM25 index, and familiar SQL, it enables developers to deliver significantly better search experiences while preserving the operational simplicity of a single database system. If your application relies on PostgreSQL and demands accurate text ranking, VectorChord‑BM25 is well worth exploring.</p>
]]></content:encoded></item><item><title><![CDATA[VectorChord 0.4: Faster PostgreSQL Vector Search with Advanced I/O and Prefiltering]]></title><description><![CDATA[We're excited to announce the release of VectorChord 0.4, a significant update that enhances high-performance vector search within PostgreSQL. We believe this release pushes the boundaries of what's possible for vector search in PostgreSQL, introduci...]]></description><link>https://blog.vectorchord.ai/vectorchord-04-faster-postgresql-vector-search-with-advanced-io-and-prefiltering</link><guid isPermaLink="true">https://blog.vectorchord.ai/vectorchord-04-faster-postgresql-vector-search-with-advanced-io-and-prefiltering</guid><category><![CDATA[pgvector]]></category><category><![CDATA[vector database]]></category><category><![CDATA[VectorSearch]]></category><category><![CDATA[PostgreSQL]]></category><category><![CDATA[RAG ]]></category><category><![CDATA[llm]]></category><category><![CDATA[AI]]></category><dc:creator><![CDATA[Jinjing Zhou]]></dc:creator><pubDate>Thu, 05 Jun 2025 15:56:57 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1749114989020/7c086e5c-83cf-4650-b062-16845d20aa0d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We're excited to announce the release of <strong>VectorChord 0.4</strong>, a significant update that enhances high-performance vector search within PostgreSQL. We believe this release pushes the boundaries of what's possible for vector search in PostgreSQL, introducing key architectural improvements designed to lower latency and increase throughput for demanding, real-world workloads.</p>
<p>If you're working with similarity searches, RAG pipelines, or other AI-driven applications on PostgreSQL, these updates are for you. We've focused on two major areas: significantly enhancing I/O for cold queries by leveraging PostgreSQL's evolving capabilities, and optimizing filtered vector searches.</p>
<h3 id="heading-major-improvement-1-advanced-io-for-disk-bound-indexes-2x-3x-lower-cold-query-latency">🚀 Major Improvement 1: Advanced I/O for Disk-Bound Indexes (2x-3x Lower Cold Query Latency!)</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1749113884459/3282ee41-625d-4b8a-9373-8b2eb8d81f1c.png" alt="async i/o figure" /></p>
<p>VectorChord has supported <strong>disk-based indexing</strong> since its early versions. Previously, a challenge was I/O efficiency, as full-precision vectors were read one-by-one due to <strong>limitations in PostgreSQL's older internal APIs for buffer operations</strong>.</p>
<p>VectorChord 0.4 addresses this with a <strong>rewritten page layout</strong> and the adoption of <strong>modern streaming I/O techniques.</strong> We're proud to be <strong>one of the first, if not the first, PostgreSQL extensions to adopt these new streaming I/O APIs</strong> as they become available. This allows us to pipeline I/O with computation and issue disk I/O operations more effectively. As a part of this effort, we've also <strong>contributed initial PostgreSQL 18 support to the</strong> <code>pgrx</code> framework to enable testing and development with these upcoming AIO capabilities, benefiting the wider PostgreSQL community.</p>
<p>Our approach adapts to your PostgreSQL version:</p>
<ul>
<li><p><strong>On PostgreSQL 17 (and newer with</strong> <code>io_method=sync</code><strong>): Utilizing</strong> <code>madvise</code> <strong>for Prefetching.</strong> When VectorChord calls PostgreSQL's new streaming I/O APIs, PostgreSQL's underlying implementation on PG17 (or PG18 with <code>io_method=sync</code>) may internally use <code>madvise(MADV_WILLNEED)</code>. This hints to the OS kernel about upcoming data needs, so data can be read from disk into the page cache ahead of time. When the actual I/O occurs, PostgreSQL can read directly from the page cache instead of waiting for disk I/O.</p>
</li>
<li><p><strong>Preparing for Asynchronous I/O: PostgreSQL 18 &amp;</strong> <code>io_uring</code><strong>.</strong> Looking ahead to PostgreSQL 18's true Asynchronous I/O (AIO) with <code>io_uring</code>, VectorChord 0.4 is engineered for integration. This will allow data to be fetched directly into PostgreSQL's shared buffers with minimal overhead and maximal efficiency.</p>
</li>
<li><p><strong>For Earlier PostgreSQL Versions (Pre-PG17): Streaming I/O Interface for Prefetching.</strong> For users on older PostgreSQL versions, we've <strong>implemented a similar streaming I/O interface within VectorChord for backward compatibility.</strong> While it doesn't leverage kernel-level AIO or the newest PostgreSQL internal APIs, this interface allows us to achieve effective prefetching of buffers. This means users on earlier PostgreSQL versions can also benefit from the improved I/O patterns and reduced latency when fetching full-precision vectors.</p>
</li>
</ul>
<p><strong>The Impact of Advanced I/O for Disk Indexes:</strong> These I/O enhancements result in a <strong>2x-3x reduction in latency for cold queries</strong> in our benchmarks. This directly improves tail latency, especially when queries require re-ranking with disk-resident vectors.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1749113962813/e201cce3-2096-40c8-b318-95a1863e9e47.png" alt="Streaming I/O benchmark" /></p>
<h3 id="heading-major-improvement-2-pre-filtering-for-faster-filtered-searches-up-to-3x-faster">🎯 Major Improvement 2: Pre-filtering for Faster Filtered Searches (Up to 3x Faster!)</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1749113999755/aaf3b1f5-8289-4a15-9545-774dd9b3b954.png" alt="prefilter comparison" /></p>
<p>Vector search is often combined with metadata filtering. Addressing this efficiently has been a key focus for us. We were among the <strong>first to tackle the vector filtering challenge in PostgreSQL by introducing VBASE-style filtering</strong> in pgvecto.rs, which significantly improved performance over pgvector.</p>
<p>Now, with VectorChord 0.4, we're <strong>pushing the performance to the next level by introducing robust prefiltering support.</strong></p>
<ul>
<li><p><strong>Post-filtering (Previous Method):</strong> Find top-K vectors, then filter. Inefficient for selective filters.</p>
</li>
<li><p><strong>Pre-filtering (New in 0.4):</strong> Based on bit vector scan results, identify rows satisfying metadata filters <em>first</em>, then perform vector distance calculations only on this smaller set.</p>
</li>
</ul>
<p>We now use efficient mechanisms based on quantized bit-vector scans to identify matching rows <em>before</em> computing vector distances. Since filter checks are much lighter than distance calculations, this provides a significant speedup. Users can enable it with <code>SET vchordrq.prefilter=ON</code>.</p>
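<p>Conceptually, the two strategies differ as follows (a toy Python sketch of the idea, not the extension's implementation):</p>
<pre><code class="lang-python">def post_filter(candidates, distance, predicate, k):
    """Rank everything by the costly distance first, then drop rows that fail
    the metadata filter; a selective filter can leave fewer than k results."""
    top = sorted(candidates, key=distance)[:k]
    return [c for c in top if predicate(c)]

def pre_filter(candidates, distance, predicate, k):
    """Check the cheap predicate first and compute distances only for the
    survivors, saving most of the distance work under selective filters."""
    survivors = [c for c in candidates if predicate(c)]
    return sorted(survivors, key=distance)[:k]
</code></pre>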
<p><strong>The Impact:</strong> Our benchmarks show <strong>up to 3x faster search performance</strong> when pre-filtering is applicable, compared to our already optimized VBASE post-filtering approach. This offers considerable advantages for complex queries.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1749114043805/f1fbd015-c96e-4a22-8c20-bb249cd5ba24.png" alt /></p>
<h3 id="heading-other-notable-improvements">✨ Other Notable Improvements:</h3>
<ul>
<li><p><strong>Optimized Residual Quantization (~20% QPS Boost):</strong> Thanks to Jianyang (author of RaBitQ), we've refined residual quantization. By reformulating <code>&lt;o, q-c&gt;</code> as <code>&lt;o, q&gt; - &lt;o, c&gt;</code> (using the linearity of the inner product), the query vector <code>q</code> is quantized only once rather than once per probed cluster, yielding a <strong>~20% QPS improvement</strong>. We now recommend enabling residual quantization for L2 distance.</p>
</li>
<li><p><strong>Optimized Rotation Matrix &amp; Simplified Configuration:</strong> Also thanks to Jianyang, a new Fast Hadamard Transform optimizes rotation matrices and removes the <code>prewarm_dim</code> GUC, simplifying configuration and slightly speeding up the process.</p>
</li>
<li><p><strong>Rewritten Documentation:</strong> We've comprehensively rewritten our documentation, now offering <strong>more user guidance and a detailed API reference</strong> to help you get the most out of VectorChord. Check <a target="_blank" href="https://docs.vectorchord.ai/vectorchord/">https://docs.vectorchord.ai/vectorchord/</a>.</p>
</li>
</ul>
<h3 id="heading-get-started-with-vectorchord-04">Get Started with VectorChord 0.4!</h3>
<p>We believe this release significantly enhances vector search capabilities in PostgreSQL.</p>
<ul>
<li><p><strong>Download/Star on GitHub:</strong> https://github.com/tensorchord/VectorChord</p>
</li>
<li><p><strong>Join the Community:</strong> https://discord.gg/KqswhpVgdU We encourage you to try VectorChord 0.4, benchmark it with your workloads, and share your feedback.</p>
</li>
</ul>
<p>Happy Vector Searching!</p>
]]></content:encoded></item><item><title><![CDATA[VectorChord: Cost-Efficient Upload & Search of 400 Million Vectors on AWS]]></title><description><![CDATA[In this article, we will describe one method for uploading, indexing, and searching a big dataset of vector data in a cost-efficient manner using the real-world LAION-400M dataset and the VectorChord extension in PostgreSQL.
We're attempting to show ...]]></description><link>https://blog.vectorchord.ai/vectorchord-cost-efficient-upload-and-search-of-400-million-vectors-on-aws</link><guid isPermaLink="true">https://blog.vectorchord.ai/vectorchord-cost-efficient-upload-and-search-of-400-million-vectors-on-aws</guid><category><![CDATA[PostgreSQL]]></category><category><![CDATA[Databases]]></category><category><![CDATA[vector database]]></category><category><![CDATA[VectorSearch]]></category><category><![CDATA[Benchmark]]></category><category><![CDATA[qdrant]]></category><category><![CDATA[vector similarity]]></category><category><![CDATA[ANN]]></category><dc:creator><![CDATA[Junyu Chen]]></dc:creator><pubDate>Fri, 30 May 2025 06:35:21 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1748586910181/3a6521d2-4fa3-4eba-b57f-bf1849567f95.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this article, we will describe one method for uploading, indexing, and searching a big dataset of vector data in a cost-efficient manner using the real-world <a target="_blank" href="https://www.google.com/url?sa=E&amp;q=https%3A%2F%2Flaion.ai%2Fblog%2Flaion-400-open-dataset%2F">LAION-400M</a> dataset and the VectorChord extension in PostgreSQL.</p>
<p>We aim to show what kind of hardware setup you need to search this vast dataset with VectorChord, and to make sure the search experience is both accurate and fast.</p>
<p><a target="_blank" href="https://github.com/tensorchord/VectorChord">VectorChord</a> is a PostgreSQL extension that brings high-performance vector database capabilities directly into your PostgreSQL database, staying highly compatible with the popular pgvector extension's interface wherever possible.</p>
<p>This tutorial requires VectorChord <code>v0.3.0</code> or later. While VectorChord is compatible with most recent pgvector versions, we tested extensively against <code>pgvector 0.8.0</code>.</p>
<h2 id="heading-dataset"><strong>Dataset</strong></h2>
<p>The dataset used for this experiment is <a target="_blank" href="https://www.google.com/url?sa=E&amp;q=https%3A%2F%2Flaion.ai%2Fblog%2Flaion-400-open-dataset%2F">LAION-400M</a>, which is a dataset with approximately 400 million vectors derived from images. Each vector is 512-dimensional and was generated from a <a target="_blank" href="https://www.google.com/url?sa=E&amp;q=https%3A%2F%2Fopenai.com%2Fblog%2Fclip%2F">CLIP</a> model.</p>
<p>The total vector data is approximately 400 GB. The vectors have been normalized, so it is feasible to use multiple distance measures (L2, cosine, dot product). To be consistent with other benchmarks on this dataset and following the nature of CLIP embeddings, we employed the cosine distance measure in this experiment.</p>
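<p>For unit-norm vectors, the three metrics rank neighbors identically, since <code>||a-b||^2 = 2 - 2&lt;a,b&gt;</code> and cosine distance is <code>1 - &lt;a,b&gt;</code>, both monotone in the dot product. A quick numpy sanity check (illustrative only):</p>
<pre><code class="lang-python">import numpy as np

rng = np.random.default_rng(0)
a, b = rng.standard_normal((2, 512))
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)  # normalize to unit length
assert np.isclose(np.sum((a - b) ** 2), 2 - 2 * np.dot(a, b))
</code></pre>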
<p>LAION-400M is divided into 409 chunks, each containing around 1 million vectors. We will process these chunks sequentially for uploading.</p>
<h2 id="heading-algorithm"><strong>Algorithm</strong></h2>
<p>Searching through 400 million vectors is a huge challenge, especially as the data size means much of it must live on disk rather than in memory. We need algorithms built for this scale! A popular and effective method for handling large, disk-based vector data is the graph-based <a target="_blank" href="https://github.com/microsoft/DiskANN">DiskANN</a>, known for its <strong>low memory and resource cost</strong>.</p>
<p>However, like other graph-based indexes such as <a target="_blank" href="https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world">HNSW</a>, DiskANN has some inherent disadvantages:</p>
<ul>
<li><p><strong>Slow</strong> <a target="_blank" href="https://blog.vectorchord.ai/vector-search-over-postgresql-a-comparative-analysis-of-memory-and-disk-solutions#heading-experience-in-index-building"><strong>index building</strong></a>: For large scale datasets, it's well known that building indexes with DiskANN can take a long time.</p>
</li>
<li><p><strong>Poor</strong> <a target="_blank" href="https://blog.vectorchord.ai/vector-search-over-postgresql-a-comparative-analysis-of-memory-and-disk-solutions#heading-insertion-performance-for-stream-generated-data"><strong>insert performance</strong></a>: For streaming inserted vectors, updating a graph-based index requires traversing the partial graph. This is even more costly because DiskANN's on-disk graph must be loaded into memory.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1743503547507/9847a9c4-fcaa-4d76-bf82-3f5c5e959ef4.png?auto=compress,format&amp;format=webp" alt class="image--center mx-auto" /></p>
<p>VectorChord addresses these problems with its VChordRQ index, which combines a <strong>cluster-based index (IVF)</strong> with a clever data compression scheme called <a target="_blank" href="https://arxiv.org/abs/2405.12497">RaBitQ</a>. In addition to <strong>better precision and latency</strong> both <a target="_blank" href="https://blog.vectorchord.ai/vector-search-over-postgresql-a-comparative-analysis-of-memory-and-disk-solutions#heading-vectors-in-memory-for-small-scale-data">in memory</a> and <a target="_blank" href="https://blog.vectorchord.ai/vector-search-over-postgresql-a-comparative-analysis-of-memory-and-disk-solutions#heading-vectors-in-disk-for-large-scale-data">on disk</a>, RaBitQ's strength lies in its strong theoretical guarantee of accuracy, which sets it apart from other compression techniques. This helps us achieve an excellent trade-off between accuracy and efficiency!</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">For more comparisons with VChordRQ(<a target="_self" href="https://github.com/tensorchord/VectorChord">VectorChord</a>), HNSW(<a target="_self" href="https://github.com/pgvector/pgvector">pgvector</a>) and StreamDiskANN(<a target="_self" href="https://github.com/timescale/pgvectorscale">pgvectorscale</a>), please see our previous <a target="_self" href="https://blog.vectorchord.ai/vector-search-over-postgresql-a-comparative-analysis-of-memory-and-disk-solutions">blog</a>.</div>
</div>

<h2 id="heading-hardware"><strong>Hardware</strong></h2>
<p>Based on our experiments, we determined the following cloud instance type to be sufficient to index and query the dataset with acceptable latency:</p>
<ul>
<li><p><strong>Instance Type:</strong> AWS EC2 i8g.2xlarge ($501.0720 / month)</p>
</li>
<li><p><strong>CPU:</strong> 8 vCPUs (ARM64)</p>
</li>
<li><p><strong>RAM:</strong> 64GB</p>
</li>
<li><p><strong>Storage:</strong> 1875 GB NVMe SSD</p>
</li>
</ul>
<p>This configuration provides the necessary compute and ample fast local NVMe storage. The 64GB of RAM serves as cache to accelerate search queries.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">We used an instance with <strong>1875 GB</strong> NVMe storage, though only about <strong>1014 GB</strong> was needed. This was the <strong>minimum</strong> AWS option providing sufficient fast local SSD. Finding a machine with less storage could reduce costs.</div>
</div>

<p>For the PostgreSQL setup within this instance, we adjusted the <code>shared_buffers</code> parameter:</p>
<ul>
<li><p>Set to 48GB during the index building process.</p>
</li>
<li><p>Increased to 54GB for the search evaluation, allowing more data to be cached for better search performance.</p>
</li>
</ul>
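<p>As a rough sketch of how this can be scripted (the connection string is an assumption matching the Docker command shown in the next section; <code>shared_buffers</code> only takes effect after a restart, e.g. <code>docker restart vchord-pg17</code>):</p>
<pre><code class="lang-python">import psycopg  # psycopg 3

# ALTER SYSTEM cannot run inside a transaction block, hence autocommit=True.
with psycopg.connect(
    "postgresql://postgres:mysecretpassword@localhost:5432", autocommit=True
) as conn:
    conn.execute("ALTER SYSTEM SET shared_buffers = '48GB'")
# Restart PostgreSQL afterwards for the new value to take effect.
</code></pre>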
<h2 id="heading-uploading-and-indexing"><strong>Uploading and Indexing</strong></h2>
<p>The VectorChord VChordRQ index uses an inverted file (IVF) structure. While VChordRQ can compute the required centroids during index building, for an index of this size (400M vectors), pre-computing the centroids outside the system significantly accelerates the build.</p>
<p>For LAION-400M, we performed an external k-means clustering step to determine these centroids. This was done on an instance with an <code>A10 GPU</code> for faster computation.</p>
<ul>
<li><p>The clustering was configured for <code>nlist = 160000</code> clusters.</p>
</li>
<li><p>The clustering took approximately 220 seconds per iteration for 25 iterations, totaling about 1.5 hours.</p>
</li>
</ul>
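<p>The exact clustering script is not shown here, but a minimal sketch using <code>faiss</code> looks like the following (file names are hypothetical, and at this scale k-means is typically trained on a sample of the dataset rather than all 400M vectors):</p>
<pre><code class="lang-python">import numpy as np
import faiss  # faiss-gpu for the gpu=True option

d, nlist = 512, 160_000
sample = np.load("laion_sample.npy").astype(np.float32)  # hypothetical training sample, shape (n, 512)

kmeans = faiss.Kmeans(d, nlist, niter=25, verbose=True, gpu=True)
kmeans.train(sample)
np.save("centroids.npy", kmeans.centroids)  # (160000, 512) centroids to insert into Postgres
</code></pre>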
<p>Once we have computed the centroids, we set up the VectorChord-enabled PostgreSQL database. We used a pre-built Docker image for convenience:</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Mount PGDATA path to the NVMe SSD</span>
<span class="hljs-comment">-- sudo mount /dev/nvme1n1 /data</span>
docker run <span class="hljs-comment">--name vchord-pg17 \</span>
-e POSTGRES_PASSWORD=mysecretpassword -p 5432:5432 -v /data/pg:/var/lib/postgresql/data \
-d tensorchord/vchord-postgres:pg17-v0.3.0
</code></pre>
<p>This Docker image contains <code>PostgreSQL 17</code> with <code>pgvector 0.8.0</code> and <code>VectorChord 0.3.0</code> installed.</p>
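<p>Once the container is up, enable the extension in your database. A minimal sketch (the <code>CASCADE</code> clause also installs the <code>vector</code> dependency):</p>
<pre><code class="lang-python">import psycopg  # psycopg 3

with psycopg.connect("postgresql://postgres:mysecretpassword@localhost:5432") as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vchord CASCADE")
</code></pre>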
<p>The core of the upload process involves inserting the pre-computed centroids into a dedicated table and then inserting the vectors from the LAION dataset chunks into the main data table. This can be done efficiently using a client library like psycopg in Python with batched inserts.</p>
<p>First, insert the centroids into a table (e.g., laion_centroids):</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- The centroids table we created is:</span>
<span class="hljs-comment">-- CREATE TABLE laion_centroids (id SERIAL PRIMARY KEY, vector vector(512));</span>
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> laion_centroids (vector) <span class="hljs-keyword">VALUES</span> (<span class="hljs-string">'[...centroid vector...]'</span>); <span class="hljs-comment">-- Insert all 160000 centroids</span>
</code></pre>
<p>Then, insert the vectors from the dataset into the data table (e.g., laion):</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- The data table we created is:</span>
<span class="hljs-comment">-- CREATE TABLE laion (id BIGINT PRIMARY KEY, embedding vector(512));</span>
<span class="hljs-comment">-- <span class="hljs-doctag">Note:</span> VectorChord is compatible with pgvector's vector insert syntax.</span>
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> laion (<span class="hljs-keyword">id</span>, embedding) <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">0</span>, <span class="hljs-string">'[...vector data...]'</span>); <span class="hljs-comment">-- Batch inserts for chunks</span>
</code></pre>
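<p>A minimal psycopg sketch of the batched inserts described above (the chunk file name is hypothetical; for maximum throughput, <code>COPY</code> is usually even faster than <code>executemany</code>):</p>
<pre><code class="lang-python">import numpy as np
import psycopg  # psycopg 3

def to_literal(vec) -&gt; str:
    # pgvector accepts a text literal such as "[0.1,0.2,...]"
    return "[" + ",".join(f"{x:.6f}" for x in vec) + "]"

def insert_chunk(conn, start_id: int, embeddings, batch_size: int = 1000):
    with conn.cursor() as cur:
        for i in range(0, len(embeddings), batch_size):
            batch = [
                (start_id + i + j, to_literal(v))
                for j, v in enumerate(embeddings[i : i + batch_size])
            ]
            cur.executemany("INSERT INTO laion (id, embedding) VALUES (%s, %s)", batch)
    conn.commit()

with psycopg.connect("postgresql://postgres:mysecretpassword@localhost:5432") as conn:
    embeddings = np.load("chunk_0000.npy")  # hypothetical file: one LAION chunk, ~1M vectors
    insert_chunk(conn, 0, embeddings)
</code></pre>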
<p>After inserting all data, we build the VectorChord index (vchordrq). The "rq" in vchordrq refers to <a target="_blank" href="https://arxiv.org/abs/2405.12497">RaBitQ</a>; it is VectorChord's highly optimized index type, combining the IVF structure with RaBitQ quantization and efficient reranking.</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- <span class="hljs-doctag">Note:</span> The metric operator vector_cosine_ops is specified here, similar to pgvector. </span>
<span class="hljs-comment">-- The $$ syntax is used to pass build options as a text block.</span>
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">INDEX</span> laion_embedding_idx <span class="hljs-keyword">ON</span> laion <span class="hljs-keyword">USING</span> vchordrq (embedding vector_cosine_ops) <span class="hljs-keyword">WITH</span> (options = $$
[build.external]
<span class="hljs-keyword">table</span> = <span class="hljs-string">'public.laion_centroids'</span> <span class="hljs-comment">-- Link to the pre-computed centroids</span>
$$);
</code></pre>
<p>The image below shows the system resource utilization (CPU, memory, and disk usage) on the instance during the upload and indexing process. The disk usage graph shows the data being written (initial steep increase) and the subsequent index building (steady increase), resulting in a final size over 1TB. The CPU and memory graphs indicate the resources consumed throughout this process.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745721010613/c5d20e21-6c6f-43d4-a915-aa121497d496.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-ground-truth-data"><strong>Ground Truth Data</strong></h2>
<p>To correctly measure search performance and accuracy, comparing to ground truth is vital. The LAION dataset comes without inherent ground truth. <strong>We are indebted to the Qdrant team for having published the</strong> <a target="_blank" href="https://github.com/qdrant/laion-400m-benchmark/blob/master/expected.py"><strong>ground truth data</strong></a> <strong>they had prepared for their</strong> <a target="_blank" href="https://qdrant.tech/documentation/database-tutorials/large-scale-search/"><strong>tutorial</strong></a> <strong>on this dataset.</strong></p>
<p>Their methodology involved performing a full-scan nearest neighbor search for the first 100 vectors in the dataset to find the top 50 true nearest neighbors for each query. We <strong>used this exact ground truth file</strong> to evaluate VectorChord's search accuracy.</p>
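<p>Conceptually, computing such ground truth is a brute-force scan. A numpy sketch (the file name is hypothetical, and at 400M vectors the full matrix does not fit in RAM, so in practice the scan is done in batches):</p>
<pre><code class="lang-python">import numpy as np

vectors = np.load("laion_all.npy")  # hypothetical (N, 512) unit-norm matrix
queries = vectors[:100]             # the first 100 vectors serve as queries
scores = queries @ vectors.T        # cosine similarity reduces to a dot product
truth = np.argsort(-scores, axis=1)[:, :50]  # top-50 neighbor ids per query
</code></pre>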
<h2 id="heading-search-query"><strong>Search Query</strong></h2>
<p>VectorChord search employs two main parameters to control the performance-accuracy trade-off: nprobe and epsilon.</p>
<ul>
<li><p><strong>nprobe</strong>: Configured using the <code>vchord.probes</code> setting. This parameter determines how many clusters (partitions) of the IVF index are searched in the initial phase. A higher nprobe increases the search scope, potentially finding better results but also increasing latency.</p>
</li>
<li><p><strong>epsilon</strong>: Configured using the <code>vchord.epsilon</code> setting. This parameter controls the reranking precision. After fetching the initial candidates from the nprobed clusters, VectorChord performs a reranking step using more precise vector data. A larger epsilon means more candidates are considered and potentially reranked for higher recall, impacting latency.</p>
</li>
</ul>
<p>These parameters are set as PostgreSQL session variables before executing the search query:</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Example settings</span>
<span class="hljs-keyword">SET</span> vchord.probes = <span class="hljs-number">100</span>; <span class="hljs-comment">-- Configure nprobe</span>
<span class="hljs-keyword">SET</span> vchord.epsilon = <span class="hljs-number">0.8</span>; <span class="hljs-comment">-- Configure epsilon</span>
<span class="hljs-comment">-- The search query itself uses the standard pgvector distance operator</span>
<span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">id</span> <span class="hljs-keyword">FROM</span> laion <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> embedding &lt;=&gt; <span class="hljs-string">'[...query_vector...]'</span> <span class="hljs-keyword">LIMIT</span> <span class="hljs-number">50</span>;
</code></pre>
<p>The <code>ORDER BY embedding &lt;=&gt; '[...]'</code> clause leverages the index to find approximate nearest neighbors under the specified distance metric (here cosine distance, matching the CLIP embeddings). <code>LIMIT 50</code> retrieves the top 50 results.</p>
<h2 id="heading-running-search-requests"><strong>Running Search Requests</strong></h2>
<p>After the index build finished, we ran queries over the 100 ground truth vectors with varying nprobe and epsilon settings to explore the performance-precision trade-off.</p>
<p>We expressed search latency in milliseconds (ms). "Cold" latency refers to queries where the vector data of interest may need to be read from disk, while "warm" latency means the data is already in memory.</p>
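<p>Our measurement loop was conceptually similar to the following sketch (the ground-truth loading is elided; the connection string is an assumption from the setup above):</p>
<pre><code class="lang-python">from time import perf_counter

import psycopg  # psycopg 3

# Fill with (vector_literal, expected_id_set) pairs parsed from the ground-truth file.
ground_truth: list[tuple[str, set[int]]] = []

with psycopg.connect("postgresql://postgres:mysecretpassword@localhost:5432") as conn:
    with conn.cursor() as cur:
        for probes, epsilon in [(10, 0.8), (100, 0.8), (200, 1.0), (400, 1.5)]:
            cur.execute(f"SET vchord.probes = {probes}")
            cur.execute(f"SET vchord.epsilon = {epsilon}")
            precisions, start = [], perf_counter()
            for vec, expected in ground_truth:
                cur.execute(
                    "SELECT id FROM laion ORDER BY embedding &lt;=&gt; %s::vector LIMIT 50",
                    (vec,),
                )
                got = {row[0] for row in cur.fetchall()}
                precisions.append(len(got &amp; expected) / 50)
            n = max(len(ground_truth), 1)
            avg_ms = (perf_counter() - start) / n * 1000
            print(f"{probes}/{epsilon}: {avg_ms:.1f} ms/query, P@50={sum(precisions) / n:.4f}")
</code></pre>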
<p>Here are the average latencies and corresponding Precision@50 scores across the 100 queries:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>nprobe / epsilon</strong></td><td><strong>Cold Latency (ms)</strong></td><td><strong>Warm Latency (ms)</strong></td><td><strong>Precision@50</strong></td></tr>
</thead>
<tbody>
<tr>
<td>10 / 0.8</td><td><strong>112.5</strong></td><td><strong>7.0</strong></td><td>0.8192</td></tr>
<tr>
<td>100 / 0.8</td><td>194.2</td><td>15.0</td><td>0.9216</td></tr>
<tr>
<td>200 / 1.0</td><td>331.2</td><td>24.9</td><td>0.9438</td></tr>
<tr>
<td>400 / 1.5</td><td>970.9</td><td>51.6</td><td><strong>0.9608</strong></td></tr>
<tr>
<td>800 / 1.9</td><td>2174.0</td><td>2174.0</td><td><strong>0.9662</strong></td></tr>
</tbody>
</table>
</div><p>The table clearly illustrates the trade-off:</p>
<ul>
<li><p>Increasing <code>nprobe</code> and <code>epsilon</code> generally leads to higher Precision@50.</p>
</li>
<li><p>However, this comes at the cost of increased search latency, especially noticeable in cold scenarios where more disk I/O is required to fetch data from more clusters and perform more extensive reranking.</p>
</li>
<li><p>Warm cache performance is significantly faster, highlighting the importance of sufficient RAM for caching frequently accessed data.</p>
</li>
<li><p>Precision above 90% is achievable with warm latencies under 50ms; pushing precision higher costs hundreds of milliseconds, or more than a second for the best recall, depending on the parameters selected.</p>
</li>
</ul>
<p>The graph below shows the trade-off between Queries Per Second (QPS) and Precision@50 for VectorChord (cold and warm cache) alongside Qdrant, with Qdrant’s metrics taken from their <a target="_blank" href="https://qdrant.tech/documentation/database-tutorials/large-scale-search/#running-search-requests">published benchmarks</a> on a comparable 8 vCPU, 64 GB machine rather than run on the same hardware.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745753692001/731cb592-86e6-4eef-ae0f-c699df421a33.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>In this tutorial, we illustrated how VectorChord, as a PostgreSQL extension, can be used to upload, index, and search a 400-million-vector dataset cost-efficiently on relatively modest hardware with a fast NVMe SSD.</p>
<p>By using the VectorChord VChordRQ index with external centroids, we were able to handle the scale. Carefully setting <code>nprobe</code> and <code>epsilon</code> at query time provides precise control over the critical trade-off between search speed and accuracy, allowing users to tune the system to their specific needs.</p>
<p>VectorChord's compatibility with the pgvector interface simplifies adoption for current pgvector users, and its vchordrq index provides powerful capabilities for large-scale vector search directly within the robust and familiar PostgreSQL environment.</p>
]]></content:encoded></item><item><title><![CDATA[Optimize RAG with Contextual Retrieval in PostgreSQL]]></title><description><![CDATA[In the standard approach to Retrieval Augmented Generation (RAG), a common practice is to break down large documents into smaller, more manageable chunks. This is primarily done to improve the efficiency of the retrieval process, leveraging the sente...]]></description><link>https://blog.vectorchord.ai/optimize-rag-with-contextual-retrieval-in-postgresql</link><guid isPermaLink="true">https://blog.vectorchord.ai/optimize-rag-with-contextual-retrieval-in-postgresql</guid><category><![CDATA[contextual-retrieval]]></category><category><![CDATA[bm25]]></category><category><![CDATA[VectorSearch]]></category><category><![CDATA[PostgreSQL]]></category><category><![CDATA[gemini]]></category><category><![CDATA[llm]]></category><category><![CDATA[RAG ]]></category><dc:creator><![CDATA[Keming]]></dc:creator><pubDate>Fri, 30 May 2025 06:28:14 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1748586485438/666850dc-57aa-49e8-812f-72d39e2bc59b.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the standard approach to Retrieval Augmented Generation (RAG), a common practice is to break down large documents into smaller, more manageable chunks. This is primarily done to improve the efficiency of the retrieval process, leveraging the sentence embedding and keyword bm25 scoring, allowing systems to quickly find and pull relevant pieces of information. While effective in many scenarios, this method can introduce a significant challenge: the potential for individual chunks to lose their surrounding context.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747364145166/41e0e46b-bd45-4f32-a742-9bb176253b67.png" alt class="image--center mx-auto" /></p>
<p>An example could be from a collection of historical documents. If a user asks about a specific event, and a retrieved chunk says: "The treaty was signed, bringing an end to hostilities."</p>
<p>Without the surrounding text, this chunk lacks vital context such as <em>which treaty</em> was signed, <em>who the involved parties were</em>, and <em>when or where</em> this significant event took place. Relying solely on this decontextualized snippet would provide an incomplete and potentially inaccurate picture to the LLM, hindering its ability to generate a comprehensive and accurate answer about the historical event.</p>
<p>The example highlights that while chunking aids retrieval efficiency, it can inadvertently strip away the necessary context that gives meaning and relevance to the individual pieces of information. This loss of context can significantly impact the quality and accuracy of the responses generated by LLMs in a RAG system.</p>
<p>To tackle this problem, Anthropic explores a method called <a target="_blank" href="https://www.anthropic.com/news/contextual-retrieval"><strong>Contextual Retrieval</strong></a>.</p>
<p>In essence, it augments each chunk with context drawn from the whole document.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747364245704/d73b5c52-6ec5-453d-80e7-243942fbd8f4.png" alt class="image--center mx-auto" /></p>
<p>This article will walk through the steps to improve retrieval accuracy with the help of contextual information. We use the same <a target="_blank" href="https://github.com/anthropics/anthropic-cookbook"><strong>prompt,</strong> dataset, and metrics</a> described in the Anthropic tutorial. Note that we use a different embedding model and LLM in this experiment, so the results will differ. This does not reflect the models' accuracy, however, as the dataset consists of code rather than natural language.</p>
<p>We mainly use Gemini models to generate the chunk context and text embeddings, though other models also work. Embedding quality heavily affects the vector retrieval metrics, so we recommend checking the <a target="_blank" href="https://huggingface.co/spaces/mteb/leaderboard">leaderboard</a> to find the one that suits your use case. For keyword tokenization, we use the default BERT tokenizer; you can also try <a target="_blank" href="https://github.com/tensorchord/VectorChord-bm25/?tab=readme-ov-file#tokenizer">other solutions</a>.</p>
<p>Reranking is not the main point of this article; if you want to know more about rerank methods, check our articles on <a target="_blank" href="https://blog.vectorchord.ai/hybrid-search-with-postgres-native-bm25-and-vectorchord">cross-encoder rerank</a> and <a target="_blank" href="https://blog.vectorchord.ai/supercharge-vector-search-with-colbert-rerank-in-postgresql">token-level late interaction rerank</a>. Also, since we support a <a target="_blank" href="https://blog.vectorchord.ai/vectorchord-03-bringing-efficient-multi-vector-contextual-late-interaction-in-postgresql">multi-vector MaxSim index</a>, token-level late interaction rerank can be done more efficiently.</p>
<h3 id="heading-prepare-the-environment">Prepare the environment</h3>
<p>We provide both vector search and keyword search features in our <a target="_blank" href="https://github.com/tensorchord/VectorChord-images"><strong>VectorChord-suite</strong></a> docker image.</p>
<pre><code class="lang-bash">docker run --rm -d --name vdb -e POSTGRES_PASSWORD=postgres -p 5432:5432 ghcr.io/tensorchord/vchord-suite:pg17-20250414
</code></pre>
<p>We will use the <a target="_blank" href="https://github.com/tensorchord/vechord"><code>vechord</code></a> Python library to reduce the boilerplate code and simplify the process.</p>
<pre><code class="lang-bash">pip install <span class="hljs-string">'vechord[gemini]'</span>
</code></pre>
<h3 id="heading-define-the-rag-related-tables">Define the RAG-related tables</h3>
<p>Let’s get started with the basic RAG process.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> vechord.spec <span class="hljs-keyword">import</span> (
    ForeignKey,
    Keyword,
    PrimaryKeyAutoIncrease,
    Table,
    UniqueIndex,
    Vector,
)

DenseVector = Vector[<span class="hljs-number">768</span>]

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Document</span>(<span class="hljs-params">Table, kw_only=True</span>):</span>
    uid: Optional[PrimaryKeyAutoIncrease] = <span class="hljs-literal">None</span>
    uuid: Annotated[str, UniqueIndex()]
    content: str

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Chunk</span>(<span class="hljs-params">Table, kw_only=True</span>):</span>
    uid: Optional[PrimaryKeyAutoIncrease] = <span class="hljs-literal">None</span>
    doc_uuid: Annotated[str, ForeignKey[Document.uuid]]
    index: int
    content: str
    vector: DenseVector
    keyword: Keyword

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Query</span>(<span class="hljs-params">Table, kw_only=True</span>):</span>
    uid: Optional[PrimaryKeyAutoIncrease] = <span class="hljs-literal">None</span>
    content: str
    answer: str
    doc_uuids: list[str]
    chunk_index: list[int]
    vector: DenseVector
</code></pre>
<p>This includes the vector embedding index and keyword index.</p>
<p>To load the datasets:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> vechord.embedding <span class="hljs-keyword">import</span> GeminiDenseEmbedding
<span class="hljs-keyword">from</span> vechord.registry <span class="hljs-keyword">import</span> VechordRegistry

emb = GeminiDenseEmbedding()
vr = VechordRegistry(<span class="hljs-string">"anthropic"</span>, <span class="hljs-string">"postgresql://postgres:postgres@172.17.0.1:5432/"</span>)

vr.register([Document, Chunk, Query])

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load_data</span>(<span class="hljs-params">filepath: str</span>):</span>
    <span class="hljs-keyword">with</span> open(filepath, <span class="hljs-string">"r"</span>, encoding=<span class="hljs-string">"utf-8"</span>) <span class="hljs-keyword">as</span> f:
        docs = json.load(f)
        <span class="hljs-keyword">for</span> doc <span class="hljs-keyword">in</span> docs:
            vr.insert(
                Document(
                    uuid=doc[<span class="hljs-string">"original_uuid"</span>],
                    content=doc[<span class="hljs-string">"content"</span>],
                )
            )
            <span class="hljs-keyword">for</span> chunk <span class="hljs-keyword">in</span> doc[<span class="hljs-string">"chunks"</span>]:
                vr.insert(
                    Chunk(
                        doc_uuid=doc[<span class="hljs-string">"original_uuid"</span>],
                        index=chunk[<span class="hljs-string">"original_index"</span>],
                        content=chunk[<span class="hljs-string">"content"</span>],
                        vector=emb.vectorize_chunk(chunk[<span class="hljs-string">"content"</span>]),
                        keyword=Keyword(chunk[<span class="hljs-string">"content"</span>]),
                    )
                )

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load_query</span>(<span class="hljs-params">filepath: str</span>):</span>
    queries = []
    <span class="hljs-keyword">with</span> open(filepath, <span class="hljs-string">"r"</span>, encoding=<span class="hljs-string">"utf-8"</span>) <span class="hljs-keyword">as</span> f:
        <span class="hljs-keyword">for</span> line <span class="hljs-keyword">in</span> f:
            query = json.loads(line)
            queries.append(
                Query(
                    content=query[<span class="hljs-string">"query"</span>],
                    answer=query[<span class="hljs-string">"answer"</span>],
                    doc_uuids=[x[<span class="hljs-number">0</span>] <span class="hljs-keyword">for</span> x <span class="hljs-keyword">in</span> query[<span class="hljs-string">"golden_chunk_uuids"</span>]],
                    chunk_index=[x[<span class="hljs-number">1</span>] <span class="hljs-keyword">for</span> x <span class="hljs-keyword">in</span> query[<span class="hljs-string">"golden_chunk_uuids"</span>]],
                    vector=emb.vectorize_query(query[<span class="hljs-string">"query"</span>]),
                )
            )
    vr.copy_bulk(queries)
</code></pre>
<h3 id="heading-evaluation">Evaluation</h3>
<p>Now we have everything for the basic RAG process. Let’s define the evaluation metric <code>Pass@k</code>, which measures what fraction of the ground-truth chunks appear in the retrieved top-k chunks.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">evaluate</span>(<span class="hljs-params">topk=<span class="hljs-number">5</span>, search_func=vector_search</span>):</span>
    print(<span class="hljs-string">f"TopK=<span class="hljs-subst">{topk}</span>, search by: <span class="hljs-subst">{search_func.__name__}</span>"</span>)
    queries: list[Query] = vr.select_by(Query.partial_init())
    total_score = <span class="hljs-number">0</span>
    start = perf_counter()
    <span class="hljs-keyword">for</span> query <span class="hljs-keyword">in</span> queries:
        chunks: list[Chunk] = search_func(query, topk)
        count = <span class="hljs-number">0</span>
        <span class="hljs-keyword">for</span> doc_uuid, chunk_index <span class="hljs-keyword">in</span> zip(
            query.doc_uuids, query.chunk_index, strict=<span class="hljs-literal">True</span>
        ):
            <span class="hljs-keyword">for</span> chunk <span class="hljs-keyword">in</span> chunks:
                <span class="hljs-keyword">if</span> chunk.doc_uuid == doc_uuid <span class="hljs-keyword">and</span> chunk.index == chunk_index:
                    count += <span class="hljs-number">1</span>
                    <span class="hljs-keyword">break</span>
        score = count / len(query.doc_uuids)
        total_score += score

    print(
        <span class="hljs-string">f"Pass@<span class="hljs-subst">{topk}</span>: <span class="hljs-subst">{total_score / len(queries):<span class="hljs-number">.4</span>f}</span>, total queries: <span class="hljs-subst">{len(queries)}</span>, QPS: <span class="hljs-subst">{len(queries) / (perf_counter() - start):<span class="hljs-number">.3</span>f}</span>"</span>
    )
</code></pre>
<p>We can try different retrieval strategies: vector search, keyword search, and hybrid search with fusion or reranking. These strategies can be defined as:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> vechord.rerank <span class="hljs-keyword">import</span> CohereReranker, ReciprocalRankFusion

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">vector_search</span>(<span class="hljs-params">query: Query, topk: int</span>) -&gt; list[Chunk]:</span>
    <span class="hljs-keyword">return</span> vr.search_by_vector(Chunk, query.vector, topk=topk)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">keyword_search</span>(<span class="hljs-params">query: Query, topk: int</span>) -&gt; list[Chunk]:</span>
    <span class="hljs-keyword">return</span> vr.search_by_keyword(Chunk, query.content, topk=topk)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">hybrid_search_fuse</span>(<span class="hljs-params">query: Query, topk: int</span>) -&gt; list[Chunk]:</span>
    rrf = ReciprocalRankFusion()
    <span class="hljs-keyword">return</span> rrf.fuse([vector_search(query, topk), keyword_search(query, topk)])[:topk]

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">hybrid_search_rerank</span>(<span class="hljs-params">query: Query, topk: int, boost=<span class="hljs-number">3</span></span>) -&gt; list[Chunk]:</span>
    ranker = CohereReranker()
    vecs = vector_search(query, topk * boost)
    keys = keyword_search(query, topk * boost)
    chunks = list({chunk.uid: chunk <span class="hljs-keyword">for</span> chunk <span class="hljs-keyword">in</span> vecs + keys}.values())
    indices = ranker.rerank(query.content, [chunk.content <span class="hljs-keyword">for</span> chunk <span class="hljs-keyword">in</span> chunks])
    <span class="hljs-keyword">return</span> [chunks[i] <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> indices[:topk]]
</code></pre>
<h3 id="heading-contextual-retrieval">Contextual retrieval</h3>
<p>The LLM generates contextual information with a prompt like:</p>
<pre><code class="lang-python">prompt = (
    <span class="hljs-string">"&lt;document&gt;\n{whole_document}\n&lt;/document&gt;"</span>
    <span class="hljs-string">"Here is the chunk we want to situate within the whole document \n"</span>
    <span class="hljs-string">"&lt;chunk&gt;\n{chunk}\n&lt;/chunk&gt;\n"</span>
    <span class="hljs-string">"Please give a short succinct context to situate this chunk within "</span>
    <span class="hljs-string">"the overall document for the purposes of improving search retrieval "</span>
    <span class="hljs-string">"of the chunk. Answer only with the succinct context and nothing else."</span>
)
</code></pre>
<p>The prompt above is a general one designed for retrieval tasks. You may find that a more specialized prompt offers better performance for your specific use case.</p>
<p>This feature is already included in the <code>vechord</code> library.</p>
<p><code>GeminiAugmenter</code> also supports prompt caching, as <code>Claude</code> does, but Gemini requires the cached content to be at least 32,768 tokens, which this dataset does not reach.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> vechord.augment <span class="hljs-keyword">import</span> GeminiAugmenter

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">ContextualChunk</span>(<span class="hljs-params">Table, kw_only=True</span>):</span>
    uid: Optional[PrimaryKeyAutoIncrease] = <span class="hljs-literal">None</span>
    doc_uuid: Annotated[str, ForeignKey[Document.uuid]]
    index: int
    content: str
    context: str
    vector: DenseVector
    keyword: Keyword

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load_contextual_chunks</span>(<span class="hljs-params">filepath: str</span>):</span>
    augmenter = GeminiAugmenter()

    <span class="hljs-keyword">with</span> open(filepath, <span class="hljs-string">"r"</span>, encoding=<span class="hljs-string">"utf-8"</span>) <span class="hljs-keyword">as</span> f:
        docs = json.load(f)
        <span class="hljs-keyword">for</span> doc <span class="hljs-keyword">in</span> docs:
            augmenter.reset(doc[<span class="hljs-string">"content"</span>])
            chunks = doc[<span class="hljs-string">"chunks"</span>]
            augments = augmenter.augment_context([chunk[<span class="hljs-string">"content"</span>] <span class="hljs-keyword">for</span> chunk <span class="hljs-keyword">in</span> chunks])
            <span class="hljs-keyword">if</span> len(augments) != len(chunks):
                print(<span class="hljs-string">f"augments length not match for uuid: <span class="hljs-subst">{doc[<span class="hljs-string">'original_uuid'</span>]}</span>, <span class="hljs-subst">{len(augments)}</span> != <span class="hljs-subst">{len(chunks)}</span>"</span>)
            <span class="hljs-keyword">for</span> chunk, context <span class="hljs-keyword">in</span> zip(chunks, augments, strict=<span class="hljs-literal">False</span>):
                contextual_content = <span class="hljs-string">f"<span class="hljs-subst">{chunk[<span class="hljs-string">'content'</span>]}</span>\n\n<span class="hljs-subst">{context}</span>"</span>
                vr.insert(
                    ContextualChunk(
                        doc_uuid=doc[<span class="hljs-string">"original_uuid"</span>],
                        index=chunk[<span class="hljs-string">"original_index"</span>],
                        content=chunk[<span class="hljs-string">"content"</span>],
                        context=context,
                        vector=emb.vectorize_chunk(contextual_content),
                        keyword=Keyword(contextual_content),
                    )
                )
</code></pre>
<p>Then, we can expand the strategies like:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">vector_contextual_search</span>(<span class="hljs-params">query: Query, topk: int</span>) -&gt; list[ContextualChunk]:</span>
    <span class="hljs-keyword">return</span> vr.search_by_vector(ContextualChunk, query.vector, topk=topk)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">keyword_contextual_search</span>(<span class="hljs-params">query: Query, topk: int</span>) -&gt; list[ContextualChunk]:</span>
    <span class="hljs-keyword">return</span> vr.search_by_keyword(ContextualChunk, query.content, topk=topk)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">hybrid_contextual_search_fuse</span>(<span class="hljs-params">query: Query, topk: int</span>) -&gt; list[ContextualChunk]:</span>
    rrf = ReciprocalRankFusion()
    <span class="hljs-keyword">return</span> rrf.fuse(
        [vector_contextual_search(query, topk), keyword_contextual_search(query, topk)]
    )[:topk]

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">hybrid_contextual_search_rerank</span>(<span class="hljs-params">
    query: Query, topk: int, boost=<span class="hljs-number">3</span>
</span>) -&gt; list[ContextualChunk]:</span>
    ranker = CohereReranker()
    vecs = vector_contextual_search(query, topk * boost)
    keys = keyword_contextual_search(query, topk * boost)
    chunks = list({chunk.uid: chunk <span class="hljs-keyword">for</span> chunk <span class="hljs-keyword">in</span> vecs + keys}.values())
    indices = ranker.rerank(
        query.content, [<span class="hljs-string">f"<span class="hljs-subst">{chunk.content}</span>\n<span class="hljs-subst">{chunk.context}</span>"</span> <span class="hljs-keyword">for</span> chunk <span class="hljs-keyword">in</span> chunks]
    )
    <span class="hljs-keyword">return</span> [chunks[i] <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> indices[:topk]]
</code></pre>
<h3 id="heading-benchmark">Benchmark</h3>
<ul>
<li><code>topk=5</code></li>
</ul>
<div class="hn-table">
<table>
<thead>
<tr>
<td></td><td>Pass_at_5</td><td>QPS</td></tr>
</thead>
<tbody>
<tr>
<td>vector search</td><td>0.8071</td><td>289.024</td></tr>
<tr>
<td>keyword search</td><td>0.7003</td><td>337.211</td></tr>
<tr>
<td>vector contextual search</td><td>0.8404</td><td>307.253</td></tr>
<tr>
<td>keyword contextual search</td><td>0.8033</td><td>331.239</td></tr>
</tbody>
</table>
</div><ul>
<li><code>topk=10</code></li>
</ul>
<div class="hn-table">
<table>
<thead>
<tr>
<td></td><td>Pass_at_10</td><td>QPS</td></tr>
</thead>
<tbody>
<tr>
<td>vector search</td><td>0.8574</td><td>254.730</td></tr>
<tr>
<td>keyword search</td><td>0.7577</td><td>219.653</td></tr>
<tr>
<td>vector contextual search</td><td>0.8807</td><td>247.819</td></tr>
<tr>
<td>keyword contextual search</td><td>0.8563</td><td>216.560</td></tr>
</tbody>
</table>
</div><p>All the code can be found in our <a target="_blank" href="https://github.com/tensorchord/vechord/pull/30"><code>vechord</code></a> repository.</p>
]]></content:encoded></item><item><title><![CDATA[All-in-one VectorChord Suite: Building Production-Ready RAG Solutions Directly in PostgreSQL]]></title><description><![CDATA[Retrieval-Augmented Generation (RAG) is revolutionizing our interaction with vast datasets and language models. RAG systems offer more precise, contextually aware, and current answers by retrieving pertinent information before generating responses. N...]]></description><link>https://blog.vectorchord.ai/all-in-one-vectorchord-suite-building-production-ready-rag-solutions-directly-in-postgresql</link><guid isPermaLink="true">https://blog.vectorchord.ai/all-in-one-vectorchord-suite-building-production-ready-rag-solutions-directly-in-postgresql</guid><category><![CDATA[PostgreSQL]]></category><category><![CDATA[RAG ]]></category><category><![CDATA[bm25]]></category><category><![CDATA[search]]></category><category><![CDATA[llm]]></category><category><![CDATA[vector database]]></category><category><![CDATA[Databases]]></category><dc:creator><![CDATA[xieydd]]></dc:creator><pubDate>Tue, 29 Apr 2025 00:51:17 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1745887781588/efbd90bf-3049-435f-8aaf-9edd99765384.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Retrieval-Augmented Generation (RAG) is revolutionizing our interaction with vast datasets and language models. RAG systems offer more precise, contextually aware, and current answers by retrieving pertinent information before generating responses. Nonetheless, constructing robust, scalable, and efficient RAG pipelines poses considerable challenges, particularly for production use.</p>
<p>Many solutions feature complex architectures, using distinct databases for vector search, keyword search, and primary data storage. This commonly results in data synchronization challenges, greater infrastructure complexity, and higher operational costs.</p>
<p>What if you could build a powerful, production-ready RAG system directly within your trusted PostgreSQL database?</p>
<p>Enter the <strong>VectorChord Suite</strong>, a collection of PostgreSQL extensions designed to bring high-performance vector search, native BM25 ranking, and flexible tokenization capabilities right into Postgres. Let's explore the components and how they enable next-generation RAG solutions.</p>
<h2 id="heading-what-is-the-vectorchord-suite">What is the VectorChord Suite?</h2>
<p>The suite comprises three key PostgreSQL extensions working in concert:</p>
<ol>
<li><p><a target="_blank" href="https://github.com/tensorchord/VectorChord"><strong>VectorChord:</strong></a> The core vector search engine. It's specifically designed for scalable, high-performance, and disk-efficient vector similarity search within PostgreSQL.</p>
</li>
<li><p><a target="_blank" href="https://github.com/tensorchord/VectorChord-bm25"><strong>VectorChord-bm25:</strong></a> This extension implements the sophisticated BM25 ranking algorithm directly inside PostgreSQL, leveraging efficient Block-WeakAnd algorithms. BM25 is a standard for relevance ranking based on keyword frequency and document characteristics.</p>
</li>
<li><p><a target="_blank" href="https://github.com/tensorchord/pg_tokenizer.rs"><strong>pg_</strong></a><a target="_blank" href="http://tokenizer.rs"><strong>tokenizer.rs</strong></a><a target="_blank" href="https://github.com/tensorchord/pg_tokenizer.rs"><strong>:</strong></a> Provides essential text tokenization capabilities needed for effective full-text search, enabling fine-grained control over how text is processed for full-text search.</p>
</li>
</ol>
<p>By combining these extensions, you unlock powerful capabilities for building advanced RAG systems entirely within PostgreSQL.</p>
<h2 id="heading-how-to-use-the-vectorchord-suite">How to Use the VectorChord Suite</h2>
<p>You can use the <code>tensorchord/vchord-suite</code> image to run the extensions provided by TensorChord. The image is based on the official Postgres image and includes the extensions shown below:</p>
<pre><code class="lang-powershell">docker run   \           
  -<span class="hljs-literal">-name</span> vchord<span class="hljs-literal">-suite</span>  \
  <span class="hljs-literal">-e</span> POSTGRES_PASSWORD=postgres  \
  <span class="hljs-literal">-p</span> <span class="hljs-number">5432</span>:<span class="hljs-number">5432</span> \
  <span class="hljs-literal">-d</span> tensorchord/vchord<span class="hljs-literal">-suite</span>:pg17<span class="hljs-literal">-latest</span>
  <span class="hljs-comment"># If you want to use ghcr image, you can change the image to `ghcr.io/tensorchord/vchord-suite:pg17-latest`.</span>
  <span class="hljs-comment"># if you want to use the specific version, you can use the tag `pg17-20250414`, supported version can be found in the support matrix.</span>
</code></pre>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> EXTENSION <span class="hljs-keyword">IF</span> <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">EXISTS</span> vchord <span class="hljs-keyword">CASCADE</span>;
<span class="hljs-keyword">CREATE</span> EXTENSION <span class="hljs-keyword">IF</span> <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">EXISTS</span> pg_tokenizer <span class="hljs-keyword">CASCADE</span>;
<span class="hljs-keyword">CREATE</span> EXTENSION <span class="hljs-keyword">IF</span> <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">EXISTS</span> vchord_bm25 <span class="hljs-keyword">CASCADE</span>;
\dx
pg_tokenizer | 0.1.0   | tokenizer_catalog | pg_tokenizer
vchord       | 0.3.0   | public            | vchord: Vector database plugin for Postgres, written in Rust, specifically designed for LLM
vchord_bm25  | 0.2.0   | bm25_catalog      | vchord_bm25: A postgresql extension for bm25 ranking algorithm
vector       | 0.8.0   | public            | vector data type and ivfflat and hnsw access methods
</code></pre>
<h2 id="heading-use-cases-for-the-vectorchord-suite">Use Cases for the VectorChord Suite</h2>
<h3 id="heading-use-case-1-powerful-hybrid-search-with-native-bm25-and-vectorchord">Use Case 1: Powerful Hybrid Search with Native BM25 and VectorChord</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744643437495/b57592c9-e1c2-4940-ae59-751ff93e37a3.jpeg" alt class="image--center mx-auto" /></p>
<p>In the RAG era, effective retrieval is paramount. Neither keyword search nor vector search alone is perfect:</p>
<ul>
<li><p><strong>Keyword Search (like BM25):</strong> Excels at precision, finding documents with exact keyword matches. It's great for structured queries and term-specific searches. However, it struggles with synonyms, paraphrasing, and understanding the underlying <em>meaning</em> or semantic intent. (Leverages <code>VectorChord-bm25</code> and <code>pg_tokenizer</code>).</p>
</li>
<li><p><strong>Vector Search:</strong> Captures deep semantic meaning and relationships between concepts, allowing it to find relevant information even if the exact keywords aren't present. However, it can sometimes lack precision for queries demanding specific term matches. (Leverages <code>VectorChord</code>).</p>
</li>
</ul>
<p><strong>The Solution: Hybrid Search.</strong> By combining the strengths of both approaches within Postgres, the VectorChord Suite bridges this gap. You can run a query that leverages:</p>
<ul>
<li><p><code>VectorChord-bm25</code> (powered by <code>pg_tokenizer</code>) for keyword precision.</p>
</li>
<li><p><code>VectorChord</code> for semantic understanding.</p>
</li>
</ul>
<p>The results from both can be intelligently combined (e.g., using reciprocal rank fusion (RRF) or model-based reranking) to produce a final ranking that is more accurate, contextually relevant, and semantically aware than either method could achieve alone. This delivers significantly better retrieval performance for your RAG applications; a minimal RRF sketch follows below.</p>
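<p>RRF itself is only a few lines: each candidate's score is the sum of <code>1 / (k + rank)</code> over the rankings it appears in, with <code>k = 60</code> as the commonly used constant from the original RRF paper. A minimal sketch (not the exact implementation used by any particular library):</p>
<pre><code class="lang-python">def rrf(rankings: list[list[str]], k: int = 60) -&gt; list[str]:
    """Fuse several ranked lists of document ids into one."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

print(rrf([["a", "b", "c"], ["c", "a", "d"]]))  # ['a', 'c', 'b', 'd']
</code></pre>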
<h3 id="heading-use-case-2-beyond-text-ocr-free-rag-with-colqwen2-amp-vectorchord">Use Case 2: Beyond Text - OCR-Free RAG with ColQwen2 &amp; VectorChord</h3>
<p>Building RAG systems for documents like PDFs or scanned images often involves cumbersome pre-processing pipelines. Traditional methods rely heavily on Optical Character Recognition (OCR) and layout analysis to extract text. This process can be:</p>
<ul>
<li><p><strong>Slow:</strong> OCR can be computationally expensive.</p>
</li>
<li><p><strong>Error-Prone:</strong> OCR accuracy varies significantly depending on document quality.</p>
</li>
<li><p><strong>Lossy:</strong> Crucial visual context such as tables, figures, formatting, and relative positioning is often lost during text extraction.</p>
</li>
</ul>
<p><strong>The Solution: OCR-Free, Visually-Aware RAG.</strong> What if you could query documents based on <em>how they look</em> and their <em>content</em>, without explicit OCR?</p>
<p>This is now achievable by combining:</p>
<ol>
<li><p><strong>Multi-Modal Vision Language Models (VLMs):</strong> Models like ColQwen2 can process images (document pages) and generate embeddings that capture <em>both</em> textual content and visual layout information.</p>
</li>
<li><p><strong>VectorChord's Multi-Vector Capabilities:</strong> VectorChord can efficiently store and search <em>multiple</em> vectors per document within Postgres – allowing you to store embeddings representing different aspects (e.g., text content, visual layout).</p>
</li>
</ol>
<p>With this setup, you can query your document database using prompts that reference visual elements ("Find documents with a bar chart comparing sales figures") or combined textual and visual cues, directly within PostgreSQL. This simplifies your RAG stack, potentially boosts retrieval accuracy by preserving visual context, and eliminates the bottlenecks associated with traditional OCR pipelines.</p>
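<p>Under the hood, late-interaction scoring such as ColBERT-style MaxSim sums, for each query token, its best similarity against any document token. A minimal numpy sketch of the scoring function (assuming unit-norm token embeddings; VectorChord evaluates this kind of score efficiently inside its multi-vector index):</p>
<pre><code class="lang-python">import numpy as np

def maxsim(query_vecs: np.ndarray, doc_vecs: np.ndarray) -&gt; float:
    # query_vecs: (num_query_tokens, dim); doc_vecs: (num_doc_tokens, dim)
    sims = query_vecs @ doc_vecs.T        # pairwise token similarities
    return float(sims.max(axis=1).sum())  # best document token per query token
</code></pre>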
<h2 id="heading-why-build-rag-in-postgresql">Why Build RAG in PostgreSQL?</h2>
<p>Leveraging the VectorChord Suite within Postgres offers significant advantages:</p>
<ul>
<li><p><strong>Unified Data:</strong> Keep your source data, text, metadata, and vector embeddings all in one place.</p>
</li>
<li><p><strong>Reduced Complexity:</strong> Eliminate the need for separate vector databases and synchronization pipelines.</p>
</li>
<li><p><strong>Leverage Existing Infrastructure:</strong> Utilize your existing Postgres expertise, tooling, and operational practices.</p>
</li>
<li><p><strong>Transactional Integrity:</strong> Benefit from PostgreSQL's robust ACID compliance.</p>
</li>
<li><p><strong>Rich Ecosystem:</strong> Access the wide array of tools and features available within the Postgres ecosystem.</p>
</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>The VectorChord Suite transforms PostgreSQL into a powerful, production-ready platform for building advanced RAG solutions. By integrating high-performance vector search (<code>VectorChord</code>), sophisticated keyword ranking (<code>VectorChord-bm25</code>), and flexible text processing (<code>pg_tokenizer</code>), you can implement cutting-edge techniques like hybrid search and OCR-free multi-modal retrieval directly within your database. Simplify your architecture, enhance retrieval accuracy, and unlock the full potential of RAG with VectorChord in PostgreSQL.</p>
<h2 id="heading-previous-blog-post">Previous Blog Post</h2>
<p>In our previous blog posts, we shared some use cases where all the images can be replaced with the VectorChord Suite image.</p>
<ul>
<li><p><a target="_blank" href="https://blog.vectorchord.ai/supercharge-vector-search-with-colbert-rerank-in-postgresql">Supercharge vector search with ColBERT rerank in PostgreSQL</a></p>
</li>
<li><p><a target="_blank" href="https://blog.vectorchord.ai/hybrid-search-with-postgres-native-bm25-and-vectorchord">Hybrid search with Postgres Native BM25 and VectorChord</a></p>
</li>
<li><p><a target="_blank" href="https://blog.vectorchord.ai/preview/67e35c2ffba3e0edbac95265">Beyond Text: Unlock OCR-Free RAG in PostgreSQL with Modal &amp; VectorChord</a></p>
</li>
</ul>
<h2 id="heading-references">References</h2>
<ul>
<li><p><a target="_blank" href="https://blog.vectorchord.ai/vectorchord-store-400k-vectors-for-1-in-postgresql">https://blog.vectorchord.ai/vectorchord-store-400k-vectors-for-1-in-postgresql</a></p>
</li>
<li><p><a target="_blank" href="https://blog.vectorchord.ai/vectorchord-bm25-revolutionize-postgresql-search-with-bm25-ranking-3x-faster-than-elasticsearch">https://blog.vectorchord.ai/vectorchord-bm25-revolutionize-postgresql-search-with-bm25-ranking-3x-faster-than-elasticsearch</a></p>
</li>
<li><p><a target="_blank" href="https://blog.vectorchord.ai/vectorchord-bm25-introducing-pgtokenizera-standalone-multilingual-tokenizer-for-advanced-search">https://blog.vectorchord.ai/vectorchord-bm25-introducing-pgtokenizera-standalone-multilingual-tokenizer-for-advanced-search</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Introducing Vechord: Turn PostgreSQL into your search engine in a Pythonic way.]]></title><description><![CDATA[Today, we're thrilled to announce the release of Vechord, a new Python library designed to dramatically simplify building robust search infrastructure directly on top of the PostgreSQL database.
In the rapidly evolving world of AI and large language ...]]></description><link>https://blog.vectorchord.ai/introducing-vechord-turn-postgresql-into-your-search-engine-in-a-pythonic-way</link><guid isPermaLink="true">https://blog.vectorchord.ai/introducing-vechord-turn-postgresql-into-your-search-engine-in-a-pythonic-way</guid><category><![CDATA[VectorSearch]]></category><category><![CDATA[hybrid search]]></category><category><![CDATA[keyword search]]></category><category><![CDATA[full text search]]></category><category><![CDATA[bm25]]></category><category><![CDATA[RAG ]]></category><category><![CDATA[PostgreSQL]]></category><dc:creator><![CDATA[Keming]]></dc:creator><pubDate>Wed, 23 Apr 2025 05:21:22 GMT</pubDate><content:encoded><![CDATA[<p>Today, we're thrilled to announce the release of <a target="_blank" href="https://github.com/tensorchord/vechord"><strong>Vechord</strong></a>, a new Python library designed to dramatically simplify building robust search infrastructure directly on top of the PostgreSQL database.</p>
<p>In the rapidly evolving world of AI and large language models (LLMs), Retrieval-Augmented Generation (RAG) and semantic search have become crucial components. However, setting up the necessary vector search infrastructure often involves learning new database technologies, managing complex integrations, or wrestling with intricate frameworks. This adds friction and slows down development, especially for teams already comfortable with PostgreSQL.</p>
<h2 id="heading-the-challenge-hybrid-search-complexity"><strong>The Challenge: Hybrid Search Complexity</strong></h2>
<p>Building search capabilities often means:</p>
<ol>
<li><p><strong>Choosing &amp; Managing a Vector Database:</strong> Evaluating, deploying, and maintaining specialized vector databases (Pinecone, Weaviate, Milvus, etc.) and text search frameworks (ElasticSearch, Solr, etc.) adds operational overhead.</p>
</li>
<li><p><strong>Complex Data Handling:</strong> Managing the synchronization between the source data and the vector representations.</p>
</li>
<li><p><strong>Steep Learning Curves:</strong> Understanding the APIs and abstractions of comprehensive frameworks for a wide range of LLM tasks.</p>
</li>
</ol>
<h2 id="heading-vechord-the-simple-pythonic-solution-built-on-top-of-postgresql"><strong>Vechord: The Simple, Pythonic Solution Built on Top of PostgreSQL</strong></h2>
<p>Vechord tackles these challenges head-on by leveraging the power and extensibility of PostgreSQL, enhanced with the powerful <a target="_blank" href="https://github.com/tensorchord/VectorChord/"><strong>VectorChord</strong></a> and <a target="_blank" href="https://github.com/tensorchord/VectorChord-bm25"><strong>VectorChord-bm25</strong></a> extensions. Our core philosophy is <strong>simplicity and focus</strong>.</p>
<p>Vechord provides a clean, Pythonic interface to:</p>
<ul>
<li><p><strong>Initialize:</strong> Easily configure the table schema with Python structs and annotations.</p>
</li>
<li><p><strong>Ingest Data:</strong> Effortlessly add documents, PDFs, or any other type of data with transformation tools.</p>
</li>
<li><p><strong>Perform Hybrid Search:</strong> Efficiently execute the vector similarity search and keyword search, and rerank the retrieval results with the user-friendly API.</p>
</li>
<li><p><strong>Evaluate Metrics</strong>: Evaluate metrics seamlessly, either against ground truth or with LLM-based scoring.</p>
</li>
<li><p><strong>Makes Simple Tasks Simple</strong>: offers an ORM-like interface to select, insert, and delete records from the PostgreSQL database.</p>
</li>
</ul>
<h2 id="heading-how-is-vechord-different"><strong>How is Vechord Different?</strong></h2>
<ol>
<li><p><strong>Laser Focus on PostgreSQL Vector + Keyword Search:</strong> Vechord concentrates <em>specifically</em> on making the PostgreSQL + VectorChord-suite combination easy to use for search. If your primary goal is streamlined vector search and keyword search within your existing PostgreSQL ecosystem, Vechord offers a leaner, more direct path.</p>
</li>
<li><p><strong>Library, Not a Full Platform:</strong> Vechord is designed as a <em>library</em> – a focused building block that you can integrate into your application code. It gives you the core storage and hybrid search capability on PostgreSQL, leaving the broader application architecture and workflow design entirely up to you.</p>
</li>
<li><p><strong>Leveraging Existing Infrastructure:</strong> The core premise of Vechord is to empower teams already using PostgreSQL. You don't need to introduce and manage a separate, dedicated vector database or document database if your scale and requirements are well-served by the VectorChord suite. This reduces operational complexity and cost.</p>
</li>
<li><p><strong>Simplicity as a Feature:</strong> Vechord prioritizes a minimal API surface and ease of use for its specific task. Vechord aims to get you performing hybrid search on Postgres with minimal boilerplate and cognitive load.</p>
</li>
</ol>
<h2 id="heading-get-started-with-vechord">Get Started with Vechord</h2>
<ul>
<li>Define the table schema</li>
</ul>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> Annotated, Optional
<span class="hljs-keyword">from</span> vechord.spec <span class="hljs-keyword">import</span> Table, Vector, PrimaryKeyAutoIncrease, ForeignKey, Keyword

<span class="hljs-comment"># use 768 dimension vector</span>
DenseVector = Vector[<span class="hljs-number">768</span>]

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Document</span>(<span class="hljs-params">Table, kw_only=True</span>):</span>
    uid: Optional[PrimaryKeyAutoIncrease] = <span class="hljs-literal">None</span>  <span class="hljs-comment"># auto-increase id, no need to set</span>
    link: str = <span class="hljs-string">""</span>
    text: str

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Chunk</span>(<span class="hljs-params">Table, kw_only=True</span>)
    <span class="hljs-title">uid</span>:</span> Optional[PrimaryKeyAutoIncrease] = <span class="hljs-literal">None</span>
    doc_id: Annotated[int, ForeignKey[Document.uid]]  <span class="hljs-comment"># reference to `Document.uid` on DELETE CASCADE</span>
    vec: DenseVector  <span class="hljs-comment"># this comes with a default vector index</span>
    keyword: Keyword  <span class="hljs-comment"># this comes with a default tokenizer and text index</span>
    text: str
</code></pre>
<ul>
<li>Inject the data with a Python decorator</li>
</ul>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> httpx
<span class="hljs-keyword">from</span> vechord.registry <span class="hljs-keyword">import</span> VechordRegistry
<span class="hljs-keyword">from</span> vechord.extract <span class="hljs-keyword">import</span> SimpleExtractor
<span class="hljs-keyword">from</span> vechord.embedding <span class="hljs-keyword">import</span> GeminiDenseEmbedding

vr = VechordRegistry(namespace=<span class="hljs-string">"test"</span>, url=<span class="hljs-string">"postgresql://postgres:postgres@127.0.0.1:5432/"</span>)
<span class="hljs-comment"># ensure the table and index are created if not exists</span>
vr.register([Document, Chunk])
extractor = SimpleExtractor()
emb = GeminiDenseEmbedding()

<span class="hljs-meta">@vr.inject(output=Document)  # dump to the `Document` table</span>
<span class="hljs-comment"># function parameters are free to define since `inject(input=...)` is not set</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">add_document</span>(<span class="hljs-params">url: str</span>) -&gt; Document:</span>  <span class="hljs-comment"># the return type is `Document`</span>
    <span class="hljs-keyword">with</span> httpx.Client() <span class="hljs-keyword">as</span> client:
        resp = client.get(url)
        text = extractor.extract_html(resp.text)
        <span class="hljs-keyword">return</span> Document(link=url, text=text)

<span class="hljs-meta">@vr.inject(input=Document, output=Chunk)  # load from the `Document` table and dump to the `Chunk` table</span>
<span class="hljs-comment"># function parameters are the attributes of the `Document` table, only defined attributes</span>
<span class="hljs-comment"># will be loaded from the `Document` table</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">add_chunk</span>(<span class="hljs-params">uid: int, text: str</span>) -&gt; list[Chunk]:</span>  <span class="hljs-comment"># the return type is `list[Chunk]`</span>
    chunks = text.split(<span class="hljs-string">"\n"</span>)
    <span class="hljs-keyword">return</span> [Chunk(doc_id=uid, vec=emb.vectorize_chunk(t), keyword=Keyword(t), text=t) <span class="hljs-keyword">for</span> t <span class="hljs-keyword">in</span> chunks]

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    add_document(<span class="hljs-string">"https://paulgraham.com/best.html"</span>)  <span class="hljs-comment"># add arguments as usual</span>
    add_chunk()  <span class="hljs-comment"># omit the arguments since the `input` will be loaded from the `Document` table</span>
    vr.insert(Document(text=<span class="hljs-string">"hello world"</span>))  <span class="hljs-comment"># insert manually</span>
    print(vr.select_by(Document.partial_init()))  <span class="hljs-comment"># select all records from the `Document` table</span>
</code></pre>
<ul>
<li>Run several steps in a transaction to guarantee data consistency</li>
</ul>
<pre><code class="lang-python">pipeline = vr.create_pipeline([add_document, add_chunk])
pipeline.run(<span class="hljs-string">"https://paulgraham.com/best.html"</span>)  <span class="hljs-comment"># only accepts the arguments of the first function</span>
</code></pre>
<ul>
<li>Search by the vector and keyword, rerank with the cross-encoder model</li>
</ul>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> vechord.rerank <span class="hljs-keyword">import</span> CohereReranker

reranker = CohereReranker()
text = vr.search_by_vector(Chunk, emb.vectorize_query(<span class="hljs-string">"startup"</span>))
vec = vr.search_by_keyword(Chunk, <span class="hljs-string">"startup"</span>)
chunks = list({chunk.uid: chunk <span class="hljs-keyword">for</span> chunk <span class="hljs-keyword">in</span> text_retrieves + vec_retrievse}.values())
indices = reranker.rerank(query, [chunk.text <span class="hljs-keyword">for</span> chunk <span class="hljs-keyword">in</span> chunks])
print([chunks[i] <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> indices[:topk]])
</code></pre>
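<p>The <strong>Evaluate Metrics</strong> capability plugs into the same flow. As a rough sketch (the <code>vechord.evaluate</code> module path, the <code>topk</code> parameter, and the ground-truth uid are assumptions for illustration):</p>
<pre><code class="lang-python">from vechord.evaluate import BaseEvaluator  # module path assumed

ground_truth_uid = 1  # uid of the chunk known to be relevant (illustrative)
retrieved = vr.search_by_vector(Chunk, emb.vectorize_query("startup"), topk=10)
score = BaseEvaluator.evaluate_one(ground_truth_uid, [chunk.uid for chunk in retrieved])
print(score.get("ndcg"), score.get("map"), score.get("recall_10"))
</code></pre>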
<h2 id="heading-join-the-community"><strong>Join the Community!</strong></h2>
<p>Vechord is open-source and community-driven. We believe it fills a vital gap for developers wanting powerful search capabilities without unnecessary complexity.</p>
<ul>
<li><p><strong>Check out the code on GitHub:</strong> <a target="_blank" href="https://github.com/tensorchord/vechord">https://github.com/tensorchord/vechord</a></p>
</li>
<li><p><strong>Read the documentation:</strong> <a target="_blank" href="https://tensorchord.github.io/vechord/">https://tensorchord.github.io/vechord/</a></p>
</li>
<li><p><strong>Communicate with us on Discord</strong>: <a target="_blank" href="https://discord.gg/KqswhpVgdU">https://discord.gg/KqswhpVgdU</a></p>
</li>
</ul>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://github.com/tensorchord/vechord">https://github.com/tensorchord/vechord</a></div>
]]></content:encoded></item><item><title><![CDATA[Beyond Text: Unlock OCR-Free RAG in PostgreSQL with Modal & VectorChord]]></title><description><![CDATA[Building effective Retrieval-Augmented Generation (RAG) systems for documents often feels like wrestling with messy, complex pipelines. Especially when dealing with PDFs or scanned images, traditional methods rely heavily on Optical Character Recogni...]]></description><link>https://blog.vectorchord.ai/beyond-text-unlock-ocr-free-rag-in-postgresql-with-modal-and-vectorchord</link><guid isPermaLink="true">https://blog.vectorchord.ai/beyond-text-unlock-ocr-free-rag-in-postgresql-with-modal-and-vectorchord</guid><category><![CDATA[OCR ]]></category><category><![CDATA[llm]]></category><category><![CDATA[PostgreSQL]]></category><category><![CDATA[vector database]]></category><category><![CDATA[VectorSearch]]></category><category><![CDATA[search]]></category><category><![CDATA[modal]]></category><dc:creator><![CDATA[xieydd]]></dc:creator><pubDate>Tue, 15 Apr 2025 08:30:43 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1743746483931/ecc229f8-7be3-4cc4-b166-5aecd304fad8.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Building effective Retrieval-Augmented Generation (RAG) systems for documents often feels like wrestling with messy, complex pipelines. Especially when dealing with PDFs or scanned images, traditional methods rely heavily on Optical Character Recognition (OCR) and layout analysis. These steps can be slow, error-prone, and often lose crucial visual context like tables, figures, and formatting. <strong>But what if you could query documents based on <em>how they look</em>, not just the extracted text?</strong></p>
<p>This post is your guide to building exactly that: an <strong>OCR-free RAG system</strong> directly within your familiar PostgreSQL database. We'll leverage the power of the <strong>ColQwen2</strong> Vision Language Model, the efficiency of <strong>VectorChord</strong> for multi-vector search in Postgres, and the scalability of <strong>Modal</strong> for GPU-powered embedding generation. Get ready to simplify your RAG stack and potentially boost your retrieval accuracy, all without complex pre-processing.</p>
<p>We'll cover:</p>
<ul>
<li><p>What ColQwen2 is and why it's a game-changer.</p>
</li>
<li><p>How VectorChord makes advanced vector search possible in Postgres.</p>
</li>
<li><p>A step-by-step tutorial to build and evaluate the system.</p>
</li>
</ul>
<h2 id="heading-what-is-colqwen2-the-power-of-visual-understanding">What is ColQwen2? The Power of Visual Understanding</h2>
<p>To grasp ColQwen2, let's first look at its foundation: <strong>ColPali</strong>. As introduced in the paper "<a target="_blank" href="https://arxiv.org/abs/2407.01449"><strong>ColPali: Efficient Document Retrieval with Vision Language Models</strong></a>", ColPali represents a novel approach using Vision Language Models (VLMs). Instead of relying on imperfect OCR, it directly indexes documents using their rich <strong>visual features</strong> – text, images, tables, layout, everything the eye can see.</p>
<p>Think about the limitations of traditional OCR: complex layouts get mangled, tables become gibberish, and images are often ignored entirely. It's like trying to understand a book by only reading a flawed transcript. ColPali avoids this by using a powerful VLM (originally PaliGemma) to create embeddings that capture the document's holistic visual nature. Two key concepts make it shine:</p>
<ol>
<li><p><strong>Contextualized Vision Embeddings:</strong> Generating rich embeddings directly from the document image using a VLM.</p>
</li>
<li><p><strong>Late Interaction:</strong> This clever technique allows the <em>query's</em> textual meaning to directly interact with the <em>document's</em> detailed visual features <em>at search time</em>. It's not just matching text summaries; it's comparing the query concept against the visual evidence within the document page.</p>
</li>
</ol>
<p><img src="https://cdn-uploads.huggingface.co/production/uploads/60f2e021adf471cbdf8bb660/La8vRJ_dtobqs6WQGKTzB.png" alt /></p>
<blockquote>
<p><em>The ColPali architecture (image from the ColPali paper)</em></p>
</blockquote>
<p><strong>ColQwen2</strong> builds upon this powerful ColPali architecture but swaps the underlying VLM for the more recent <strong>Qwen2-VL-2B</strong>. It generates <a target="_blank" href="https://arxiv.org/abs/2004.12832"><strong>ColBERT</strong>-style</a> multi-vector representations, capturing fine-grained details from both text and images. As seen on the <a target="_blank" href="https://huggingface.co/spaces/vidore/vidore-leaderboard"><strong>ViDoRe leaderboard</strong></a>, ColQwen2 delivers impressive performance at a practical model size.</p>
<h2 id="heading-how-does-vectorchord-enable-colqwen2-in-postgres">How Does VectorChord Enable ColQwen2 in Postgres?</h2>
<p>This is where <strong>VectorChord</strong> becomes the crucial piece of the puzzle, bringing this cutting-edge VLM capability into your PostgreSQL database. ColQwen2 (and ColPali) relies heavily on those <strong>multi-vector representations</strong> and the <strong>Late Interaction</strong> mechanism, specifically requiring an efficient <strong>MaxSim (Maximum Similarity)</strong> operation.</p>
<p>Calculating MaxSim – finding, for each query vector, its highest-similarity match among the document vectors, then summing those maxima – can be computationally brutal, especially across millions of document vectors. VectorChord tackles this head-on:</p>
<ul>
<li><p><strong>Native Multi-Vector Support:</strong> It's designed from the ground up to handle multi-vector data efficiently within Postgres.</p>
</li>
<li><p><strong>Optimized MaxSim:</strong> Drawing inspiration from the <a target="_blank" href="https://arxiv.org/abs/2501.17788"><strong>WARP paper</strong></a>, VectorChord uses techniques like dynamic similarity imputation to dramatically speed up MaxSim calculations, making large-scale visual document retrieval feasible.</p>
</li>
<li><p><strong>Hybrid Search Ready:</strong> Beyond multi-vector, it also supports dense, sparse, and hybrid search (check out our <a target="_blank" href="https://blog.vectorchord.ai/hybrid-search-with-postgres-native-bm25-and-vectorchord"><strong>previous post</strong></a>!).</p>
</li>
<li><p><strong>Scalable &amp; Disk-Friendly:</strong> Designed for performance without demanding excessive resources.</p>
</li>
</ul>
<p>In short, VectorChord transforms PostgreSQL into a powerhouse capable of handling the advanced vector search techniques required by models like ColQwen2 or ColPali.</p>
<h2 id="heading-tutorial-building-your-ocr-free-rag-system">Tutorial: Building Your OCR-Free RAG System</h2>
<p>Alright, theory's great, but let's roll up our sleeves and build this thing! We'll walk through setting up the environment, processing data using Modal for scalable embedding generation, indexing into VectorChord within Postgres, and finally, evaluating our shiny new OCR-free RAG system.</p>
<h3 id="heading-prerequisites">Prerequisites</h3>
<p>Before we start, ensure you have:</p>
<ul>
<li><p>A PostgreSQL instance (Docker recommended) with the VectorChord extension installed OR a <a target="_blank" href="https://cloud.vectorchord.ai/"><strong>VectorChord Cloud</strong></a> cluster.</p>
</li>
<li><p>A <a target="_blank" href="https://www.google.com/url?sa=E&amp;q=https%3A%2F%2Fmodal.com%2F"><strong>Modal</strong></a> account (free tier available). Modal's fast GPU provisioning and scaling are perfect for the embedding generation step. To efficiently process a large volume of documents using the ColQwen2 model, it's crucial to leverage Modal's rapid startup and GPU expansion features. This approach will significantly reduce the time required for local processing.</p>
</li>
</ul>
<p>If you want to reproduce this tutorial quickly, you can use the <code>tensorchord/vchord-suite</code> image, which bundles the extensions that TensorChord provides.</p>
<p>Run the following command to pull the image and start Postgres with VectorChord and VectorChord-BM25:</p>
<pre><code class="lang-bash">docker run   \           
  --name vchord-suite  \
  -e POSTGRES_PASSWORD=postgres  \
  -p 5432:5432 \
  -d tensorchord/vchord-suite:pg17-latest
</code></pre>
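<p>Then connect to the database, for example with <code>psql</code> inside the container, and enable the extensions:</p>
<pre><code class="lang-bash">$ docker exec -it vchord-suite psql -U postgres
</code></pre>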
<pre><code class="lang-pgsql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">EXTENSION</span> <span class="hljs-keyword">IF</span> <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">EXISTS</span> vchord <span class="hljs-keyword">CASCADE</span>;
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">EXTENSION</span> <span class="hljs-keyword">IF</span> <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">EXISTS</span> pg_tokenizer <span class="hljs-keyword">CASCADE</span>;
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">EXTENSION</span> <span class="hljs-keyword">IF</span> <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">EXISTS</span> vchord_bm25 <span class="hljs-keyword">CASCADE</span>;
\dx
pg_tokenizer | <span class="hljs-number">0.1</span><span class="hljs-number">.0</span>   | tokenizer_catalog | pg_tokenizer
vchord       | <span class="hljs-number">0.3</span><span class="hljs-number">.0</span>   | <span class="hljs-built_in">public</span>            | vchord: Vector <span class="hljs-keyword">database</span> plugin <span class="hljs-keyword">for</span> Postgres, written <span class="hljs-keyword">in</span> Rust, specifically designed <span class="hljs-keyword">for</span> LLM
vchord_bm25  | <span class="hljs-number">0.2</span><span class="hljs-number">.0</span>   | bm25_catalog      | vchord_bm25: A postgresql <span class="hljs-keyword">extension</span> <span class="hljs-keyword">for</span> bm25 ranking algorithm
vector       | <span class="hljs-number">0.8</span><span class="hljs-number">.0</span>   | <span class="hljs-built_in">public</span>            | vector data <span class="hljs-keyword">type</span> <span class="hljs-keyword">and</span> ivfflat <span class="hljs-keyword">and</span> hnsw <span class="hljs-keyword">access</span> methods
</code></pre>
<p>Set up Modal:</p>
<pre><code class="lang-bash">$ pip install modal
$ python3 -m modal setup  <span class="hljs-comment"># click the link to authorize</span>
</code></pre>
<h3 id="heading-step-1-load-the-data-using-modal-volumes">Step 1: Load the Data (Using Modal Volumes)</h3>
<p>We'll use the <a target="_blank" href="https://huggingface.co/collections/vidore/vidore-benchmark-667173f98e70a1c0fa4db00d"><strong>ViDoRe Benchmark</strong></a> dataset. To handle this data efficiently across potentially distributed Modal functions, we'll download it to a <a target="_blank" href="https://modal.com/docs/guide/volumes"><strong>Modal Volume</strong></a>. Volumes provide persistent, shared storage ideal for this 'download once, process many times' scenario.</p>
<pre><code class="lang-python">image = modal.Image.debian_slim().pip_install(<span class="hljs-string">"datasets"</span>,<span class="hljs-string">"huggingface_hub"</span>,<span class="hljs-string">"Pillow"</span>)
DATASET_DIR = <span class="hljs-string">"/data"</span>
DATASET_VOLUME  = modal.Volume.from_name(
    <span class="hljs-string">"colpali-dataset"</span>, create_if_missing=<span class="hljs-literal">True</span>
)
app = modal.App(image=image)

<span class="hljs-meta">@app.function(volumes={DATASET_DIR: DATASET_VOLUME}, timeout=3000)</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">download_dataset</span>(<span class="hljs-params">cache=False</span>) -&gt; <span class="hljs-keyword">None</span>:</span>
    <span class="hljs-keyword">from</span> datasets <span class="hljs-keyword">import</span> load_dataset
    <span class="hljs-keyword">from</span> tqdm <span class="hljs-keyword">import</span> tqdm

    # `get_collection_dataset_names` is a helper (not shown here) that lists
    # the dataset names in the Hugging Face collection
    collection_dataset_names = get_collection_dataset_names("vidore/vidore-benchmark-667173f98e70a1c0fa4db00d")
    for dataset_name in tqdm(collection_dataset_names, desc="vidore benchmark dataset(s)"):
        dataset = load_dataset(dataset_name, split="test", num_proc=10)
        # drop duplicate PDF pages that appear with multiple queries
        unique_indices = dataset.to_pandas().drop_duplicates(subset="image_filename", keep="first").index
        dataset = dataset.select(unique_indices)
        dataset.save_to_disk(f"{DATASET_DIR}/{dataset_name}")
</code></pre>
<p>Modal offers a straightforward Python API for interacting with its platform. With <code>modal.Image</code> you set the base image for the Modal app, and the <code>@app.function</code> decorator marks the function to be executed on Modal. To download the dataset to the Modal Volume, execute the following command; Modal will automatically build the image and launch a container to run your function.</p>
<p>Run the download function:</p>
<pre><code class="lang-bash">$ modal run dataset.py::download_dataset
</code></pre>
<p>Modal handles building the environment and running the script in the cloud.</p>
<h3 id="heading-step-2-process-data-amp-generate-embeddings-with-modal-amp-colqwen2">Step 2: Process Data &amp; Generate Embeddings (With Modal &amp; ColQwen2)</h3>
<p>This is the heavy lifting: converting document images into ColQwen2 multi-vector embeddings. Doing this locally for many documents would be slow. Modal shines here:</p>
<ul>
<li><p><strong>Easy GPU Access &amp; Autoscaling</strong>: Launch a ColQwen2 model service on Modal and use it to generate the image and query embeddings for the dataset. We implemented an HTTP service instead of using the SDK directly to ensure seamless auto-scaling during large-scale embedding runs. If you want to deploy the embedding service as a persistent web endpoint, change <code>modal.cls</code> to <code>modal.web_server</code> for ColPaliServer and run <code>modal deploy</code>.</p>
</li>
<li><p><strong>Recovery:</strong> We checkpoint progress to a separate Modal Volume, allowing us to resume embedding generation if interrupted.</p>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-comment"># embedding.py (Illustrative - keep original code)</span>
<span class="hljs-comment"># ... imports, volumes setup ...</span>
modal_app = modal.App() <span class="hljs-comment"># Define app</span>

<span class="hljs-comment"># Function to coordinate embedding generation</span>
<span class="hljs-meta">@modal_app.function(...)</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">embed_dataset</span>(<span class="hljs-params">down_scale: float = <span class="hljs-number">1.0</span>, batch_size: int = BATCH_SIZE</span>):</span>
    <span class="hljs-comment"># ... (logic for loading dataset names, handling checkpoints) ...</span>
    colpali_server = ColPaliServer() <span class="hljs-comment"># Get handle to our GPU class</span>
    <span class="hljs-comment"># ... (loop through datasets, batch items, call server for embeddings) ...</span>
    <span class="hljs-comment"># ... (save embeddings and update checkpoint) ...</span>
    print(<span class="hljs-string">"Embedding generation complete."</span>)

<span class="hljs-comment"># server.py (Illustrative - keep original code)</span>
<span class="hljs-comment"># Class running on GPU to serve embedding requests</span>
<span class="hljs-meta">@modal_app.cls(gpu=GPU_CONFIG, ...)</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">ColPaliServer</span>:</span>
<span class="hljs-meta">    @modal.enter()</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load_model_and_start_server</span>(<span class="hljs-params">self</span>):</span>
        <span class="hljs-comment"># ... (Load ColQwen2 model, start internal FastAPI server) ...</span>
        self.client = httpx.AsyncClient(...) <span class="hljs-comment"># Client to talk to internal server</span>

<span class="hljs-meta">    @modal.exit()</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">shutdown_server</span>(<span class="hljs-params">self</span>):</span>
        <span class="hljs-comment"># ... (Cleanup) ...</span>

    <span class="hljs-comment"># Method called by embed_dataset function</span>
<span class="hljs-meta">    @modal.method()</span>
    <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">embed_images</span>(<span class="hljs-params">self, images: List[str]</span>) -&gt; np.ndarray:</span>
        <span class="hljs-comment"># ... (Prepare batch, send request to internal server via self.client) ...</span>
        <span class="hljs-comment"># ... (Decode response using msgspec for speed) ...</span>
        <span class="hljs-keyword">return</span> embeddings_numpy_array

    <span class="hljs-comment"># ... (Potentially add embed_queries method too) ...</span>

<span class="hljs-comment"># colpali.py</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">ColPaliModel</span>:</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, model_name: str = <span class="hljs-string">"vidore/colqwen2-v1.0"</span>, cache_dir: str=<span class="hljs-string">"/model"</span></span>):</span>
        <span class="hljs-comment"># ...</span>
        <span class="hljs-keyword">if</span> self.model_name == <span class="hljs-string">"vidore/colqwen2-v1.0"</span>:
            <span class="hljs-keyword">from</span> colpali_engine.models <span class="hljs-keyword">import</span> ColQwen2, ColQwen2Processor
            <span class="hljs-keyword">from</span> transformers.utils.import_utils <span class="hljs-keyword">import</span> is_flash_attn_2_available

            <span class="hljs-comment"># load model</span>
            model = ColQwen2.from_pretrained(
                self.model_name,
                torch_dtype=torch.bfloat16,
                device_map=<span class="hljs-string">"cuda:0"</span>, 
                attn_implementation=<span class="hljs-string">"flash_attention_2"</span> <span class="hljs-keyword">if</span> is_flash_attn_2_available() <span class="hljs-keyword">else</span> <span class="hljs-literal">None</span>,
                cache_dir=self.cache_dir,
            ).eval()

            colpali_processor = ColQwen2Processor.from_pretrained(
                self.model_name,
                cache_dir=self.cache_dir,
            )
    <span class="hljs-comment"># ... functions for embedding images and queries</span>
    <span class="hljs-comment"># @modal.method()</span>
    <span class="hljs-comment"># async def batch_embed_images(self, images_base64: List[str]) -&gt; np.ndarray:</span>
    <span class="hljs-comment"># @modal.method()</span>
    <span class="hljs-comment"># async def batch_embed_queries(self, queries: List[str]) -&gt; np.ndarray:</span>
</code></pre>
<p>Generate the embeddings:</p>
<pre><code class="lang-bash">$ modal run embedding.py::embed_dataset
</code></pre>
<h3 id="heading-step-3-create-index-in-vectorchord-using-vechord-sdk">Step 3: Create Index in VectorChord (Using Vechord SDK)</h3>
<p>With the embeddings generated and saved in a Modal Volume, we first need to download them to the local machine.</p>
<pre><code class="lang-bash">$ modal volume get colpali-embedding-checkpoint /path/to/<span class="hljs-built_in">local</span>/vidore_embeddings
</code></pre>
<p>Now, we'll use the <strong>vechord</strong> SDK to load these embeddings into our PostgreSQL database and create the necessary multi-vector index. vechord provides a Pythonic, ORM-like way to interact with VectorChord in Postgres.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://github.com/tensorchord/vechord">https://github.com/tensorchord/vechord</a></div>
<p> </p>
<pre><code class="lang-python">MultiVector = List[Vector[<span class="hljs-number">128</span>]] <span class="hljs-comment"># Assuming 128 dimensions for ColQwen2</span>

lists = <span class="hljs-number">2500</span> <span class="hljs-comment"># lists is the number of the cluster</span>
<span class="hljs-comment"># Define the database table schema using vechord</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Image</span>(<span class="hljs-params">Table, kw_only=True</span>):</span>
    uid: Optional[PrimaryKeyAutoIncrease] = <span class="hljs-literal">None</span>
    image_embedding: Annotated[MultiVector, MultiVectorIndex(lists=lists)] <span class="hljs-comment"># Stores the document's visual embeddings </span>
    query_embedding: Annotated[MultiVector, MultiVectorIndex(lists=lists)]  <span class="hljs-comment"># Stores the query's embeddings </span>
    query: str = <span class="hljs-literal">None</span>
    dataset: Optional[str] = <span class="hljs-literal">None</span>
    dataset_id: Optional[int] = <span class="hljs-literal">None</span>

<span class="hljs-comment"># Connect to your PostgreSQL database</span>
<span class="hljs-comment"># Ensure the URL points to your local Docker or VectorChord Cloud instance</span>
DB_URL = <span class="hljs-string">"postgresql://postgres:postgres@127.0.0.1:5432/postgres"</span> <span class="hljs-comment"># Default DB</span>
vr = VechordRegistry(<span class="hljs-string">"colpali"</span>, DB_URL)
vr.register([Image]) <span class="hljs-comment"># Creates the table and multi-vector index if they don't exist</span>

<span class="hljs-comment"># Function to load embeddings from disk and yield Image objects</span>
<span class="hljs-meta">@vr.inject(output=Image)</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load_image_embeddings</span>(<span class="hljs-params">path: str</span>) -&gt; Iterator[Image]:</span>
    <span class="hljs-comment"># ... (logic to load numpy arrays from disk, convert to List[Vector]) ...</span>
    <span class="hljs-comment"># ... (yield Image(...) instances) ...</span>
    print(<span class="hljs-string">f"Loaded embeddings from <span class="hljs-subst">{path}</span>"</span>)

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    <span class="hljs-comment"># Path where you downloaded the embeddings from Modal</span>
    embedding_dir = <span class="hljs-string">"/path/to/local/vidore_embeddings"</span>
    load_image_embeddings(embedding_dir) <span class="hljs-comment"># This triggers vechord to insert data</span>
    print(<span class="hljs-string">"Data loaded and indexed into VectorChord."</span>)
</code></pre>
<p>Run the indexer script; vechord handles table creation, index creation, and data insertion.</p>
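<p>For example, assuming the script above is saved as <code>indexer.py</code> (the filename is illustrative):</p>
<pre><code class="lang-bash">$ python indexer.py
</code></pre>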
<h3 id="heading-step-4-evaluation-does-it-work">Step 4: Evaluation - Does it Work?</h3>
<p>The final step is crucial: evaluating retrieval performance. Does this OCR-free system deliver accurate results, and can it be fast? We'll use vechord to run queries against our indexed data and look at NDCG@10 and Recall@10. We'll also test the impact of VectorChord's <strong>WARP optimization</strong>, which accelerates MaxSim calculations. In this tutorial, we will use the <a target="_blank" href="https://huggingface.co/datasets/vidore/arxivqa_test_subsampled">vidore/arxivqa_test_subsampled</a> dataset for evaluation queries; a sample of the data is shown below.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1743483589729/4bc711ae-666c-4bf8-ac3c-63be23ce3071.png" alt class="image--center mx-auto" /></p>
<pre><code class="lang-python">TOP_K = <span class="hljs-number">10</span>

<span class="hljs-comment"># Define structure for results (optional but good practice)</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Evaluation</span>(<span class="hljs-params">msgspec.Struct</span>):</span>
    map: float
    ndcg: float
    recall: float

TOP_K = <span class="hljs-number">10</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">evaluate</span>(<span class="hljs-params">queries: list[Image], probes: int, max_maxsim_tuples: int</span>) -&gt; list[Evaluation]:</span>
    result  = []
    <span class="hljs-keyword">for</span> query <span class="hljs-keyword">in</span> queries:
        vector = query.query_embedding
        docs: list[Image] = vr.search_by_multivec(
            Image, vector, topk=TOP_K, probe=probes, max_maxsim_tuples=max_maxsim_tuples
        )
        score = BaseEvaluator.evaluate_one(query.uid, [doc.uid <span class="hljs-keyword">for</span> doc <span class="hljs-keyword">in</span> docs])
        result.append(Evaluation(
            map=score.get(<span class="hljs-string">"map"</span>),
            ndcg=score.get(<span class="hljs-string">"ndcg"</span>),
            recall=score.get(<span class="hljs-string">f"recall_<span class="hljs-subst">{TOP_K}</span>"</span>),
        ))
    <span class="hljs-keyword">return</span> result

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    <span class="hljs-comment"># Select some queries from the benchmark dataset stored in the DB</span>
    <span class="hljs-comment"># Example: Get 100 queries from a specific subset</span>
    test_queries: list[Image] = vr.select_by(
        Image.partial_init(dataset=<span class="hljs-string">"vidore/arxivqa_test_subsampled"</span>), limit=<span class="hljs-number">100</span>
    )

    <span class="hljs-comment"># ... optimize the param ...</span>
    res: list[Evaluation] = evaluate(queries, probes=probes, maxsim_refine=maxsim_refine)
    print(<span class="hljs-string">"ndcg@10"</span>, sum(r.ndcg <span class="hljs-keyword">for</span> r <span class="hljs-keyword">in</span> res) / len(res))
    print(<span class="hljs-string">"recall@10"</span>, sum(r.recall <span class="hljs-keyword">for</span> r <span class="hljs-keyword">in</span> res) / len(res))
    print(<span class="hljs-string">f"Total execution time: <span class="hljs-subst">{total_time:<span class="hljs-number">.4</span>f}</span> seconds"</span>)
</code></pre>
<p>Running the evaluation script yields impressive results:</p>
<pre><code class="lang-markdown"><span class="hljs-section"># Disable WARP</span>
ndcg@10 0.8615
recall@10 0.92
Total execution time: 810 seconds # Baseline accuracy and time

<span class="hljs-section"># Enable WARP</span>
<span class="hljs-section"># There is no need to focus on specific times, </span>
<span class="hljs-section"># only relative times need to be taken into account, </span>
<span class="hljs-section"># as the tests were performed on a local model with poor performance.</span>
ndcg@10 0.8353
recall@10 0.90
Total execution time: 41 seconds # Dramatic WARP Speed Boost
</code></pre>
<p><strong>Analysis:</strong></p>
<ul>
<li><p><strong>High Baseline Accuracy:</strong> First, let's look at the baseline performance without WARP optimization. The system achieves an excellent NDCG@10 of <strong>0.8615</strong> and Recall@10 of <strong>0.92</strong>. This confirms the fundamental effectiveness of ColQwen2 embeddings paired with VectorChord's precise MaxSim search, delivering state-of-the-art results for visual document retrieval <em>even without</em> speed optimizations.</p>
</li>
<li><p><strong>Dramatic WARP Speed Boost:</strong> Now, observe the impact of enabling VectorChord's <strong>WARP optimization</strong>. The total execution time plummets from a substantial 810 seconds down to just <strong>41 seconds</strong>! This represents a massive <strong>~19.8x speedup</strong> for the evaluation query set. This clearly demonstrates WARP's power in dramatically accelerating the computationally intensive MaxSim operations required for late-interaction models like ColQwen2.</p>
</li>
<li><p><strong>Minimal Accuracy Trade-off:</strong> Impressively, this significant speed gain comes at the cost of only a very minor dip in retrieval accuracy. The NDCG@10 slightly decreases to <strong>0.8353</strong> (a difference of about 0.026), and Recall@10 reduces minimally to <strong>0.9</strong> (a difference of only 0.02).</p>
</li>
</ul>
<h2 id="heading-conclusion-visual-rag-made-simpler-in-postgres">Conclusion: Visual RAG Made Simpler in Postgres</h2>
<p>In this tutorial, we successfully built a high-performance, <strong>OCR-free RAG system</strong> by bringing together the visual understanding capabilities of <strong>ColQwen2</strong>, the scalable multi-vector search of <strong>VectorChord</strong> within <strong>PostgreSQL</strong>, and the efficient GPU processing of <strong>Modal</strong>. We saw how this stack allows us to directly query documents based on their visual content, bypassing the pitfalls of traditional OCR pipelines.</p>
<p>The ability to seamlessly integrate this visual dimension into your RAG system directly within Postgres opens up exciting possibilities for applications dealing with visually rich documents like scientific papers, invoices, product manuals, historical archives, and much more.</p>
<p>Ready to try it yourself?</p>
<ul>
<li><p><strong>Explore the Code:</strong> <a target="_blank" href="https://github.com/xieydd/vectorchord-colqwen2">https://github.com/xieydd/vectorchord-colqwen2</a></p>
</li>
<li><p><strong>Dive Deeper into VectorChord:</strong> Check out the <a target="_blank" href="https://docs.vectorchord.ai/"><strong>VectorChord documentation</strong></a> or try the hassle-free <a target="_blank" href="https://cloud.vectorchord.ai/"><strong>VectorChord Cloud</strong></a>.</p>
</li>
<li><p><strong>Experiment:</strong> Try different VLMs or datasets.</p>
</li>
<li><p><strong>Share Your Thoughts:</strong> Let us know your experiences or questions in the comments below!</p>
</li>
</ul>
<p>This approach represents a significant step towards simpler, more robust, and visually-aware document retrieval systems.</p>
<h2 id="heading-references">References</h2>
<ul>
<li><p><a target="_blank" href="https://huggingface.co/vidore/colqwen2-v1.0">https://huggingface.co/vidore/colqwen2-v1.0</a></p>
</li>
<li><p><a target="_blank" href="https://huggingface.co/blog/manu/colpali">https://huggingface.co/blog/manu/colpali</a></p>
</li>
<li><p><a target="_blank" href="https://blog.vespa.ai/scaling-colpali-to-billions/">https://blog.vespa.ai/scaling-colpali-to-billions/</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[VectorChord 0.3: Bringing Efficient Multi-Vector Contextual Late Interaction in PostgreSQL]]></title><description><![CDATA[We're thrilled to announce the release of VectorChord 0.3, a major milestone that significantly boosts the performance and applicability of advanced vector search techniques directly within your Postgres database! Building on our 0.2 release, which b...]]></description><link>https://blog.vectorchord.ai/vectorchord-03-bringing-efficient-multi-vector-contextual-late-interaction-in-postgresql</link><guid isPermaLink="true">https://blog.vectorchord.ai/vectorchord-03-bringing-efficient-multi-vector-contextual-late-interaction-in-postgresql</guid><category><![CDATA[multivector]]></category><category><![CDATA[maxsim]]></category><category><![CDATA[VectorSearch]]></category><category><![CDATA[colbert]]></category><category><![CDATA[rabitq]]></category><category><![CDATA[colpali]]></category><category><![CDATA[#multimodalai]]></category><category><![CDATA[OCR ]]></category><category><![CDATA[PostgreSQL]]></category><category><![CDATA[postgres]]></category><category><![CDATA[vector database]]></category><category><![CDATA[vector embeddings]]></category><category><![CDATA[AI]]></category><dc:creator><![CDATA[Keming]]></dc:creator><pubDate>Sun, 13 Apr 2025 06:26:44 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1743741003590/7fda29d7-3626-4b44-a365-83da64be76b4.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We're thrilled to announce the release of <strong>VectorChord 0.3</strong>, a major milestone that significantly boosts the performance and applicability of advanced vector search techniques directly within your Postgres database! Building on our 0.2 release, which brought ARM support and faster indexing, version 0.3 tackles one of the biggest hurdles in modern retrieval: <strong>efficient multi-vector search with late interaction</strong>.</p>
<h3 id="heading-beyond-single-vectors-why-multi-vector-rocks">Beyond Single Vectors: Why Multi-Vector Rocks</h3>
<p>For years, vector search primarily represented entire documents or queries as single, dense vectors. While powerful, this approach involves averaging or compressing complex information into one representation, inevitably leading to a loss of nuance. Imagine trying to capture the entire meaning of a detailed technical document in a single sentence – you'd lose specifics!</p>
<p>Enter <strong>multi-vector representations</strong> and <strong>late interaction</strong>, pioneered by models like <a target="_blank" href="https://github.com/stanford-futuredata/ColBERT"><strong>ColBERT</strong></a>. Instead of one vector per item, these models generate <em>multiple</em> vectors – often one for each token (word or sub-word). The real magic happens during the search process:</p>
<ol>
<li><p><strong>Query Encoding:</strong> Your search query is also broken down into multiple token vectors (let's call them q1, q2, ..., qN).</p>
</li>
<li><p><strong>Document Encoding:</strong> Similarly, the document is represented by its token vectors (d1, d2, ... dM).</p>
</li>
<li><p><strong>Late Interaction &amp; MaxSim Aggregation:</strong> Instead of one comparison, we perform fine-grained matching. The core idea is the <strong>Maximum Similarity (MaxSim)</strong> operation. For <em>each query token vector</em> (like q_i), we find the document token vector (d_j) that has the <em>highest similarity</em> (e.g., cosine similarity or dot product) to it across <em>all</em> document tokens (d1 through dM). This process is repeated for every query token (q1, q2, ..., qN). The final relevance score for the document is then calculated by <strong>summing up these maximum similarity values obtained for each query token</strong>. This operation is inherently <strong>asymmetric</strong>, focusing on how well each part of the query is represented somewhere in the document. A minimal sketch of this scoring rule follows the list.</p>
</li>
</ol>
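<p>To make the scoring rule concrete, here is a minimal brute-force sketch in Python with numpy (illustrative only; the real engine indexes the document vectors rather than scanning them):</p>
<pre><code class="lang-python">import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """query_vecs: (N, d), doc_vecs: (M, d); rows are assumed L2-normalized,
    so the dot product equals cosine similarity."""
    sim = query_vecs @ doc_vecs.T        # (N, M) token-to-token similarities
    return float(sim.max(axis=1).sum())  # best document token per query token, summed
</code></pre>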
<p><strong>Why is this better?</strong></p>
<ul>
<li><p><strong>Contextual Nuance:</strong> Late interaction with MaxSim allows the model to capture fine-grained semantic relationships. It can identify if <em>specific important terms</em> from the query strongly match <em>specific parts</em> of the document, rather than relying on a potentially diluted overall average similarity.</p>
</li>
<li><p><strong>Improved Relevance:</strong> By considering focused, token-level interactions and summing the best matches for each query term, models like ColBERT achieve state-of-the-art retrieval quality. This fine-grained approach is particularly powerful for content-rich queries with multiple terms, leading to a better understanding of user intent. On the <a target="_blank" href="https://huggingface.co/answerdotai/ModernBERT-base#base-models">BEIR benchmark</a>, this translates to a significant performance boost: with the same <a target="_blank" href="https://huggingface.co/answerdotai/ModernBERT-base">ModernBERT</a> model, the ColBERT variant achieves 51.6 NDCG@10 compared to 41.6 for the dense-vector variant.</p>
</li>
</ul>
<p>Our previous blog post explored using ColBERT for reranking, showcasing its power:<br /><a target="_blank" href="https://blog.vectorchord.ai/supercharge-vector-search-with-colbert-rerank-in-postgresql"><strong>Supercharge Vector Search with ColBERT Rerank in PostgreSQL</strong></a></p>
<h3 id="heading-vectorchord-03-bringing-efficient-maxsim-to-postgres-inspired-by-warp">VectorChord 0.3: Bringing Efficient MaxSim to Postgres, Inspired by WARP</h3>
<p>VectorChord 0.3 directly confronts the <strong>high computational cost of late interaction</strong>. We've integrated a highly optimized <strong>multi-vector late-interaction MaxSim operator and index</strong> into our core Rust engine, drawing inspiration from the groundbreaking <strong>WARP engine</strong> (<a target="_blank" href="https://arxiv.org/abs/2501.17788">Paper: WARP: An Efficient Engine for Multi-Vector Retrieval</a>, <a target="_blank" href="https://github.com/jlscheerer/xtr-warp">Code: jlscheerer/xtr-warp</a>).</p>
<p>The genius of the WARP approach lies in a fundamental insight: the complex multi-vector MaxSim calculation (comparing N query vectors to M document vectors) can be cleverly <strong>decomposed into multiple, independent single-vector search processes</strong>. Instead of one massive N x M comparison, WARP effectively performs N separate searches, one for each query token vector, against the indexed document vectors to find its best match.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1743730508359/1040ee64-6dd1-429c-bc6b-95184d7cf9dc.png" alt class="image--center mx-auto" /></p>
<p>Our core improvement in VectorChord 0.3's MaxSim scanner is built on this concept. We've implemented MaxSim by <strong>leveraging and orchestrating VectorChord's existing, highly optimized single-vector search infrastructure</strong>. When a MaxSim query is executed with an index, VectorChord performs <strong>multiple single-vector searches</strong>—one for each query vector—using our underlying index structures (<strong>IVF combined with RaBitQ</strong>). It then efficiently aggregates the results based on the MaxSim summation rule. For estimating missing values, we adopt the approach from WARP, using the <strong>distance to the centroid and the cumulative cluster size</strong> to estimate MaxSim scores for potential candidates.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1743733841309/79f715b2-bbf3-4ac4-bb90-b7ffdea07f6f.png" alt class="image--center mx-auto" /></p>
<p>Reusing our optimized single-vector search components ensures MaxSim benefits from existing performance tuning and stability. Importantly for users, it provides a <strong>seamless experience</strong>: you can leverage <strong>the same familiar index types</strong> (IVF with RaBitQ) for both single-vector and multi-vector MaxSim search, without needing a separate system. VectorChord 0.3 proudly stands as the <strong>first Postgres extension to deliver this efficient, decomposed MaxSim implementation</strong>, making state-of-the-art multi-vector retrieval practical and performant directly within your database.</p>
<h3 id="heading-powerful-use-cases-unlocked">Powerful Use Cases Unlocked</h3>
<p>This new efficiency opens the door to practical implementations of cutting-edge techniques:</p>
<ol>
<li><p><strong>High-Performance ColBERT Reranking:</strong> Apply ColBERT's superior relevance ranking to a candidate set retrieved by a faster first-stage search (like traditional vector search or keyword search) without incurring prohibitive latency penalties. Get the best of both speed and quality.</p>
</li>
<li><p><strong>OCR-Free Document Search (ColPali/ColQwen):</strong> Imagine searching directly within scanned documents, PDFs, or images <em>without</em> needing a separate, often error-prone, OCR (Optical Character Recognition) step. Models like ColPali or ColQwen generate token embeddings directly from image patches. With VectorChord 0.3's efficient MaxSim, you can now perform late-interaction search over these visual token embeddings directly in Postgres, enabling powerful search over documents previously inaccessible to pure text-based methods.</p>
</li>
</ol>
<p>Stay tuned! We have another article coming soon, diving deep into how VectorChord 0.3 enables revolutionary OCR-free RAG pipelines.</p>
<h3 id="heading-performance-highlights">Performance Highlights</h3>
<p>We put VectorChord 0.3's new multi-vector MaxSim capabilities to the test on the <strong>FiQA (Financial Opinion Mining and Question Answering) dataset</strong>. This standard benchmark includes 57,000 documents with approximately 15 million cumulative tokens.</p>
<p>Our initial benchmark results, using <strong>ColBERTv2</strong> powered by VectorChord's optimized engine, show promising performance:</p>
<ul>
<li><p><strong>Relevance:</strong> We achieved an <strong>NDCG@10 score of 34.1</strong>. This compares favorably to the 33.6 NDCG@10 reported for the same dataset in the original WARP paper.</p>
</li>
<li><p><strong>Speed:</strong> Queries were executed efficiently, averaging just <strong>35 milliseconds per query</strong>.</p>
</li>
</ul>
<p>While these are preliminary results based on a single dataset as we continue optimization, they demonstrate VectorChord 0.3's potential. Users can now leverage the advanced relevance capabilities of multi-vector search and late interaction with impressive speed, approaching the latency often associated with simpler single-vector methods, all directly within their Postgres database.</p>
<h3 id="heading-get-started-with-vectorchord-03">Get started with VectorChord 0.3</h3>
<p><strong>Prerequisites:</strong></p>
<ul>
<li>PostgreSQL server with VectorChord v0.3 installed</li>
</ul>
<p>You can use our prebuilt <code>vchord-postgres</code> image:</p>
<pre><code class="lang-bash">docker run --rm --name vchord_db -d -e POSTGRES_PASSWORD=postgres -p 5432:5432 \
    ghcr.io/tensorchord/vchord-postgres:pg17-v0.3.0
</code></pre>
<h4 id="heading-step-1-create-a-table-for-multi-vector-data">Step 1: Create a Table for Multi-Vector Data</h4>
<p>First, define a table to store your data. The key difference is using the array type for your vector column (<code>vector[]</code>) to hold multiple vectors per row.</p>
<pre><code class="lang-pgsql"><span class="hljs-comment">-- Define a table to store items (e.g., documents), each potentially having multiple vectors.</span>
<span class="hljs-comment">-- Replace '128' with your actual vector dimensionality.</span>
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> doc (
    id <span class="hljs-type">SERIAL</span> <span class="hljs-keyword">PRIMARY KEY</span>, <span class="hljs-comment">-- A unique identifier for each item</span>
    vecs vector(<span class="hljs-number">128</span>)[] <span class="hljs-comment">-- The column storing an ARRAY of 128-dimensional vectors</span>
);
</code></pre>
<h4 id="heading-step-2-insert-multi-vector-data">Step 2: Insert Multi-Vector Data</h4>
<p>Insert data into your table. The vecs column takes a PostgreSQL array containing vector types.</p>
<pre><code class="lang-pgsql"><span class="hljs-comment">-- Insert sample data: one document with 2 vectors, another with 3.</span>
<span class="hljs-comment">-- Ensure vector dimensions match your table definition (128 in this example).</span>
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> doc (id, vecs) <span class="hljs-keyword">VALUES</span>
    (<span class="hljs-number">1</span>, <span class="hljs-keyword">array</span>[<span class="hljs-keyword">array</span>[<span class="hljs-number">0.1</span>, <span class="hljs-number">0.2</span>, ..., <span class="hljs-number">0.9</span>]::vector, <span class="hljs-keyword">array</span>[<span class="hljs-number">0.8</span>, <span class="hljs-number">0.7</span>, ..., <span class="hljs-number">0.1</span>]::vector]),
    (<span class="hljs-number">2</span>, <span class="hljs-keyword">array</span>[<span class="hljs-keyword">array</span>[<span class="hljs-number">0.5</span>, <span class="hljs-number">0.5</span>, ..., <span class="hljs-number">0.5</span>]::vector, <span class="hljs-keyword">array</span>[<span class="hljs-number">0.3</span>, <span class="hljs-number">0.4</span>, ..., <span class="hljs-number">0.7</span>]::vector, <span class="hljs-keyword">array</span>[<span class="hljs-number">0.9</span>, <span class="hljs-number">0.1</span>, ..., <span class="hljs-number">0.4</span>]::vector]);
<span class="hljs-comment">-- Add more data as needed...</span>
</code></pre>
<h4 id="heading-step-3-create-a-vchordrq-index-with-maxsim-support">Step 3: Create a <code>vchordrq</code> Index with MaxSim Support</h4>
<p>To accelerate MaxSim searches, create an index using the vchordrq method and specify the <code>vector_maxsim_ops</code> operator class. A crucial parameter here is <code>build.internal.lists</code>.</p>
<ul>
<li><strong>Calculate</strong> <code>n</code>: Estimate the total number of individual vectors across your entire dataset.</li>
</ul>
<p>$$n=N_{doc}*\text{avg}(N_{vector\_per\_doc})$$</p><ul>
<li><strong>Set</strong> <code>lists</code>: The recommended range for <code>build.internal.lists</code> is</li>
</ul>
<p>$$4 * \sqrt n &lt; \text{lists} &lt; 8 * \sqrt n$$</p><p>Choose a value within this range (often powers of 2 work well).</p>
<pre><code class="lang-pgsql"><span class="hljs-comment">-- Example Calculation:</span>
<span class="hljs-comment">-- If you have 1,000,000 documents (rows) with an average of 5 vectors each:</span>
<span class="hljs-comment">-- n = 1,000,000 * 5 = 5,000,000 sqrt(n) ≈ 2236 </span>
<span class="hljs-comment">-- Lower bound: 4 * 2236 ≈ 8944</span>
<span class="hljs-comment">-- Upper bound: 8 * 2236 ≈ 17888</span>
<span class="hljs-comment">-- A good value for 'lists' could be 16384 (power of 2).</span>
<span class="hljs-comment">-- Assume the K-means clusters are balanced, each will have about 305 vectors.</span>

<span class="hljs-comment">-- Create the index using vchordrq and vector_maxsim_ops</span>
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">INDEX</span> doc_vecs_idx <span class="hljs-keyword">ON</span> doc <span class="hljs-keyword">USING</span> vchordrq (vecs vector_maxsim_ops)
<span class="hljs-keyword">WITH</span> (<span class="hljs-keyword">options</span> = $$<span class="pgsql">
build.internal.lists = [<span class="hljs-number">16384</span>] <span class="hljs-comment">-- Adjust this value based on your calculation!</span>
$$</span>);
</code></pre>
<h4 id="heading-step-4-configure-search-parameters-optional-but-recommended">Step 4: Configure Search Parameters (Optional but Recommended)</h4>
<p>Before querying, you can tune runtime parameters for performance and accuracy:</p>
<ul>
<li><p><code>vchordrq.probes</code>: Controls how many index lists (clusters) are checked during a search. Higher values increase accuracy (recall) but slow down the search. A common starting point for finding the top 10 results (<code>LIMIT 10</code>) is 32.</p>
</li>
<li><p><code>vchordrq.maxsim_refine</code>: Limits the number of vector pairs re-computed at full precision (otherwise the bit-level distance is used) for each query token vector. It is related to <code>probes</code>; a value around 20% of the probed vectors is usually sufficient.</p>
</li>
</ul>
<pre><code class="lang-pgsql"><span class="hljs-comment">-- Set runtime parameters for the current session/transaction</span>
<span class="hljs-keyword">SET</span> vchordrq.probes = <span class="hljs-number">32</span>; <span class="hljs-comment">-- Adjust based on desired recall vs. speed trade-off</span>
<span class="hljs-keyword">SET</span> vchordrq.maxsim_refine = <span class="hljs-number">2000</span>; <span class="hljs-comment">-- Adjust based on desired recall vs. speed trade-off</span>
</code></pre>
<h4 id="heading-step-5-perform-a-maxsim-similarity-search">Step 5: Perform a MaxSim Similarity Search</h4>
<p>Now you can query using the <code>@#</code> MaxSim operator. Provide your query vectors as a PostgreSQL array of vector types.</p>
<pre><code class="lang-pgsql"><span class="hljs-comment">-- Find the top 10 documents most similar to the given set of query vectors</span>
<span class="hljs-keyword">SELECT</span> id <span class="hljs-keyword">FROM</span> doc 
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> vecs @# <span class="hljs-keyword">ARRAY</span>[<span class="hljs-keyword">array</span>[<span class="hljs-number">0.4</span>, <span class="hljs-number">0.1</span>, ..., <span class="hljs-number">0.8</span>]::vector, <span class="hljs-keyword">array</span>[<span class="hljs-number">0.7</span>, <span class="hljs-number">0.2</span>, ..., <span class="hljs-number">0.3</span>]::vector]
<span class="hljs-keyword">LIMIT</span> <span class="hljs-number">10</span>;
</code></pre>
<p>You've now successfully set up a table, indexed it for multi-vector MaxSim search, and executed your first query using VectorChord 0.3's MaxSim operator! Experiment with the <code>vchordrq.probes</code> and <code>vchordrq.maxsim_refine</code> parameters to find the best balance of speed and accuracy for your specific use case.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://github.com/tensorchord/VectorChord/">https://github.com/tensorchord/VectorChord/</a></div>
]]></content:encoded></item><item><title><![CDATA[3 Billion Vectors in PostgreSQL to Protect the Earth]]></title><description><![CDATA[Monitoring the health of our planet is one of the most critical challenges of our time. Our planet faces immense environmental pressures, and understanding these changes requires sifting through an unimaginable amount of satellite data – petabytes of...]]></description><link>https://blog.vectorchord.ai/3-billion-vectors-in-postgresql-to-protect-the-earth</link><guid isPermaLink="true">https://blog.vectorchord.ai/3-billion-vectors-in-postgresql-to-protect-the-earth</guid><category><![CDATA[PostgreSQL]]></category><category><![CDATA[vector database]]></category><category><![CDATA[VectorSearch]]></category><category><![CDATA[AI]]></category><category><![CDATA[search]]></category><category><![CDATA[llm]]></category><dc:creator><![CDATA[Jinjing Zhou]]></dc:creator><pubDate>Tue, 08 Apr 2025 02:35:31 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1743725578585/d6debcbb-4977-4a78-9857-1291755f162a.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Monitoring the health of our planet is one of the most critical challenges of our time. Our planet faces immense environmental pressures, and understanding these changes requires sifting through an unimaginable amount of satellite data – petabytes of pixels captured constantly. From tracking deforestation in the Amazon to identifying illegal mining operations in remote regions or monitoring agricultural impacts on sensitive ecosystems, turning this data into actionable intelligence is paramount.</p>
<p>Earth Genome's <a target="_blank" href="https://www.earthgenome.org/earth-index"><strong>Earth Index</strong></a> platform is tackling this challenge head-on. It now offers global coverage, making the entire land surface of our planet searchable using the power of AI. This isn't just a technical achievement; it's a new tool designed to empower environmental stewardship by providing accessible, actionable intelligence from satellite data.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.earthgenome.org/">https://www.earthgenome.org/</a></div>
<p> </p>
<p>We at <a target="_blank" href="https://github.com/tensorchord/VectorChord">VectorChord</a> are incredibly proud that our PostgreSQL vector search extension is playing a crucial role in powering this ambitious and vital mission.</p>
<h2 id="heading-what-is-earth-index-doing-turning-pixels-into-planetary-insights">What is Earth Index Doing? Turning Pixels into Planetary Insights</h2>
<p>Imagine finding a specific type of small-scale environmental impact, like an unreported quarry or a new patch of deforestation, somewhere within satellite imagery covering millions of square kilometers. Traditionally, analyzing this data required significant expertise and resources, making tracking specific threats across vast, remote areas difficult, especially for conservation groups, journalists, and regulators with limited budgets. The animation below illustrates how efficiently Earth Index pinpoints the illegal narcotrafficking airstrips in the Amazon, akin to finding a toothpick on a soccer field.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1280/0*P1cvHPCMB_I_9DEn" alt class="image--center mx-auto" /></p>
<p>But how does Earth Index make finding environmental features of interest possible on a global scale? The magic lies in a cutting-edge approach using AI foundation models trained via contrastive learning and the resulting vector embeddings.</p>
<p>Here’s a glimpse into their innovative process:</p>
<ol>
<li><p><strong>Tiling the Globe:</strong> Earth Index divides the Earth's entire land surface into over <strong>3.2 billion</strong> overlapping tiles, each roughly 10 hectares.</p>
</li>
<li><p><strong>Learning the "Planetary DNA" with Contrastive Self-Supervision:</strong> Earth Index leverages breakthroughs in AI, specifically foundation models trained for Earth Observation (EO) using self-supervised, contrastive learning methods (like the DINO objective). The foundation model (e.g., an SSL4EO Vision Transformer) learns the fundamental visual language of satellite imagery by distinguishing between pairs of images from the <em>same</em> location (positive pairs) and pairs from <em>different</em> locations (negative pairs). This happens across diverse global data (like Harmonized Landsat Sentinel-2, Sentinel-1 radar) without needing explicit human labels for features like "forest" or "city."</p>
</li>
<li><p><strong>Generating Vector Embeddings:</strong> Once pre-trained, this model acts as an encoder. Imagery for each tile is fed into it, producing a <strong>vector embedding</strong> – a rich numerical representation or fingerprint capturing the tile's essential visual and structural characteristics in a highly compressed format (&gt;10,000x compression). This high-dimensional vector acts like a unique "Planetary DNA" for that piece of land.</p>
</li>
<li><p><strong>Creating a Searchable Planet:</strong> These 3.2 billion vector embeddings form the core of Earth Index. They need to be stored in a specialized system optimized for rapidly finding vectors that are numerically similar to each other. Users can provide an example of what they're looking for (e.g., by clicking on a known illegal mine or a specific type of agricultural infrastructure on the map), and Earth Index searches this massive dataset to find other locations across the globe (or within a specific region) with similar vector fingerprints.</p>
</li>
</ol>
<h2 id="heading-real-world-impact-earth-index-in-action-for-environmental-protection">Real-World Impact: Earth Index in Action for Environmental Protection</h2>
<p>This technological foundation translates directly into powerful tools for understanding and protecting our environment. It empowers journalists (like those at Mongabay and Radio Free Europe), indigenous communities, conservation organizations (like the World Bank), and researchers to uncover hidden environmental changes, monitor industrial footprints, and gather evidence for conservation efforts with unprecedented ease and speed. The figure shows the shrimp farms in the Gulf of Fonseca.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1400/1*ChyVYAx3EIS_b27QmLS5iQ.png" alt /></p>
<p>Here are some examples of how Earth Index is being used:</p>
<ul>
<li><p><strong>Combating Deforestation and Illegal Activities:</strong> In collaboration with <strong>Mongabay</strong>, Earth Index was used to pierce through dense canopy cover and identify hidden, illegal narcotrafficking airstrips in the Amazon. Similarly, the platform can rapidly locate illegal gold mining scars, like those devastating parts of the Yanomami Indigenous Territory, helping to quantify the scale of the problem and guide enforcement efforts.</p>
</li>
<li><p><strong>Monitoring Resource Extraction:</strong> <strong>Radio Free Europe</strong> utilized Earth Index to identify over 1300 quarries across the Balkans, revealing that roughly half lacked proper permits, highlighting potential environmental damage and unregulated extraction. This approach can monitor fracking wellpads, rubber processing plants, or other industrial facilities.</p>
</li>
<li><p><strong>Protecting Water Quality and Ecosystems:</strong> Earth Index can locate Concentrated Animal Feeding Operations (CAFOs) near sensitive waterways to assess runoff risks or identify expansions of shrimp farming into vital mangrove ecosystems.</p>
</li>
<li><p><strong>Tracking Agricultural Expansion:</strong> The tool helps monitor the conversion of natural habitats into agricultural land by searching for patterns associated with new clearings or specific crop types.</p>
</li>
</ul>
<h2 id="heading-the-infrastructure-challenge-searching-billions-of-vectors-affordably">The Infrastructure Challenge: Searching Billions of Vectors Affordably</h2>
<p>Generating 3.2 billion high-dimensional vectors is one challenge; hosting and searching them efficiently presents another significant hurdle. Earth Index needed a solution that could:</p>
<ul>
<li><p><strong>Handle Massive Scales:</strong> Searching billions of vectors requires specialized indexing and querying capabilities.</p>
</li>
<li><p><strong>Integrate Seamlessly with Geospatial Data:</strong> Earth Index heavily relies on geospatial operations (like filtering results within specific administrative boundaries or proximity to rivers). This made PostgreSQL with PostGIS a natural and essential choice for their core database. Standalone vector databases often lack the rich geospatial features required.</p>
</li>
<li><p><strong>Maintain Performance:</strong> Users need results relatively quickly for the platform to be effective for investigation and monitoring.</p>
</li>
<li><p><strong>Control Costs:</strong> Storing and searching billions of vectors, especially if requiring vast amounts of RAM for purely in-memory indexes, can become prohibitively expensive for many organizations, particularly non-profits.</p>
</li>
</ul>
<p>Earth Index needed a vector search solution that lived within their chosen PostgreSQL environment, could scale massively, perform well, and wouldn't break the bank on hardware costs.</p>
<h2 id="heading-why-postgresql-rocks-for-planetary-scale-vectors">Why PostgreSQL Rocks for Planetary-Scale Vectors</h2>
<p>While specialized vector databases exist, Earth Index faced a critical need: seamlessly integrating vector similarity search with rich geospatial data and operations. This is where PostgreSQL, combined with extensions, truly shines:</p>
<ul>
<li><p><strong>Unified Data Platform:</strong> PostgreSQL, especially with the powerful PostGIS extension, allowed Earth Index to keep their vector embeddings alongside their crucial geospatial metadata (tile locations, administrative boundaries, proximity to features like rivers). This avoids the complexity and potential synchronization issues of managing separate databases for vector search and relational/geospatial data.</p>
</li>
<li><p><strong>Leveraging the Power of SQL &amp; PostGIS:</strong> Earth Index could combine vector similarity searches with standard SQL filters and complex PostGIS spatial queries (e.g., "find vectors similar to this example <em>within</em> this specific national park boundary" or "<em>near</em> this river system") all within a single query interface. Standalone vector databases often lack this mature geospatial query capability.</p>
</li>
<li><p><strong>Maturity and Ecosystem:</strong> Building on PostgreSQL means inheriting its decades of development, renowned stability, robust tooling, and wide community support – essential for a mission-critical platform like Earth Index.</p>
</li>
</ul>
<p>However, searching 3.2 billion vectors efficiently requires more than just standard PostgreSQL capabilities.</p>
<h2 id="heading-why-earth-index-chose-vectorchord-for-performance-and-affordability">Why Earth Index Chose VectorChord for Performance and Affordability</h2>
<p>While the PostgreSQL ecosystem provided the ideal foundation, achieving the required performance and affordability at the scale of 3.2 billion vectors presented a significant challenge. Earth Index needed a solution that could deliver speed without breaking the bank.</p>
<p>Finding the right balance of performance, cost, and integration was challenging. <a target="_blank" href="https://www.linkedin.com/in/tom-ingold/">Hutch Ingold</a>, CTO of Earth Genome, explains their considerations:</p>
<blockquote>
<p>When we evaluated solutions for our 3.2 billion vectors – recognizing the rapidly evolving tech landscape – standard pgvector didn't meet our performance requirements at that scale. Dedicated options like Qdrant posed difficulties: hosted versions were too expensive, and self-hosting clustered deployments on our Kubernetes infrastructure seemed insufficiently mature at the time. We also examined cloud services, such as Google's Vertex AI Vector Search, but the list price of $237,000 per month for our dataset was simply prohibitive.</p>
</blockquote>
<p>This is where <a target="_blank" href="https://github.com/tensorchord/VectorChord">VectorChord</a> became the enabling technology:</p>
<ul>
<li><p><strong>Performance Beyond Vanilla PGVector:</strong> VectorChord delivers the necessary speed enhancements – <strong>up to 5x faster queries, 16x higher insert throughput, and 16x quicker index building</strong> – addressing the performance limitations encountered with standard pgvector at this massive scale.</p>
</li>
<li><p><strong>Affordable Scale via Disk-Based Indexing:</strong> Critically, VectorChord's optimized <strong>disk-based indexing</strong> (leveraging cost-effective SSDs) directly counters the prohibitive costs highlighted by the CTO. It avoids the massive RAM requirements of purely in-memory approaches and the exorbitant monthly fees of managed cloud vector services, making the 3.2 billion vector index financially feasible.</p>
</li>
<li><p><strong>Native Integration &amp; Scalability:</strong> As a PostgreSQL extension, VectorChord provided these benefits <em>within</em> their chosen database, requiring no infrastructure changes. It's built to <strong>scale</strong> efficiently, handling the billions of vectors needed for Earth Index's global monitoring platform while maintaining interactive performance.</p>
</li>
</ul>
<p>VectorChord provided the crucial combination of high performance, massive scalability, and cost-effectiveness directly within the familiar and powerful PostgreSQL environment that Earth Index needed.</p>
<h2 id="heading-earth-index-technical-architecture-and-setup">Earth Index: Technical Architecture and Setup</h2>
<p>To manage its planetary dataset effectively and affordably, Earth Index built its backend on a cluster of three powerful AWS EC2 <strong>i8g.16xlarge</strong> instances, powered by the latest Graviton4 ARM processor and Nitro SSD. Each provides <strong>512 GB RAM, 64 vCPUs, and 15 TB of NVMe storage</strong>, offering a solid foundation for both computation and disk I/O. This setup, costing around $12,000 per month, delivers massive cost savings compared to the estimated $237,000 monthly fee for a comparable managed cloud vector service, making the project economically viable. These instances run PostgreSQL enhanced with VectorChord and PostGIS, efficiently hosting both the vector data and other application components.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1743665649499/4294a141-fc88-4134-93f7-8f4f88e29bfc.png" alt class="image--center mx-auto" /></p>
<p>The core data resides in a partitioned table structure across the three nodes. The primary table, <code>globe24</code>, stores <strong>each tile's unique ID</strong>, its <strong>geographic point geometry</strong> (<code>geometry(Point, 4326)</code>), a <strong>partition key</strong> (<code>partkey</code>), and the crucial <strong>384-dimensional AI-generated embedding</strong> stored efficiently as <code>halfvec(384)</code>. This sharding distributes the 3.2 billion rows across the cluster, allowing for parallel processing.</p>
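<p>For concreteness, here is a minimal sketch of what such a table could look like; the column names beyond those mentioned above and the exact partitioning scheme are assumptions, not Earth Index's actual DDL:</p>
<pre><code class="lang-pgsql">-- Hypothetical sketch of the globe24 layout described above
-- (requires the PostGIS and pgvector/VectorChord extensions)
CREATE TABLE globe24 (
    id        BIGINT NOT NULL,         -- unique tile ID
    geom      geometry(Point, 4326),   -- tile location as a PostGIS point
    partkey   INTEGER NOT NULL,        -- partition key used for sharding
    embedding halfvec(384)             -- 384-dim embedding in half precision
) PARTITION BY LIST (partkey);
</code></pre>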
<p>A key advantage of this architecture is the ability to perform complex, integrated queries. Earth Index commonly executes SQL statements that seamlessly combine VectorChord's fast vector search (<code>embedding &lt;=&gt; '{vector}'::vector</code>) with PostGIS geospatial filtering (<code>st_within(geom, '{geom}')</code>) and partition pruning (<code>partkey IN (...)</code>). This allows users to find visually similar locations within specific geographic boundaries efficiently. Even with billions of vectors, this integrated approach yields practical performance, achieving a median query latency (p50) of approximately <strong>761 ms</strong> for typical searches returning the top 500 results.</p>
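<p>The shape of such an integrated query, written against the sketch above with bind parameters standing in for the boundary geometry, partition keys, and query vector, looks roughly like this (illustrative, not Earth Index's exact SQL):</p>
<pre><code class="lang-pgsql">SELECT id
FROM globe24
WHERE st_within(geom, $1)     -- PostGIS geospatial filter
  AND partkey = ANY($2)       -- partition pruning
ORDER BY embedding &lt;=&gt; $3     -- VectorChord vector similarity search
LIMIT 500;
</code></pre>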
<h2 id="heading-enabling-environmental-action-at-scale">Enabling Environmental Action at Scale</h2>
<p>By providing a cost-effective, scalable, and tightly integrated vector search solution within PostgreSQL, VectorChord helps enable Earth Index's groundbreaking work. It allows them to manage their enormous "Planetary DNA" database affordably, ensuring that this powerful tool for environmental monitoring can reach the organizations and communities who need it most.</p>
<p><a target="_blank" href="https://app.earthindex.ai/waitlist/"><strong>Join the Earth Index waitlist</strong></a> today to explore the environmental challenges in your neighborhood, understand your ecological footprint, and be part of a global effort to protect the planet.</p>
<p>We are thrilled to see Earth Index leverage <a target="_blank" href="https://github.com/tensorchord/VectorChord">VectorChord</a> to turn satellite data into actionable environmental intelligence. Their success demonstrates the power of combining cutting-edge AI, robust database technology, and efficient vector search to address critical global challenges.</p>
<p>Interested in bringing large-scale vector search to your PostgreSQL database? Learn more about <a target="_blank" href="https://github.com/tensorchord/VectorChord">VectorChord</a> or get started today!</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://github.com/tensorchord/VectorChord">https://github.com/tensorchord/VectorChord</a></div>
]]></content:encoded></item><item><title><![CDATA[Vector Search Over PostgreSQL: A Comparative Analysis of Memory and Disk Solutions]]></title><description><![CDATA[Introduction: real-world challenges
Specialized vector databases aren’t always needed. With PostgreSQL extensions like pgvector, pgvectorscale, VectorChord (which evolved from pgvecto.rs)​​, you get vector search plus relational power—no extra infras...]]></description><link>https://blog.vectorchord.ai/vector-search-over-postgresql-a-comparative-analysis-of-memory-and-disk-solutions</link><guid isPermaLink="true">https://blog.vectorchord.ai/vector-search-over-postgresql-a-comparative-analysis-of-memory-and-disk-solutions</guid><category><![CDATA[vectorchord]]></category><category><![CDATA[pgvectorscale]]></category><category><![CDATA[PostgreSQL]]></category><category><![CDATA[pgvector]]></category><category><![CDATA[vector database]]></category><category><![CDATA[VectorSearch]]></category><category><![CDATA[Databases]]></category><category><![CDATA[indexing]]></category><category><![CDATA[Benchmark]]></category><category><![CDATA[postgres]]></category><category><![CDATA[performance]]></category><category><![CDATA[RAG ]]></category><category><![CDATA[search]]></category><dc:creator><![CDATA[Junyu Chen]]></dc:creator><pubDate>Thu, 03 Apr 2025 07:22:11 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1743662556243/5c9f8edc-b534-46ba-947d-d6c935329540.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction-real-world-challenges">Introduction: real-world challenges</h2>
<p>Specialized vector databases aren’t always needed. With PostgreSQL extensions like <a target="_blank" href="https://github.com/pgvector/pgvector">pgvector</a>, <a target="_blank" href="https://github.com/timescale/pgvectorscale">pgvectorscale</a>, <a target="_blank" href="https://github.com/tensorchord/VectorChord">VectorChord</a> (which evolved from <a target="_blank" href="http://pgvecto.rs">pgvecto.rs</a>)<strong>​</strong>​, you get vector search <em>plus</em> relational power—no extra infrastructure. But adoption isn’t seamless. Here’s what real users say:</p>
<blockquote>
<ul>
<li><p>The query <strong>takes 30 seconds</strong> for the first time and 100ms if it's a repeat query. The problem is very slow IO.</p>
</li>
<li><p>Is it normal to take 3 hours to build the index for 12 million rows with 75 dimension vector?</p>
</li>
<li><p>After 13 hours of building index for 200 million vectors of 75 dimensions each, I have noticed a heavy memory usage at 100GB/124GB at 30% of building index, and server will eventually crash.</p>
</li>
<li><p>We're hitting some challenges, particularly around need for frequent index updates, throughput constraints, lack of parallel index scans and no built-in quantization.</p>
</li>
</ul>
</blockquote>
<p>In this article, we'll take a deep dive into these questions and help you make informed decisions by understanding not only the "what" but also the "<strong>why</strong>" behind these tools. Let's explore how to navigate these complexities and choose the right solution for your needs.</p>
<h2 id="heading-what-do-we-value-most-in-a-vector-database">What do we value most in a vector database?</h2>
<blockquote>
<p>In PostgreSQL, <a target="_blank" href="https://www.postgresql.org/docs/current/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY"><code>shared_buffers</code></a> is a crucial memory area acting as the primary cache for data pages, <em>including table data and indexes</em>. By storing frequently accessed vector index blocks in this memory buffer, PostgreSQL significantly reduces slow disk I/O. When a query needs data, PostgreSQL first checks <code>shared_buffers</code>. If the data is present, it's retrieved quickly without disk access. If not, the data is fetched from disk and loaded into <code>shared_buffers</code> for future queries. To configure this, use <code>ALTER SYSTEM SET shared_buffers = 'xGB'</code> in psql and restart the server.</p>
</blockquote>
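<p>For example, on a 32 GB machine you might dedicate most of the memory to the buffer cache; the exact value depends on your workload, and this mirrors the 24 GB setting used in the experiments below:</p>
<pre><code class="lang-pgsql">-- Takes effect after a server restart
ALTER SYSTEM SET shared_buffers = '24GB';
-- After restarting, verify the active value
SHOW shared_buffers;
</code></pre>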
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1743385828242/4ae2279a-27d2-42ce-bb59-2b462a01e851.png" alt class="image--center mx-auto" /></p>
<p>Let's say you have some vectors in a table and the index has been built; it's time to run a query. Even though you have allocated enough <code>shared_buffers</code> for PostgreSQL, it still needs to read the index from disk for the <code>first query</code>. For a <code>repeated query</code>, however, the vectors can be retrieved directly from <code>shared_buffers</code>, making it much faster.</p>
<p>While the <code>first query</code> after a cold start or index change matters, sustained performance often depends on how quickly queries execute once the 'hot' or frequently accessed parts of the index are cached in memory (shared_buffers), mimicking the <code>repeated query</code> scenario. Therefore, optimizing for this cached performance is crucial for many real-world applications.</p>
<p>In this blog, we will compare the performance of vector similarity search extensions in these situations:</p>
<ul>
<li><p>Query performance when <strong>memory</strong> is sufficient</p>
</li>
<li><p>Query performance when memory is low and <strong>disk</strong> becomes a bottleneck</p>
</li>
</ul>
<p>We will also discuss other aspects related to index building, such as:</p>
<ul>
<li><p><code>Index build speed</code>: How long does it take to build the index? If I have multiple cores in the instance, can it use them all to speed things up?</p>
</li>
<li><p><code>Memory usage</code>: How much memory should be reserved for index creation? This is the most important consideration when choosing an instance.</p>
</li>
<li><p><code>Disk usage</code>: How many vectors can be hosted on a given amount of disk space? This is also a consideration when choosing an instance that uses non-expandable NVMe SSD (a type of high-performance solid state drive) storage.</p>
</li>
<li><p><code>Ease of use</code>: Can I use a simple <code>CREATE INDEX</code> command to create an index? How difficult is it to tune for query performance?</p>
</li>
</ul>
<p>Finally, we will discuss the <code>insertion performance</code> after index building is complete. This is important if the user has streamed data that will trigger frequent index updates.</p>
<h2 id="heading-experimental-setup">Experimental setup</h2>
<p>Here we present information about the hardware and datasets used in all of the following experiments.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td></td><td><strong>Vectors in memory</strong></td><td><strong>Vectors on disk</strong></td><td><strong>Index building</strong></td><td><strong>Insertion performance</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Instance type</td><td><a target="_blank" href="https://instances.vantage.sh/aws/ec2/i4i.xlarge"><code>I4I Extra Large</code></a></td><td><code>C6ID Large</code></td><td><a target="_blank" href="https://instances.vantage.sh/aws/ec2/i4i.xlarge"><code>I4I Extra Large</code></a></td><td><a target="_blank" href="https://instances.vantage.sh/aws/ec2/i4i.xlarge"><code>I4I Extra Large</code></a></td></tr>
<tr>
<td>Vcpus</td><td>4</td><td>2</td><td>4</td><td>4</td></tr>
<tr>
<td>Storage</td><td>937 GB NVMe SSD</td><td>118 GB NVMe SSD</td><td>937 GB NVMe SSD</td><td>937 GB NVMe SSD</td></tr>
<tr>
<td>Memory</td><td>32.0 GiB</td><td>4.0 GiB</td><td>32.0 GiB</td><td>32.0 GiB</td></tr>
<tr>
<td>PostgreSQL shared_buffers</td><td>24.0 GiB</td><td>2.0 GiB</td><td>1.0 GiB</td><td>24.0 GiB</td></tr>
<tr>
<td>Inserted rows</td><td>5,000,000</td><td>5,000,000</td><td>5,000,000</td><td>100 after the index is created</td></tr>
<tr>
<td>Distance metric</td><td>L2 distance</td><td>L2 distance</td><td>/</td><td>/</td></tr>
<tr>
<td>Test queries</td><td>10,000</td><td>10,000</td><td>/</td><td>/</td></tr>
</tbody>
</table>
</div><p>In this table, we show the version of all the extensions we use in these experiments:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td></td><td>Version</td><td>Image</td></tr>
</thead>
<tbody>
<tr>
<td>VectorChord</td><td>v0.2.2</td><td>tensorchord/vchord-postgres:pg17-v0.2.2</td></tr>
<tr>
<td>pgvector</td><td>v0.8.0</td><td>tensorchord/vchord-postgres:pg17-v0.2.2 (<code>with a pgvector v0.8.0 installed</code>)</td></tr>
<tr>
<td>pgvectorscale</td><td>v0.6.0</td><td>timescale/timescaledb-ha:pg17.4-ts2.18.2-oss</td></tr>
</tbody>
</table>
</div><h2 id="heading-vectors-in-memory-for-small-scale-data">Vectors in memory: for small scale data</h2>
<p>For a vector search system, the best performance comes from keeping the entire index cached in <code>shared_buffers</code>, which is easiest to achieve on a memory-optimized instance. In this situation, <code>shared_buffers</code> should ideally be large enough to hold the entire index.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1743143311992/45b592c0-37f1-4cac-9e3a-370bc7f7bee4.png" alt class="image--center mx-auto" /></p>
<p>For our experiment, we choose a medium-sized dataset, <a target="_blank" href="https://github.com/myscale/benchmark/blob/4784b3820627c6bc458c1acf6a8cc48c8efba450/public/home.md#the-datasets-"><code>LAION-5m</code></a>, which consists of 5 million vectors with 768 dimensions. To host <code>LAION-5m</code> in <code>shared_buffers</code>, we can use the AWS instance <a target="_blank" href="https://instances.vantage.sh/aws/ec2/i4i.xlarge"><code>I4I Extra Large</code></a>, with a 937GB NVMe SSD and 32GB of memory.</p>
<p>On this machine, we set the <code>shared_buffers</code> to 24GB and measure the performance of the <code>first query</code> and then the <code>repeated query</code>. The following is the result of the <code>repeated query</code> performance over <code>pgvector</code>, <code>pgvectorscale</code> and <code>VectorChord</code>.</p>
<p>The following graphs illustrate the trade-off between query speed (QPS) and search quality (Recall@10/100). Recall@10 measures the percentage of true nearest neighbors found within the top 10 results. In general, higher QPS can be achieved at the cost of lower recall, and vice versa. A better system may achieve higher recall at the same QPS, or higher QPS at the same recall level. Our goal is typically to maximize QPS while maintaining a high recall target (e.g., 95%).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1743412050649/4def8f06-694a-4021-a306-a54da58ea6ab.png" alt class="image--center mx-auto" /></p>
<details><summary>Why VectorChord is so fast?</summary><div data-type="detailsContent">VectorChord's <a target="_self" href="https://arxiv.org/abs/2405.12497">RabitQ</a> index achieves its speed through optimized vector compression, which allows for faster distance calculations. This compression technique minimizes data movement, resulting in efficient hardware utilization and improved query performance.</div></details>

<p>To measure first query performance, we cleared <code>shared_buffers</code> before running the 10,000 test queries sequentially. Although the buffer starts empty, index/data pages loaded into the cache by earlier queries within the <strong>same test run</strong> can sometimes be reused by subsequent queries. This can cause the instantaneous QPS to increase during the run. We report the <strong>average</strong> QPS across all 10,000 queries to provide a representative measure of performance with a cold cache.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1743350840716/cf131853-8131-4676-8e80-2125632e9b56.png" alt class="image--center mx-auto" /></p>
<p>In conclusion, <code>VectorChord</code> consistently achieves higher QPS at high recall levels (e.g., &gt;95% Recall@10) than the other extensions. Between the other two, <code>pgvectorscale</code> performs better on the <code>first query</code>, but once data is cached, <code>pgvector</code> outperforms it.</p>
<h2 id="heading-vectors-in-disk-for-large-scale-data"><strong>Vectors in disk</strong>: for large scale data</h2>
<p>There are times when things are different. For billions of vectors, it is not possible to use memory to store the entire index on a single machine. When the size of the vectors to be read exceeds the size of memory, the speedup effect of <code>shared_buffers</code> is greatly reduced, and most of the data is fetched from disk.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1743414426953/e512a560-e4f1-4f17-8307-ef823ad12086.png" alt class="image--center mx-auto" /></p>
<p>While <code>shared_buffers</code> becomes less effective, the operating system's page cache can still provide some caching benefits, although its performance is generally lower than <code>shared_buffers</code>.</p>
<details><summary>Indexing and Migration</summary><div data-type="detailsContent"><code>VectorChord's internal build</code> and <code>pgvectorscale</code> require additional memory to build the index, so it is not possible to build the index on such a small instance with 4GB of memory. To solve this problem, we build the index on another <code>I4I Extra Large</code> instance with 24GB of memory, and then backup and restore the database data directory from <code>AWS S3</code> <strong>for all extensions</strong>.</div></details>

<p>To illustrate this situation, we can host <code>LAION-5m</code> on a much smaller machine, <code>C6ID Large</code>, with <strong>4GB</strong> of memory and set <code>shared_buffers</code> to <strong>2GB</strong>. This configuration ensures that memory will be insufficient and most queries will be forced to go to <strong>disk</strong>.</p>
<p>In these cases, the difference in performance between the <code>first and repeated queries</code> will be much smaller due to insufficient memory. Therefore, it's acceptable to evaluate only the <code>repeated query</code> performance of each extension.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1743350651982/ae12fa1b-ae4a-435c-aa00-def89921992b.png" alt class="image--center mx-auto" /></p>
<p>As always, <code>VectorChord</code> consistently achieves higher QPS at high recall levels (e.g., &gt;95% Recall@10) compared to the other extensions, followed by <code>pgvector</code> and <code>pgvectorscale</code> with similar QPS.</p>
<h2 id="heading-experience-in-index-building"><strong>Experience in index building</strong></h2>
<p>As we discussed earlier, users also care about some index building metrics: Index build time, memory usage, disk usage and usability. We measured these metrics during the <code>CREATE INDEX</code> step of the above experiment on the <code>Laion-5m</code> dataset:</p>
<ul>
<li><p><code>Core utilization</code>: Percentage of available CPU cores effectively used during index creation. Higher values indicate better parallelism.</p>
</li>
<li><p><code>Index build time</code>: The time taken by the SQL command to build the index. If the index build requires an additional step (<code>VectorChord external build</code>), this should also be included.</p>
</li>
<li><p><code>Memory usage</code>: <strong>Peak</strong> memory usage during the index build, including the 1G <code>shared_buffers</code> allocated for PostgreSQL.</p>
</li>
<li><p><code>Disk usage</code>: The size of <code>PGDATA</code> on disk after index building is finished, including vectors and the index.</p>
</li>
</ul>
<details><summary>External build vs internal build in VectorChord</summary><div data-type="detailsContent">VectorChord supports two ways of building an index: <code>internal build</code> and <code>external build</code>. While the internal build works like pgvector and pgvectorscale, the external build requires an additional <code>k-means</code> clustering step performed outside the database. The external build usually yields <strong>better query performance</strong> and is the recommended way to index large vector datasets. If you want to know more about external build in VectorChord, please read our previous <a target="_self" href="https://blog.vectorchord.ai/benefits-and-steps-of-external-centroids-building-in-vectorchord">blog</a>.</div></details>

<p>All VectorChord query performance figures are measured after an <strong>external build</strong>, whose K-means clustering step ran on a standalone machine with an <code>A10 GPU</code>. If you insist on using the same <code>I4I Extra Large</code> instance for the external build, the index build time increases to 50 minutes, which is still competitive.</p>
<p>In this case, we set <code>shared_buffers</code> to 1GB and logged the peak memory usage of each extension. Both <code>VectorChord's internal build</code> and <code>VectorChord's external build</code> are set to <strong>25 iters</strong> for K-means clustering. For ease of use, <code>pgvector</code>, <code>pgvectorscale</code> and <code>VectorChord's internal build</code> have similar syntax with a single SQL to build the index, while <code>VectorChord's external build</code> is more complicated.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1743503547507/9847a9c4-fcaa-4d76-bf82-3f5c5e959ef4.png" alt class="image--center mx-auto" /></p>
<p>In most respects, <code>VectorChord's external build</code> takes the lead, but this comes at some cost to ease of use, as developers need to <strong>implement an additional K-means clustering step outside of PostgreSQL</strong> and manage the resulting centroids.</p>
<p>If you prefer to start with a simple <code>CREATE INDEX</code> command, <code>VectorChord's external build</code> may not be the best choice; you can turn to other extensions or methods:</p>
<ul>
<li><p>Choose <code>VectorChord's internal build</code> for a faster index build</p>
</li>
<li><p>Choose <code>pgvector</code> if you really don't have enough memory</p>
</li>
<li><p>Choose <code>pgvectorscale</code> to reduce disk requirements or get better capacity</p>
</li>
</ul>
<h2 id="heading-insertion-performance-for-stream-generated-data">Insertion performance: for stream-generated data</h2>
<p>With streaming vector data, insertion speed can become a new concern. In real-time data analysis systems, the number of new vectors written per second can be very high. Each time new vectors are added to an index, the index itself must be updated. This capability is usually referred to as <code>insertion performance</code>.</p>
<details><summary>Why does insertion performance differ between index types?</summary><div data-type="detailsContent">For graph-based indexes (pgvector HNSW, pgvectorscale), an index update usually performs multiple searches in the graph and creates new edges for the inserted vector, while clustering-based indexes (VectorChord) simply find the closest cluster and append the new vector to it.</div></details>

<p>So we set up an experiment to measure insertion performance. After building the index on 5 million vectors, we inserted another 100,000 randomly generated vectors one after another and recorded the total insertion time. A toy version of this test is sketched below.</p>
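<p>The sketch assumes a hypothetical table <code>items(embedding vector(768))</code> rather than the exact benchmark schema, and inserts rows one at a time as in the experiment:</p>
<pre><code class="lang-pgsql">DO $$
BEGIN
  FOR i IN 1..100 LOOP
    -- Each statement builds a fresh 768-dimensional random vector
    INSERT INTO items (embedding)
    VALUES (ARRAY(SELECT random() FROM generate_series(1, 768))::vector);
  END LOOP;
END $$;
</code></pre>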
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1743497738743/68305934-b301-4f53-85ae-d614eb070577.png" alt class="image--center mx-auto" /></p>
<p>As you can see, VectorChord (<strong>1565 Insert/Sec</strong>) performs much better than pgvector (<strong>246 Insert/Sec</strong>) and pgvectorscale (<strong>107 Insert/Sec</strong>) when inserting data. For many workloads, 100 Insert/Sec is more than enough; however, if your production system continuously ingests vectors at a higher rate (for example, <strong>500 Insert/Sec</strong>), the slower extensions could become a bottleneck.</p>
<h2 id="heading-summary">Summary</h2>
<p>Based on the results above, the following table summarizes the differences between the three extensions. We can see that the main advantage of <code>VectorChord</code> is concentrated in <strong>performance</strong>, while <code>pgvector</code> and <code>pgvectorscale</code> have their own advantages in terms of <strong>usability</strong> and <strong>capacity</strong>.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td></td><td>pgvector</td><td>pgvectorscale</td><td>VectorChord</td></tr>
</thead>
<tbody>
<tr>
<td>Ease of use</td><td><strong>Easy</strong></td><td><strong>Easy, similar to pgvector</strong></td><td><strong>Easy</strong> (internal), hard (external) with 2-stage build</td></tr>
<tr>
<td>Maximum indexable dimensions</td><td>2000</td><td><strong>16000</strong></td><td><strong>16000</strong></td></tr>
<tr>
<td>Ease of tuning</td><td><strong>Easy</strong></td><td>Moderate, need manually tune query_rescore for different queries</td><td><strong>Easy</strong></td></tr>
<tr>
<td>Highest Achievable Recall</td><td>Bottom-tier</td><td>Mid-tier</td><td><strong>Top-tier</strong></td></tr>
<tr>
<td>QPS of repeated query</td><td>Mid-tier</td><td>Bottom-tier</td><td><strong>Top-tier</strong></td></tr>
<tr>
<td>QPS of first query</td><td>Mid-tier</td><td><strong>Top-tier</strong></td><td><strong>Top-tier</strong></td></tr>
<tr>
<td>QPS of disk index</td><td>Mid-tier</td><td>Mid-tier</td><td><strong>Top-tier</strong></td></tr>
<tr>
<td>Speed of index building</td><td>Mid-tier, support multiple cores</td><td>Bottom-tier, does not support multiple cores</td><td><strong>Top-tier, support multiple cores</strong></td></tr>
<tr>
<td>Peak Index Build Memory Usage</td><td><strong>2.1 GB</strong></td><td>15.2 GB</td><td><strong>1.1 GB</strong> (external), 9.8 GB (internal)</td></tr>
<tr>
<td>Total Disk Usage</td><td>Nearly largest</td><td><strong>Smallest</strong></td><td>Largest</td></tr>
</tbody>
</table>
</div><p>Want to know more about how VectorChord compares to other extensions? In future blogs, we will explore the differences of PostgreSQL vector extensions in more scenarios, such as <a target="_blank" href="https://blog.vectorchord.ai/vectorchord-bm25-revolutionize-postgresql-search-with-bm25-ranking-3x-faster-than-elasticsearch"><strong>bm25 hybrid search</strong></a> (combining traditional keyword search with vector similarity search) and <strong>filtered vector search</strong>. Ready to try VectorChord for your vector search needs? Download the <a target="_blank" href="https://github.com/tensorchord/VectorChord">extension</a> and feel free to join our <a target="_blank" href="https://discord.gg/KqswhpVgdU">Discord community</a> or contact us at <a target="_blank" href="mailto:support@tensorchord.ai">support@tensorchord.ai</a>.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://github.com/tensorchord/VectorChord">https://github.com/tensorchord/VectorChord</a></div>
]]></content:encoded></item><item><title><![CDATA[PostgreSQL Full-Text Search: Fast When Done Right (Debunking the Slow Myth)]]></title><description><![CDATA[You might have come across discussions or blog posts suggesting that PostgreSQL's built-in full-text search (FTS) struggles with performance compared to dedicated search engines or specialized extensions. A notable recent example comes from Neon's bl...]]></description><link>https://blog.vectorchord.ai/postgresql-full-text-search-fast-when-done-right-debunking-the-slow-myth</link><guid isPermaLink="true">https://blog.vectorchord.ai/postgresql-full-text-search-fast-when-done-right-debunking-the-slow-myth</guid><category><![CDATA[bm25]]></category><category><![CDATA[search]]></category><category><![CDATA[pgvector]]></category><category><![CDATA[vector database]]></category><category><![CDATA[PostgreSQL]]></category><category><![CDATA[postgres]]></category><category><![CDATA[Performance Optimization]]></category><category><![CDATA[performance]]></category><category><![CDATA[full text search]]></category><dc:creator><![CDATA[Jinjing Zhou]]></dc:creator><pubDate>Wed, 02 Apr 2025 07:43:16 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1743577782038/dc9c7ce8-3eb1-4735-bc65-46f24917050e.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>You might have come across discussions or blog posts suggesting that PostgreSQL's built-in full-text search (FTS) struggles with performance compared to dedicated search engines or specialized extensions. A notable recent example comes from <strong>Neon's blog post, "Performance Benchmark: pg_search on Neon"</strong> (<a target="_blank" href="https://neon.tech/blog/pgsearch-on-neon">link</a>).</p>
<p>In their benchmark, Neon compared query performance on their database platform <em>with</em> their <code>pg_search</code> extension (based on Rust's Tantivy library via <code>pgrx</code>) against the Postgres built-in fulltext search setting with tsvector and GIN index. They commendably stated they optimized this standard setup by adding GIN indexes where appropriate (benchmark code available <a target="_blank" href="https://github.com/paradedb/paradedb/tree/dev/benchmarks">here</a>).</p>
<p>However, while adding GIN indexes is a necessary first step, their results showing significantly slower performance for the "standard" setup suggest crucial <em>additional</em> optimization steps for PostgreSQL FTS were likely missed. The conclusion that standard FTS is inherently much slower than <code>pg_search</code> might be based on an unintentionally handicapped baseline.</p>
<p>Let's dive into how to <em>correctly</em> set up and use standard PostgreSQL FTS for optimal performance, addressing the specific configuration flaws likely present in the baseline used in the Neon benchmark, and demonstrating the true speed of built-in FTS. We'll show concrete numbers demonstrating a <strong>~50x performance increase</strong> just by applying these standard optimizations to the baseline configuration.</p>
<h2 id="heading-the-benchmark-setup-recap-from-neons-analysis"><strong>The Benchmark Setup (Recap from Neon's Analysis)</strong></h2>
<p>The analysis used a table structure similar to this with 10 million log entries:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> benchmark_logs (
    <span class="hljs-keyword">id</span> <span class="hljs-built_in">SERIAL</span> PRIMARY <span class="hljs-keyword">KEY</span>,
    message <span class="hljs-built_in">TEXT</span>,
    country <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">255</span>),
    severity <span class="hljs-built_in">INTEGER</span>,
    <span class="hljs-built_in">timestamp</span> <span class="hljs-built_in">TIMESTAMP</span>,
    metadata JSONB
);
</code></pre>
<p>They tested various query types. Many involved searching the <code>message</code> text field, using queries structured like this in their "standard Postgres" examples:</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Create index on to_tsvector('english', message)</span>
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">INDEX</span> message_gin <span class="hljs-keyword">ON</span> benchmark_logs <span class="hljs-keyword">USING</span> gin (to_tsvector(<span class="hljs-string">'english'</span>, message));
<span class="hljs-comment">-- Example problematic query structure (likely used in Neon's baseline)</span>
<span class="hljs-keyword">SELECT</span> country, <span class="hljs-keyword">COUNT</span>(*)
<span class="hljs-keyword">FROM</span> benchmark_logs
<span class="hljs-keyword">WHERE</span> to_tsvector(<span class="hljs-string">'english'</span>, message) @@ to_tsquery(<span class="hljs-string">'english'</span>, <span class="hljs-string">'research'</span>)
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> country
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> country;
</code></pre>
<p>Even with a GIN index present on <code>message</code>, this query structure and the likely default GIN index settings are where the performance issues for the <em>standard FTS baseline</em> begin.</p>
<h4 id="heading-our-test-environment-details-replicating-the-baseline-scenario"><strong>Our Test Environment Details (Replicating the Baseline Scenario)</strong></h4>
<p>To demonstrate the impact of proper optimization on standard FTS, we used the following setup:</p>
<ul>
<li><p><strong>Instance:</strong> AWS EC2 <code>i7ie.xlarge</code> with <strong>local NVMe SSD</strong> (minimizing I/O impact).</p>
</li>
<li><p><strong>CPU:</strong> <strong>4 vCPUs</strong>.</p>
</li>
<li><p><strong>PostgreSQL Configuration:</strong> PostgreSQL 16 via Docker, configured for the hardware:</p>
<pre><code class="lang-bash">  docker run -d \
    --name my-postgres \
    --network=host \
    -v /mnt/localssd/pgdata:/var/lib/postgresql/data \
    -e POSTGRES_PASSWORD=mysecretpassword \
    postgres:latest \
    -c shared_buffers=8GB \
    -c maintenance_work_mem=8GB \
    -c max_parallel_workers=4 \
    -c max_worker_processes=4
</code></pre>
</li>
<li><p><strong>Parallelism Note:</strong> This setup provides <code>max_parallel_workers_per_gather = 2</code>. The Neon post mentioned their test environment used 8 parallel workers due to larger instances, implying potentially higher parallelism (<code>max_parallel_workers_per_gather</code> up to 8) than our setup. Our optimized results for <em>standard</em> FTS were achieved with <em>less</em> query parallelism.</p>
</li>
</ul>
<p><strong>Mistake #1:</strong> Calculating <code>tsvector</code> On-the-Fly (Major issue)</p>
<p>The sample queries shown in the Neon blog (and common in basic FTS examples) calculate the <code>tsvector</code> within the <code>WHERE</code> clause:</p>
<pre><code class="lang-sql">WHERE to_tsvector('english', message) @@ to_tsquery('english', 'research')
</code></pre>
<p>This forces PostgreSQL to:</p>
<ol>
<li><p><strong>Perform Expensive Computation:</strong> Run <code>to_tsvector()</code> (parsing, stemming, etc.) repeatedly for many rows during query execution.</p>
</li>
<li><p><strong>Limit Index Efficiency:</strong> Prevent the most direct and efficient use of the GIN index, even if one exists on the base <code>message</code> column.</p>
</li>
</ol>
<h4 id="heading-the-fix-for-a-proper-standard-fts-baseline-pre-calculate-and-store-the-tsvector"><strong>The Fix (For a Proper Standard FTS Baseline):</strong> Pre-calculate and store the <code>tsvector</code>.</h4>
<ol>
<li><p><strong>Add a</strong> <code>tsvector</code> column:</p>
<pre><code class="lang-sql"> <span class="hljs-keyword">ALTER</span> <span class="hljs-keyword">TABLE</span> benchmark_logs <span class="hljs-keyword">ADD</span> <span class="hljs-keyword">COLUMN</span> message_tsvector tsvector;
</code></pre>
</li>
<li><p><strong>Populate the column:</strong></p>
<pre><code class="lang-sql"> <span class="hljs-keyword">UPDATE</span> benchmark_logs <span class="hljs-keyword">SET</span> message_tsvector = to_tsvector(<span class="hljs-string">'english'</span>, message);
</code></pre>
</li>
<li><p><strong>Index the</strong> <code>tsvector</code> column (with <code>fastupdate=off</code>), as shown in the fix for Mistake #2 below.</p>
</li>
<li><p><strong>Rewrite queries:</strong></p>
<pre><code class="lang-sql"> <span class="hljs-comment">-- Optimized standard FTS query</span>
 <span class="hljs-keyword">SELECT</span> country, <span class="hljs-keyword">COUNT</span>(*)
 <span class="hljs-keyword">FROM</span> benchmark_logs
 <span class="hljs-keyword">WHERE</span> message_tsvector @@ to_tsquery(<span class="hljs-string">'english'</span>, <span class="hljs-string">'research'</span>) <span class="hljs-comment">-- Use the indexed column!</span>
 <span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> country
 <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> country;
</code></pre>
</li>
</ol>
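<p>As a side note, on PostgreSQL 12 and later you can let the database maintain this column automatically with a stored generated column instead of a manual <code>UPDATE</code>; this works because <code>to_tsvector</code> with an explicit configuration is immutable (a minimal sketch of the alternative, not the setup benchmarked here):</p>
<pre><code class="lang-sql">ALTER TABLE benchmark_logs
ADD COLUMN message_tsvector tsvector
GENERATED ALWAYS AS (to_tsvector('english', message)) STORED;
</code></pre>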
<p><strong>Mistake #2:</strong> Ignoring GIN Index <code>fastupdate</code> (Minor issue)</p>
<p>While Neon's benchmark correctly identified GIN as the index type for standard FTS, it likely used the default setting: <code>fastupdate=on</code>.</p>
<ul>
<li><p><code>fastupdate=on</code> (Default): Prioritizes index <em>update</em> speed by using pending lists. This significantly slows down <em>searches</em> over time, especially on large, static datasets used for benchmarks, as both the main index and growing pending lists must be scanned. It also contributes to index bloat and less efficient page structure.</p>
</li>
<li><p><code>fastupdate=off</code>: Prioritizes <em>search</em> speed. Index updates are slower, but the resulting index is more compact and significantly faster to query as there are no pending lists to scan. Essential for read-heavy workloads or benchmarking search performance.</p>
</li>
</ul>
<p>For a fair comparison focused on <em>search</em> speed, the standard FTS baseline should have used <code>fastupdate=off</code>.</p>
<p><strong>The Fix (For a Proper Standard FTS Baseline):</strong></p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Create the GIN index correctly for search speed</span>
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">INDEX</span> idx_gin_logs_message_tsvector
<span class="hljs-keyword">ON</span> benchmark_logs <span class="hljs-keyword">USING</span> GIN (message_tsvector) <span class="hljs-comment">-- Index the dedicated tsvector column</span>
<span class="hljs-keyword">WITH</span> (fastupdate = <span class="hljs-keyword">off</span>);
</code></pre>
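<p>If you have already built a GIN index with the default settings, you can switch it in place and flush the accumulated pending list instead of rebuilding from scratch (both are standard PostgreSQL commands):</p>
<pre><code class="lang-sql">ALTER INDEX idx_gin_logs_message_tsvector SET (fastupdate = off);
-- Merge entries still sitting in the pending list into the main index
SELECT gin_clean_pending_list('idx_gin_logs_message_tsvector'::regclass);
</code></pre>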
<h4 id="heading-the-performance-impact-a-50x-speedup-for-standard-fts"><strong>The Performance Impact: A ~50x Speedup for Standard FTS</strong></h4>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1743581394771/adecbc81-f0d7-4bf2-9696-1206b197386b.png" alt class="image--center mx-auto" /></p>
<p>What happens when we apply these <em>standard</em> optimizations to the <em>standard</em> PostgreSQL FTS setup? We tested a query from Neon's examples on the 10-million-row dataset:</p>
<ul>
<li><p><strong>Unoptimized Standard FTS (Neon's Baseline):</strong> ~41301 ms (41.3 seconds)</p>
</li>
<li><p><strong>Optimized Standard FTS (Our Fixes):</strong> ~877 ms (0.88 seconds)</p>
</li>
</ul>
<p>This demonstrates a <strong>~50x speed improvement</strong> for <em>standard PostgreSQL FTS</em> achieved simply by applying well-established best practices. This transforms the baseline performance dramatically, suggesting the comparison point used in the Neon benchmark may not have represented the true potential of built-in FTS. This speedup was achieved even with potentially less query parallelism available than in Neon's tests.</p>
<h2 id="heading-the-grain-of-truth-ranking-tsrank-performance"><strong>The Grain of Truth: Ranking (</strong><code>ts_rank</code><strong>) Performance</strong></h2>
<p>Neon's analysis <em>did</em> highlight one valid point: ranking speed. Functions like <code>ts_rank</code> or <code>ts_rank_cd</code> <em>can</em> be relatively slow in standard PostgreSQL compared to merely finding matches, because ranking must fetch and score every matching row <em>before</em> the limit applies, which can be I/O and CPU intensive.</p>
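<p>To make the pattern concrete, a typical ranked query looks like this (using the optimized <code>message_tsvector</code> column from earlier); every matching row must be scored before the <code>LIMIT</code> takes effect:</p>
<pre><code class="lang-sql">SELECT id, ts_rank(message_tsvector, query) AS rank
FROM benchmark_logs,
     to_tsquery('english', 'research') AS query
WHERE message_tsvector @@ query
ORDER BY rank DESC
LIMIT 10;
</code></pre>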
<h2 id="heading-beyond-standard-fts-advanced-ranking-with-vectorchord-bm25"><strong>Beyond Standard FTS: Advanced Ranking with VectorChord-BM25</strong></h2>
<p>While standard FTS excels at <em>finding</em> matches quickly, if sophisticated, high-performance <em>ranking</em> is critical, specialized tools are often required. Instead of relying solely on built-in functions, consider <a target="_blank" href="https://github.com/tensorchord/VectorChord-bm25"><strong>VectorChord-BM25</strong></a>.</p>
<p><a target="_blank" href="https://github.com/tensorchord/VectorChord-bm25"><strong>VectorChord-BM25</strong></a> is a PostgreSQL extension specifically designed for fast, relevance-based ranking using the well-regarded <strong>BM25 algorithm</strong>. BM25 goes beyond simple term matching, estimating relevance based on term frequency within a document and inverse document frequency across the entire collection.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1740189817247/5a81c020-1aa2-452d-8e8d-b9653c5b3489.png?auto=compress,format&amp;format=webp" alt="Benchmark against ElasticSearch" /></p>
<p>Key advantages reported for VectorChord-BM25 include:</p>
<ul>
<li><p><strong>High Performance:</strong> Designed for speed, reportedly outperforming other solutions like Elasticsearch (up to 3x faster) and potentially pg_search for ranking tasks.</p>
</li>
<li><p><strong>BM25 Algorithm:</strong> Implements a standard, effective relevance ranking model.</p>
</li>
<li><p><strong>Specialized Indexing:</strong> Uses a dedicated bm25 index type (leveraging optimizations like Block WeakAnd) which is crucial for both performance and calculating the global statistics needed for BM25 scoring.</p>
</li>
<li><p><strong>Dedicated Data Type:</strong> Introduces a bm25vector type to store tokenized representations suitable for BM25.</p>
</li>
</ul>
<p>If your application heavily relies on the quality and speed of relevance ranking, and standard ts_rank proves insufficient, <a target="_blank" href="https://github.com/tensorchord/VectorChord-bm25"><strong>VectorChord-BM25</strong></a> presents a powerful, integrated alternative within PostgreSQL.</p>
<h2 id="heading-conclusion-standard-fts-is-faster-than-you-might-think"><strong>Conclusion: Standard FTS is Faster Than You Might Think</strong></h2>
<p>Standard PostgreSQL FTS is a highly capable and performant feature for <em>finding</em> documents when correctly optimized using <code>tsvector</code> columns and properly configured GIN indexes (with <code>fastupdate=off</code> for read-heavy workloads). Benchmarks suggesting otherwise may be comparing against unoptimized baselines.</p>
<p>For advanced <em>relevance ranking</em> where standard <code>ts_rank</code> falls short, specialized extensions like <strong>VectorChord-BM25</strong> offer significant performance improvements by implementing algorithms like BM25 with dedicated data types and optimized indexing strategies.</p>
<p>Choose the right tool for the job: optimize standard FTS first, and if high-performance relevance ranking is paramount, explore dedicated extensions built for that purpose.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://github.com/tensorchord/VectorChord-bm25">https://github.com/tensorchord/VectorChord-bm25</a></div>
]]></content:encoded></item><item><title><![CDATA[VectorChord-BM25: Introducing pg_tokenizer—A Standalone, Multilingual Tokenizer for Advanced Search]]></title><description><![CDATA[We're excited to announce the release of VectorChord-BM25 version 0.2, our PostgreSQL extension designed to bring advanced BM25-based full-text search ranking capabilities directly into your database!
VectorChord-BM25 allows you to leverage the power...]]></description><link>https://blog.vectorchord.ai/vectorchord-bm25-introducing-pgtokenizera-standalone-multilingual-tokenizer-for-advanced-search</link><guid isPermaLink="true">https://blog.vectorchord.ai/vectorchord-bm25-introducing-pgtokenizera-standalone-multilingual-tokenizer-for-advanced-search</guid><category><![CDATA[vector database]]></category><category><![CDATA[PostgreSQL]]></category><category><![CDATA[llm]]></category><category><![CDATA[full text search]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Databases]]></category><dc:creator><![CDATA[Jinjing Zhou]]></dc:creator><pubDate>Wed, 02 Apr 2025 03:25:38 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1743564280747/b9d95907-7ce7-4c26-9151-9d2e14f2d472.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We're excited to announce the release of <strong>VectorChord-BM25 version 0.2</strong>, our PostgreSQL extension designed to bring advanced BM25-based full-text search ranking capabilities directly into your database!</p>
<p>VectorChord-BM25 allows you to leverage the power of the BM25 algorithm, a standard in information retrieval, without needing external search engines. This release marks a significant step forward, focusing heavily on enhancing the core text processing component: <strong>tokenization</strong>, unlocking greater flexibility and <strong>significantly improved multilingual support</strong>.</p>
<h3 id="heading-the-big-news-introducing-pgtokenizerrs">The Big News: Introducing <code>pg_tokenizer.rs</code></h3>
<p>The cornerstone of VectorChord-BM25 0.2 is a <strong>completely refactored and decoupled tokenizer extension:</strong> <code>pg_tokenizer.rs</code>.</p>
<p>Why the change? We realized that tokenization – the process of breaking down text into meaningful units (tokens) – is incredibly complex and highly dependent on the specific language and use case. Supporting multiple languages with their unique rules, custom dictionaries, stemming, stop words, and different tokenization strategies within a single monolithic extension was becoming cumbersome.</p>
<p>By moving the tokenizer into its own dedicated project (<code>pg_tokenizer.rs</code>) under the permissive <strong>Apache License</strong>, we achieve several key benefits:</p>
<ol>
<li><p><strong>Modularity &amp; Flexibility:</strong> Developers can now use or customize the tokenizer independently of the core BM25 ranking logic.</p>
</li>
<li><p><strong>Easier Contribution:</strong> The focused nature of <code>pg_tokenizer.rs</code> makes it simpler for the community to contribute <strong>new language support</strong>, tokenization techniques, or custom filters.</p>
</li>
<li><p><strong>Faster Iteration:</strong> We can now develop and release improvements to the tokenizer more rapidly without needing a full VectorChord-BM25 release cycle.</p>
</li>
<li><p><strong>Enhanced Customization:</strong> Users gain significantly more control over how their text, <strong>regardless of language</strong>, is processed before ranking.</p>
</li>
</ol>
<h3 id="heading-whats-new-in-tokenization-thanks-to-pgtokenizerrs">What's New in Tokenization (Thanks to <code>pg_tokenizer.rs</code>)</h3>
<p>This new architecture enables several powerful features in v0.2:</p>
<ul>
<li><p><strong>Expanded Language Support:</strong> Directly handle diverse linguistic needs with dedicated tokenizers like <strong>Jieba (Chinese)</strong> and <strong>Lindera (Japanese)</strong>, alongside powerful <strong>multilingual LLM-based tokenizers</strong> (like <strong>Gemma2</strong> and <strong>LLMLingua2</strong>) trained on vast datasets covering a wide array of languages.</p>
</li>
<li><p><strong>Richer Tokenization Features:</strong> You now have more granular control over the tokenization pipeline:</p>
<ul>
<li><p><strong>Custom Stop Words:</strong> Define your own lists of words to ignore during indexing and search.</p>
</li>
<li><p><strong>Custom Stemmers:</strong> Apply stemming rules for various supported languages or even define custom ones.</p>
</li>
<li><p><strong>Custom Synonyms:</strong> Define synonym lists to treat different words as equivalent (e.g., "postgres", "postgresql", "pgsql").</p>
</li>
<li><p><strong>Language-Specific Options:</strong> Leverage fine-grained controls available within specific tokenizers (like Lindera or Jieba) when needed.</p>
</li>
</ul>
</li>
</ul>
<h3 id="heading-show-me-the-code">Show Me the Code!</h3>
<p>Let's see how easy it is to use the new tokenizer features.</p>
<p><strong>1. Using a Pre-trained Multilingual LLM Tokenizer (LLMLingua2)</strong></p>
<p>LLM-based tokenizers are great for handling text from many different languages.</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Enable the extensions (if not already done)</span>
<span class="hljs-keyword">CREATE</span> EXTENSION <span class="hljs-keyword">IF</span> <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">EXISTS</span> vchord_bm25;
<span class="hljs-keyword">CREATE</span> EXTENSION <span class="hljs-keyword">IF</span> <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">EXISTS</span> pg_tokenizer;

<span class="hljs-comment">-- Update search_path for the first time</span>
<span class="hljs-keyword">ALTER</span> <span class="hljs-keyword">SYSTEM</span> <span class="hljs-keyword">SET</span> search_path <span class="hljs-keyword">TO</span> <span class="hljs-string">"$user"</span>, <span class="hljs-keyword">public</span>, tokenizer_catalog, bm25_catalog;
<span class="hljs-keyword">SELECT</span> pg_reload_conf();

<span class="hljs-comment">-- Create a tokenizer configuration using the LLMLingua2 model</span>
<span class="hljs-keyword">SELECT</span> create_tokenizer(<span class="hljs-string">'llm_tokenizer'</span>, $$
<span class="hljs-keyword">model</span> = <span class="hljs-string">"llmlingua2"</span>
$$);

<span class="hljs-comment">-- Tokenize some English text</span>
<span class="hljs-keyword">SELECT</span> tokenize(<span class="hljs-string">'PostgreSQL is a powerful, open source database.'</span>, <span class="hljs-string">'llm_tokenizer'</span>);
<span class="hljs-comment">-- Output: {2795,7134,158897,83,10,113138,4,9803,31344,63399,5} -- Example token IDs</span>

<span class="hljs-comment">-- Tokenize some Spanish text (LLMLingua2 handles multiple languages)</span>
<span class="hljs-keyword">SELECT</span> tokenize(<span class="hljs-string">'PostgreSQL es una potente base de datos de código abierto.'</span>, <span class="hljs-string">'llm_tokenizer'</span>);
<span class="hljs-comment">-- Output: {2795,7134,158897,198,220,105889,3647,8,13084,8,55845,118754,5} -- Example token IDs</span>

<span class="hljs-comment">-- Integrate with a table</span>
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> documents (
    <span class="hljs-keyword">id</span> <span class="hljs-built_in">SERIAL</span> PRIMARY <span class="hljs-keyword">KEY</span>,
    passage <span class="hljs-built_in">TEXT</span>,
    embedding bm25vector
);

<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> documents (passage) <span class="hljs-keyword">VALUES</span> (<span class="hljs-string">'PostgreSQL is a powerful, open source database.'</span>);

<span class="hljs-keyword">UPDATE</span> documents
<span class="hljs-keyword">SET</span> embedding = tokenize(passage, <span class="hljs-string">'llm_tokenizer'</span>)
<span class="hljs-keyword">WHERE</span> <span class="hljs-keyword">id</span> = <span class="hljs-number">1</span>; <span class="hljs-comment">-- Or process the whole table</span>
</code></pre>
<p><strong>2. Creating a Custom Tokenizer with Filters (Example: English)</strong></p>
<p>This example defines a custom pipeline, including lowercasing, Unicode normalization, skipping non-alphanumeric tokens, using NLTK English stop words, and the Porter2 stemmer. It then automatically trains a model and sets up a trigger to tokenize text on insert/update.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> articles (
    <span class="hljs-keyword">id</span> <span class="hljs-built_in">SERIAL</span> PRIMARY <span class="hljs-keyword">KEY</span>,
    <span class="hljs-keyword">content</span> <span class="hljs-built_in">TEXT</span>,
    embedding bm25vector
);

<span class="hljs-comment">-- Define a custom text analysis pipeline</span>
<span class="hljs-keyword">SELECT</span> create_text_analyzer(<span class="hljs-string">'english_analyzer'</span>, $$
pre_tokenizer = <span class="hljs-string">"unicode_segmentation"</span>  <span class="hljs-comment"># Basic word splitting</span>
[[character_filters]]
to_lowercase = {}                       <span class="hljs-comment"># Lowercase everything</span>
[[character_filters]]
unicode_normalization = <span class="hljs-string">"nfkd"</span>          <span class="hljs-comment"># Normalize Unicode</span>
[[token_filters]]
skip_non_alphanumeric = {}              <span class="hljs-comment"># Remove punctuation-only tokens</span>
[[token_filters]]
stopwords = <span class="hljs-string">"nltk_english"</span>              <span class="hljs-comment"># Use built-in English stopwords</span>
[[token_filters]]
stemmer = <span class="hljs-string">"english_porter2"</span>             <span class="hljs-comment"># Apply Porter2 stemming</span>
$$);

<span class="hljs-comment">-- Create tokenizer, custom model based on 'articles.content', and trigger</span>
<span class="hljs-keyword">SELECT</span> create_custom_model_tokenizer_and_trigger(
    tokenizer_name =&gt; <span class="hljs-string">'custom_english_tokenizer'</span>,
    model_name =&gt; <span class="hljs-string">'article_model'</span>,
    text_analyzer_name =&gt; <span class="hljs-string">'english_analyzer'</span>,
    table_name =&gt; <span class="hljs-string">'articles'</span>,
    source_column =&gt; <span class="hljs-string">'content'</span>,
    target_column =&gt; <span class="hljs-string">'embedding'</span>
);

<span class="hljs-comment">-- Now, inserts automatically generate tokens</span>
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> articles (<span class="hljs-keyword">content</span>) <span class="hljs-keyword">VALUES</span>
(<span class="hljs-string">'VectorChord-BM25 provides advanced ranking features for PostgreSQL users.'</span>);

<span class="hljs-keyword">SELECT</span> embedding <span class="hljs-keyword">FROM</span> articles <span class="hljs-keyword">WHERE</span> <span class="hljs-keyword">id</span> = <span class="hljs-number">1</span>;
<span class="hljs-comment">-- Output: {1:1, 2:1, 3:1, 4:1, 5:1, 6:1, 7:1, 8:1}</span>
<span class="hljs-comment">-- Bm25vector based on the custom model and pipeline</span>
</code></pre>
<p><strong>3. Using Jieba for Chinese Text</strong></p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> chinese_docs (
    <span class="hljs-keyword">id</span> <span class="hljs-built_in">SERIAL</span> PRIMARY <span class="hljs-keyword">KEY</span>,
    passage <span class="hljs-built_in">TEXT</span>,
    embedding bm25vector
);

<span class="hljs-comment">-- Define a text analyzer using the Jieba pre-tokenizer</span>
<span class="hljs-keyword">SELECT</span> create_text_analyzer(<span class="hljs-string">'jieba_analyzer'</span>, $$
[pre_tokenizer.jieba]
<span class="hljs-comment"># Optional Jieba configurations can go here</span>
$$);

<span class="hljs-comment">-- Create tokenizer, custom model, and trigger for Chinese text</span>
<span class="hljs-keyword">SELECT</span> create_custom_model_tokenizer_and_trigger(
    tokenizer_name =&gt; <span class="hljs-string">'chinese_tokenizer'</span>,
    model_name =&gt; <span class="hljs-string">'chinese_model'</span>,
    text_analyzer_name =&gt; <span class="hljs-string">'jieba_analyzer'</span>,
    table_name =&gt; <span class="hljs-string">'chinese_docs'</span>,
    source_column =&gt; <span class="hljs-string">'passage'</span>,
    target_column =&gt; <span class="hljs-string">'embedding'</span>
);

<span class="hljs-comment">-- Insert Chinese text</span>
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> chinese_docs (passage) <span class="hljs-keyword">VALUES</span>
(<span class="hljs-string">'红海早过了，船在印度洋面上开驶着。'</span>); <span class="hljs-comment">-- Example sentence</span>

<span class="hljs-keyword">SELECT</span> embedding <span class="hljs-keyword">FROM</span> chinese_docs <span class="hljs-keyword">WHERE</span> <span class="hljs-keyword">id</span> = <span class="hljs-number">1</span>;
<span class="hljs-comment">-- Output: {1:1, 2:1, 3:1, 4:1, 5:1, 6:1, 7:1, 8:1, 9:1, 10:1, 11:1, 12:1}</span>
<span class="hljs-comment">-- Bm25vector based on Jieba segmentation</span>
</code></pre>
<p><em>(For full examples, including custom stop words and synonyms, please refer to the</em> <code>pg_tokenizer.rs</code> documentation.)</p>
<h3 id="heading-understanding-the-tokenizer-configuration">Understanding the Tokenizer Configuration</h3>
<p>The new tokenizer system revolves around two core concepts:</p>
<ol>
<li><p><strong>Text Analyzer:</strong> Defines <em>how</em> raw text is processed into a sequence of tokens. It consists of:</p>
<ul>
<li><p><code>character_filters</code>: Modify text <em>before</em> splitting (e.g., lowercasing, Unicode normalization).</p>
</li>
<li><p><code>pre_tokenizer</code>: Splits the text into initial tokens (e.g., based on Unicode rules, Jieba, Lindera).</p>
</li>
<li><p><code>token_filters</code>: Modify or filter tokens <em>after</em> splitting (e.g., stop word removal, stemming, synonym replacement).</p>
</li>
</ul>
</li>
<li><p><strong>Model:</strong> Defines the mapping from the processed tokens to the final integer token IDs used by BM25. Models can be:</p>
<ul>
<li><p><code>pre-trained</code>: Use established vocabularies and rules (like <code>bert-base-uncased</code>, <code>llmlingua2</code>). Great for general purpose and multilingual use.</p>
</li>
<li><p><code>custom</code>: Build a vocabulary dynamically from your own data, tailored specifically to your corpus and language(s).</p>
</li>
</ul>
</li>
</ol>
<p>You can define these components separately or inline them when creating a tokenizer using a simple TOML configuration format passed as a string in SQL.</p>
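<p>For the inline form, here is a hypothetical sketch; the dotted-key layout is our assumption, so please check the <code>pg_tokenizer.rs</code> documentation for the exact schema:</p>
<pre><code class="lang-sql">-- Assumed inline syntax: embed analyzer settings and the model in one config
SELECT create_tokenizer('inline_example', $$
text_analyzer.pre_tokenizer = "unicode_segmentation"
model = "llmlingua2"
$$);
</code></pre>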
<h3 id="heading-get-started-with-vectorchord-bm25-02">Get Started with VectorChord-BM25 0.2!</h3>
<p>This release significantly boosts the flexibility and power of VectorChord-BM25, especially for users dealing with <strong>multiple languages</strong> or needing fine-grained control over text processing.</p>
<ul>
<li><p><strong>GitHub Repository:</strong> <a target="_blank" href="https://github.com/tensorchord/VectorChord-bm25">https://github.com/tensorchord/VectorChord-bm25</a></p>
</li>
<li><p><strong>Tokenizer Documentation:</strong> <a target="_blank" href="https://github.com/tensorchord/pg_tokenizer.rs">https://github.com/tensorchord/pg_tokenizer.rs</a></p>
</li>
</ul>
<p>We encourage you to try out version 0.2 and explore the new tokenization capabilities. Your feedback is invaluable – please report any issues or suggest features on our GitHub repository.</p>
<p>Upgrade your PostgreSQL full-text search today with the enhanced multilingual flexibility of VectorChord-BM25 0.2 and <code>pg_tokenizer.rs</code>!</p>
<hr />
]]></content:encoded></item><item><title><![CDATA[Hybrid search with Postgres Native BM25 and VectorChord]]></title><description><![CDATA[In the era of RAG (Retrieval-Augmented Generation), efficiently retrieving relevant data from vast datasets is crucial for businesses and developers. Traditional keyword-based retrieval methods, such as those utilizing BM25 as a scoring mechanism, ex...]]></description><link>https://blog.vectorchord.ai/hybrid-search-with-postgres-native-bm25-and-vectorchord</link><guid isPermaLink="true">https://blog.vectorchord.ai/hybrid-search-with-postgres-native-bm25-and-vectorchord</guid><category><![CDATA[PostgreSQL]]></category><category><![CDATA[vector database]]></category><category><![CDATA[nlp]]></category><category><![CDATA[llm]]></category><dc:creator><![CDATA[xieydd]]></dc:creator><pubDate>Thu, 13 Mar 2025 08:56:04 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1741856002748/0b002f69-2a55-4a21-9a9e-e757c1c073cc.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the era of RAG (Retrieval-Augmented Generation), efficiently retrieving relevant data from vast datasets is crucial for businesses and developers. Traditional keyword-based retrieval methods, such as those utilizing <strong>BM25</strong> as a scoring mechanism, excel in ranking documents based on term frequency and exact keyword matches. This makes them highly effective for structured queries and keyword-heavy content. However, they struggle with understanding synonyms, contextual nuances, and semantic intent. Conversely, vector-based retrieval captures deep semantic meaning, enabling better generalization across varied queries but sometimes sacrificing precision in exact keyword matches. Hybrid Search bridges this gap by combining BM25’s precision with vector search’s contextual understanding, delivering faster, more accurate, and semantically aware results.</p>
<p>This article explores how to implement Hybrid Search in Postgres using <a target="_blank" href="https://github.com/tensorchord/VectorChord-bm25">VectorChord-bm25</a> for keyword-based retrieval and <a target="_blank" href="https://github.com/tensorchord/VectorChord">VectorChord</a> for semantic search. By leveraging VectorChord-bm25's BM25 ranking with the <strong>Block-WeakAnd</strong> algorithm and VectorChord's advanced vector similarity capabilities, you can build a robust search system that seamlessly integrates keyword precision with semantic understanding. Whether you're building a recommendation engine, a document retrieval system, or an enterprise search solution, this guide will walk you through the steps to unlock the full potential of hybrid search.</p>
<p>All related benchmark code can be found <a target="_blank" href="https://github.com/xieydd/vectorchord-hybrid-search">here</a>.</p>
<h2 id="heading-hybrid-search-explained">Hybrid Search Explained</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1741269653573/b0e91529-52dc-4072-a7cb-f5410fdc43f3.jpeg" alt class="image--center mx-auto" /></p>
<p>Hybrid Search merges the results from vector search and keyword search. Vector search is based on the semantic similarity between the query and the documents, while keyword search relies on the BM25 ranking algorithm. However, it is important to note that BM25 itself is only responsible for scoring documents, while the broader keyword search process also includes steps such as <strong>tokenization</strong>, <strong>indexing</strong>, and <strong>query parsing</strong>. Hybrid Search then combines the two result sets, either by fusing them or by reranking. Fusion methods include RRF (Reciprocal Rank Fusion) and weighted scoring across multiple recall strategies; model-based reranking includes cross-encoder models like <code>bge-reranker-v2-m3</code> and the multi-vector representation model <code>ColBERT</code>.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://github.com/tensorchord/VectorChord">https://github.com/tensorchord/VectorChord</a></div>
<p> </p>
<h2 id="heading-what-is-the-vectorchord">What is the VectorChord?</h2>
<p>You may be familiar with <a target="_blank" href="https://github.com/pgvector/pgvector">pgvector</a>, an open-source vector similarity search extension for Postgres. However, scalable or performance-critical scenarios may require a more advanced vector search solution. <a target="_blank" href="https://github.com/tensorchord/VectorChord">VectorChord</a> is an excellent choice.</p>
<p>Check out the blog post "<a target="_blank" href="https://blog.vectorchord.ai/vectorchord-store-400k-vectors-for-1-in-postgresql">VectorChord: Store 400k Vectors for $1 in PostgreSQL</a>" to learn more about its motivation and design.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://github.com/tensorchord/VectorChord-bm25">https://github.com/tensorchord/VectorChord-bm25</a></div>
<p> </p>
<h2 id="heading-what-is-vectorchord-bm25">What is VectorChord-BM25</h2>
<p>VectorChord-BM25 is a PostgreSQL extension for keyword search. It not only implements BM25 ranking but also includes a tokenizer and a Block-WeakAnd index to improve speed.</p>
<h3 id="heading-what-is-bm25">What is BM25?</h3>
<p>BM25 (Best Matching 25) is a probabilistic ranking function used in information retrieval to assess how well a document matches a query. It calculates relevance scores based on term frequency (TF) and inverse document frequency (IDF), while also applying document length normalization. The formula ensures that terms appearing frequently within a document (TF) and those rare across the corpus (IDF) are appropriately weighted, improving search accuracy and relevance. <strong>IDF</strong> measures how often a word appears across the document collection. The fewer the occurrences, the higher its value. <strong>TF</strong> represents the frequency of a specific word from the query appearing in the given document. A higher TF indicates a stronger relevance between the query and the document.</p>
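<p>For reference, the IDF component commonly takes this form (the exact variant differs slightly between implementations):</p>
<p>$$\text{IDF}(q) = \ln\left(\frac{N - n(q) + 0.5}{n(q) + 0.5} + 1\right)$$</p><p>where <em>N</em> is the total number of documents and <em>n(q)</em> is the number of documents containing the term <em>q</em>. The full scoring formula, with term frequency and length normalization, is covered in the announcement post linked below.</p>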
<h3 id="heading-why-a-postgres-native-bm25-ranking-implementation-is-needed">Why a Postgres native BM25 Ranking implementation is needed?</h3>
<p>In our previous blog post, <a target="_blank" href="https://blog.vectorchord.ai/vectorchord-bm25-revolutionize-postgresql-search-with-bm25-ranking-3x-faster-than-elasticsearch"><em>VectorChord-BM25: Revolutionizing PostgreSQL Search with BM25 Ranking — 3x Faster Than Elasticsearch</em></a>, my colleague Allen already explains this in detail, so I refer you to that post.</p>
<h2 id="heading-tutorial-use-vectorchord-bm25-and-vectorchord-implement-hybrid-search">Tutorial: Use VectorChord-BM25 and VectorChord implement Hybrid Search</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1741271200353/52e0faba-1d9d-4a61-972c-219df22f1922.jpeg" alt class="image--center mx-auto" /></p>
<p>In this tutorial, we will guide you through the steps to implement Hybrid Search using VectorChord-BM25 and VectorChord in PostgreSQL. We will cover the following topics:</p>
<ul>
<li><p><strong>Semantic Search with VectorChord</strong></p>
</li>
<li><p><strong>Keyword Search with VectorChord-BM25</strong></p>
</li>
<li><p><strong>Rerank</strong></p>
</li>
</ul>
<h3 id="heading-prerequisites">Prerequisites</h3>
<ol>
<li>Postgres with VectorChord-BM25 and VectorChord</li>
</ol>
<p>If you want to reproduce this tutorial quickly, you can use the <code>tensorchord/vchord-suite</code> image, which bundles the extensions TensorChord provides.</p>
<p>Run the following command to start Postgres with VectorChord-BM25 and VectorChord, then connect (for example with <code>psql -h localhost -U postgres</code>) and create the extensions:</p>
<pre><code class="lang-bash">docker run   \           
  --name vchord-suite  \
  -e POSTGRES_PASSWORD=postgres  \
  -p 5432:5432 \
  -d tensorchord/vchord-suite:pg17-latest
</code></pre>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> EXTENSION <span class="hljs-keyword">IF</span> <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">EXISTS</span> vchord <span class="hljs-keyword">CASCADE</span>;
<span class="hljs-keyword">CREATE</span> EXTENSION <span class="hljs-keyword">IF</span> <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">EXISTS</span> pg_tokenizer <span class="hljs-keyword">CASCADE</span>;
<span class="hljs-keyword">CREATE</span> EXTENSION <span class="hljs-keyword">IF</span> <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">EXISTS</span> vchord_bm25 <span class="hljs-keyword">CASCADE</span>;
\dx
--- example output ---
pg_tokenizer | 0.1.0   | tokenizer_catalog | pg_tokenizer
vchord       | 0.3.0   | public            | vchord: Vector database plugin for Postgres, written in Rust, specifically designed for LLM
vchord_bm25  | 0.2.0   | bm25_catalog      | vchord_bm25: A postgresql extension for bm25 ranking algorithm
vector       | 0.8.0   | public            | vector data type and ivfflat and hnsw access methods
</code></pre>
<ol start="2">
<li>Prepare the Embedding Model and Data</li>
</ol>
<p>For embedding, we can use a pre-trained embedding model like <code>BGE-M3</code> to generate embeddings for the documents. <code>BGE-M3</code> is a high-quality embedding model known for its versatility in multi-functionality, multi-linguality, and multi-granularity.</p>
<p>For validation, we use the <a target="_blank" href="https://github.com/beir-cellar/beir">BEIR</a> dataset, a heterogeneous benchmark for information retrieval. It is easy to use and allows you to evaluate your models across 15+ diverse IR datasets.</p>
<p>First, you need to load your data into PostgreSQL. Then you can use the following Python snippet to create the table and insert documents together with their embeddings.</p>
<pre><code class="lang-python"><span class="hljs-keyword">with</span> self.conn.cursor() <span class="hljs-keyword">as</span> cursor:
    cursor.execute(
        <span class="hljs-string">f"CREATE TABLE IF NOT EXISTS <span class="hljs-subst">{self.dataset}</span>_corpus (id TEXT, text TEXT, emb vector(<span class="hljs-subst">{self.vector_dim}</span>), bm25 bm25vector);"</span>
    )

    <span class="hljs-keyword">for</span> did, doc <span class="hljs-keyword">in</span> tqdm(zip(doc_ids, docs), desc=<span class="hljs-string">"insert corpus"</span>):
        emb = self.sentence_encoder.encode_docs([doc])[0]  # use the encoder class defined below
        cursor.execute(
            <span class="hljs-string">f"INSERT INTO <span class="hljs-subst">{self.dataset}</span>_corpus (id, text, emb) VALUES (%s, %s, %s)"</span>,
            (did, doc, emb),
        )
</code></pre>
<p>For embedding:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> FlagEmbedding <span class="hljs-keyword">import</span> BGEM3FlagModel
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">SentenceEmbedding</span>:</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, model_name: str = <span class="hljs-string">"BAAI/bge-m3"</span></span>):</span>
        self.model = BGEM3FlagModel(
            model_name,
            use_fp16=<span class="hljs-literal">True</span>,
        )
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">encode_docs</span>(<span class="hljs-params">self, documents: list[str]</span>):</span>
        <span class="hljs-keyword">return</span> self.model.encode(
            documents,
            batch_size=<span class="hljs-number">32</span>,
            max_length=<span class="hljs-number">8192</span>,
        )[<span class="hljs-string">'dense_vecs'</span>]
</code></pre>
<p>The data processing and embedding generation code can be found <a target="_blank" href="https://github.com/xieydd/vectorchord-hybrid-search">here</a>.</p>
<h3 id="heading-semantic-search-with-vectorchord">Semantic Search with VectorChord</h3>
<p>For semantic search, we utilize the <a target="_blank" href="https://dev.to/gaoj0017/quantization-in-the-counterintuitive-high-dimensional-space-4feg">RaBitQ</a> algorithm, which is highly optimized in the VectorChord PostgreSQL extension. RaBitQ is a quantization algorithm for high-dimensional spaces, designed to improve the storage and retrieval efficiency of high-dimensional data, such as embedding vectors.</p>
<p>It achieves this by mapping high-dimensional vectors to a low-dimensional discrete space while preserving the similarity information of the original vectors. This process reduces storage requirements and computational costs while maintaining high retrieval accuracy.</p>
<pre><code class="lang-python">centroids = min(<span class="hljs-number">4</span> * int(self.num**<span class="hljs-number">0.5</span>), self.num // <span class="hljs-number">40</span>)
ivf_config = <span class="hljs-string">f"""
residual_quantization = true
[build.internal]
lists = [<span class="hljs-subst">{centroids}</span>]
build_threads = <span class="hljs-subst">{workers}</span>
spherical_centroids = false
"""</span>
<span class="hljs-keyword">with</span> self.conn.cursor() <span class="hljs-keyword">as</span> cursor:
    cursor.execute(<span class="hljs-string">f"SET max_parallel_maintenance_workers TO <span class="hljs-subst">{workers}</span>"</span>)
    cursor.execute(<span class="hljs-string">f"SET max_parallel_workers TO <span class="hljs-subst">{workers}</span>"</span>)
    cursor.execute(
        <span class="hljs-string">f"CREATE INDEX <span class="hljs-subst">{self.dataset}</span>_rabitq ON <span class="hljs-subst">{self.dataset}</span>_corpus USING vchordrq (emb vector_l2_ops) WITH (options = $$<span class="hljs-subst">{ivf_config}</span>$$)"</span>
    )
</code></pre>
<p>If you find that building the index on your own dataset is too slow, you can utilize external build to accelerate the process. For more details, please refer to our previous blog: <a target="_blank" href="https://blog.vectorchord.ai/benefits-and-steps-of-external-centroids-building-in-vectorchord">Benefits and Steps of External Centroids Building in VectorChord</a>.</p>
<p>Once the index is built, you can use the following SQL query to search for similar documents:</p>
<pre><code class="lang-python">probe = int(<span class="hljs-number">0.1</span> * min(<span class="hljs-number">4</span> * int(self.num**<span class="hljs-number">0.5</span>), self.num // <span class="hljs-number">40</span>))
<span class="hljs-keyword">with</span> self.conn.cursor() <span class="hljs-keyword">as</span> cursor:
    cursor.execute(<span class="hljs-string">f"SET vchordrq.probes = <span class="hljs-subst">{probe}</span>"</span>)
    cursor.execute(
        <span class="hljs-string">f"select q.id as qid, c.id, c.score from <span class="hljs-subst">{self.dataset}</span>_query q, lateral ("</span>
        <span class="hljs-string">f"select id, <span class="hljs-subst">{self.dataset}</span>_corpus.emb &lt;-&gt; q.emb as score from "</span>
        <span class="hljs-string">f"<span class="hljs-subst">{self.dataset}</span>_corpus order by score limit <span class="hljs-subst">{topk}</span>) c;"</span>
    )
</code></pre>
<h3 id="heading-key-word-search-with-vectorchord-bm25">Key-Word Search with VectorChord-BM25</h3>
<p>For keyword search, we need to use a <strong>tokenizer</strong> to convert the text into a <strong>BM25 vector</strong>, which is similar to a sparse vector that stores the vocabulary ID and frequency.</p>
<pre><code class="lang-python">cursor.execute(
    <span class="hljs-string">f"SELECT create_tokenizer('<span class="hljs-subst">{self.dataset}</span>_token', $$"</span>,
    <span class="hljs-string">f"tokenizer = 'unicode'"</span>,
    <span class="hljs-string">f"stopwords = 'nltk'"</span>,
    <span class="hljs-string">f"table = '<span class="hljs-subst">{self.dataset}</span>_corpus'"</span>,
    <span class="hljs-string">f"column = 'text'"</span>,
    <span class="hljs-string">f"$$);"</span>
)
cursor.execute(
    <span class="hljs-string">f"UPDATE <span class="hljs-subst">{self.dataset}</span>_corpus SET bm25 = tokenize(text, '<span class="hljs-subst">{self.dataset}</span>_token')"</span>
)
</code></pre>
<p>Let me explain this operation in detail. The <code>tokenize</code> function is used to tokenize the text with the specified tokenizer. In this case, we used the <strong>Unicode tokenizer</strong>. The output of the <code>tokenize</code> function is a BM25 vector, which is a sparse vector that stores the vocabulary ID and frequency of each word in the text. For example, <code>1035:7</code> means the word with vocabulary ID <code>1035</code> appears 7 times in the text.</p>
<pre><code class="lang-pgsql">postgres=# <span class="hljs-keyword">select</span> bm25 <span class="hljs-keyword">from</span> fiqa_corpus <span class="hljs-keyword">limit</span> <span class="hljs-number">1</span>;
<span class="hljs-comment">-- Output: {1035:7, 1041:1, 1996:1, 1997:1, 1999:3, 2010:3, 2015:7, 2019:1, 2022:1, 2028:4, 2036:2, 2041:1, 2051:2, 2054...</span>
</code></pre>
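<p>The query below assumes a BM25 index already exists on the <code>bm25</code> column; that step is not shown in the snippets above, so here is a sketch of it (the name matches the <code>to_bm25query</code> call that follows, and the syntax follows the VectorChord-BM25 README):</p>
<pre><code class="lang-sql">-- Build the Block-WeakAnd BM25 index over the bm25vector column
CREATE INDEX fiqa_text_bm25 ON fiqa_corpus USING bm25 (bm25 bm25_ops);
</code></pre>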
<p>After creating the index, you can calculate the BM25 score between the query and the vectors. Note that the operator returns a <strong>negated</strong> BM25 score: the more relevant the document, the smaller (more negative) the value. We intentionally negate it so that the default ascending <code>ORDER BY</code> retrieves the most relevant documents first.</p>
<pre><code class="lang-python"><span class="hljs-keyword">with</span> self.conn.cursor() <span class="hljs-keyword">as</span> cursor:
    cursor.execute(
        <span class="hljs-string">f"SELECT q.id AS qid, c.id, c.score FROM <span class="hljs-subst">{self.dataset}</span>_query q, LATERAL ("</span>
        <span class="hljs-string">f"SELECT id, <span class="hljs-subst">{self.dataset}</span>_corpus.bm25 &lt;&amp;&gt; to_bm25query('<span class="hljs-subst">{self.dataset}</span>_text_bm25', q.text , '<span class="hljs-subst">{self.dataset}</span>_token') AS score "</span>
        <span class="hljs-string">f"FROM <span class="hljs-subst">{self.dataset}</span>_corpus "</span>
        <span class="hljs-string">f"ORDER BY score "</span>
        <span class="hljs-string">f"LIMIT <span class="hljs-subst">{topk}</span>) c;"</span>
    )
</code></pre>
<h3 id="heading-rerankfuse">Rerank/Fuse</h3>
<p>Once you obtain the results from both semantic search and keyword search, you can use a <strong>fusion or rerank</strong> step to merge them.</p>
<ol>
<li>Fusion (<strong>Reciprocal Rank Fusion, RRF</strong>)</li>
</ol>
<p>The advantage of <strong>RRF</strong> lies in its independence from specific score units; it fuses results based on rank alone, making it suitable for combining systems with different scoring criteria.</p>
<p>$$\text{RRF}(d) = \sum_{i=1}^{n} \frac{1}{k + \text{rank}_i(d)}$$</p><p>where:</p>
<ul>
<li><p><code>d</code> is the document.</p>
</li>
<li><p><code>n</code> is the number of ranking systems.</p>
</li>
<li><p><code>rank_i(d)</code> is the rank of document <code>d</code> in the <em>i</em>-th ranking system.</p>
</li>
<li><p><code>k</code> is a smoothing constant, usually set to 60 (an empirical value), which controls how strongly lower ranks are discounted.</p>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-keyword">for</span> rank, (query_id, doc_id, _) <span class="hljs-keyword">in</span> enumerate(result, start=<span class="hljs-number">1</span>):
    <span class="hljs-keyword">if</span> query_id <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> rrf_scores:
        rrf_scores[query_id] = {}
    <span class="hljs-keyword">if</span> doc_id <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> rrf_scores[query_id]:
        rrf_scores[query_id][doc_id] = <span class="hljs-number">0</span>
    <span class="hljs-comment"># Calculate and accumulate RRF scores</span>
    rrf_scores[query_id][doc_id] += <span class="hljs-number">1</span> / (k + rank)
</code></pre>
<ol start="2">
<li>Cross-Encoder model Rerank</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1741785632054/96d3b52b-5370-40ba-bfef-01b0c8af8dd9.jpeg" alt class="image--center mx-auto" /></p>
<p>In semantic search, we already use a Bi-Encoder to vectorize the documents and queries separately. But this independent encoding leads to a lack of interaction between the query and the document. A Cross-Encoder model instead feeds the query and document into the model together, so it sees the content of both at the same time and captures the fine-grained semantic relationship between them through the Transformer layers. Compared with RRF, a Cross-Encoder rerank can therefore be more accurate, but it is slower.</p>
<pre><code class="lang-python">reranker = FlagReranker(
    <span class="hljs-string">'BAAI/bge-reranker-v2-m3'</span>, 
    query_max_length=<span class="hljs-number">256</span>,
    passage_max_length=<span class="hljs-number">512</span>,
    use_fp16=<span class="hljs-literal">True</span>,
    devices=[<span class="hljs-string">"cuda:0"</span>] <span class="hljs-comment"># change ["cpu"] if you do not have gpu, but it will be very slow</span>
) <span class="hljs-comment"># Setting use_fp16 to True speeds up computation with a slight performance degradation</span>

<span class="hljs-keyword">for</span> query_id, docs <span class="hljs-keyword">in</span> tqdm(results.items()):
    scores = reranker.compute_score(pairs, normalize=<span class="hljs-literal">True</span>) 
    <span class="hljs-keyword">for</span> i, doc_id <span class="hljs-keyword">in</span> enumerate(docs):
        bge_scores[query_id][doc_id] = scores[i]
</code></pre>
<h3 id="heading-evaluation">Evaluation</h3>
<p>We tested <strong>NDCG@10</strong> for these methods on several BEIR datasets. Here are the results:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Dataset</td><td>Semantic Search</td><td>BM25</td><td>Cross-Encoder Rerank</td><td>RRF</td></tr>
</thead>
<tbody>
<tr>
<td>FiQA-2018</td><td>0.40343</td><td>0.25301</td><td><strong>0.42706</strong></td><td>0.37632</td></tr>
<tr>
<td>Quora</td><td>0.88433</td><td>0.78687</td><td><strong>0.89069</strong></td><td>0.87014</td></tr>
<tr>
<td>SCIDOCS</td><td>0.16055</td><td>0.15584</td><td><strong>0.17942</strong></td><td>0.17299</td></tr>
<tr>
<td>SciFact</td><td>0.57631</td><td>0.68495</td><td><strong>0.74635</strong></td><td>0.67855</td></tr>
</tbody>
</table>
</div><p>The table above supports several conclusions:</p>
<ul>
<li><p>Compared to BM25 and semantic search, the cross-encoder model rerank can significantly improve search performance across different datasets.</p>
</li>
<li><p>On some datasets, RRF may decrease performance. Please run verification tests before deciding whether to use it. Where it works, RRF is a very economical choice: it is far faster than reranking with a cross-encoder model and consumes almost no extra resources.</p>
</li>
</ul>
<p>All related benchmark code can be found <a target="_blank" href="https://github.com/xieydd/vectorchord-hybrid-search">here</a>. If you have any questions, please feel free to contact us on <a target="_blank" href="https://discord.gg/KqswhpVgdU">Discord</a> or email us at <a target="_blank" href="mailto:vectorchord-inquiry@tensorchord.ai">vectorchord-inquiry@tensorchord.ai</a>.</p>
<h2 id="heading-future-work">Future Work</h2>
<p>The results highlight the great potential of the hybrid search method. In the future, we will focus on the following aspects to advance RAG:</p>
<ul>
<li><p><strong>Improve the reranking method</strong>: Explore other model-based rerankers, such as ColBERT or ColPali, to enhance reranking performance.</p>
</li>
<li><p><strong>Integrate graph search</strong>: Investigate the use of the <a target="_blank" href="https://github.com/apache/age"><strong>Apache AGE</strong></a> extension for graph search and integrate it with the hybrid search method to further improve search performance.</p>
</li>
</ul>
<h2 id="heading-references">References</h2>
<ul>
<li><p><a target="_blank" href="https://github.com/tensorchord/vectorChord/">https://github.com/tensorchord/vectorChord/</a></p>
</li>
<li><p><a target="_blank" href="https://github.com/tensorchord/VectorChord-bm25">https://github.com/tensorchord/VectorChord-bm25</a></p>
</li>
<li><p><a target="_blank" href="https://techcommunity.microsoft.com/blog/adforpostgresql/introducing-the-graphrag-solution-for-azure-database-for-postgresql/4299871">https://techcommunity.microsoft.com/blog/adforpostgresql/introducing-the-graphrag-solution-for-azure-database-for-postgresql/4299871</a></p>
</li>
<li><p><a target="_blank" href="https://www.pinecone.io/learn/hybrid-search-intro/">https://www.pinecone.io/learn/hybrid-search-intro/</a></p>
</li>
<li><p><a target="_blank" href="https://www.pinecone.io/learn/series/rag/rerankers/">https://www.pinecone.io/learn/series/rag/rerankers/</a></p>
</li>
<li><p><a target="_blank" href="https://blog.vespa.ai/improving-zero-shot-ranking-with-vespa-part-two/">https://blog.vespa.ai/improving-zero-shot-ranking-with-vespa-part-two/</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Vector Search at 10,000 QPS in PostgreSQL with VectorChord]]></title><description><![CDATA[VectorChord is a PostgreSQL extension designed for scalable, high-performance, and disk-efficient vector similarity search, and serves as the successor to pgvecto.rs. In our previous blog post, we showed that with just $250 per month, VectorChord ach...]]></description><link>https://blog.vectorchord.ai/vector-search-at-10000-qps-in-postgresql-with-vectorchord</link><guid isPermaLink="true">https://blog.vectorchord.ai/vector-search-at-10000-qps-in-postgresql-with-vectorchord</guid><category><![CDATA[VectorSearch]]></category><category><![CDATA[PostgreSQL]]></category><category><![CDATA[Databases]]></category><category><![CDATA[vector database]]></category><category><![CDATA[scalability]]></category><category><![CDATA[cloud native]]></category><dc:creator><![CDATA[Junyu Chen]]></dc:creator><pubDate>Thu, 27 Feb 2025 09:31:33 GMT</pubDate><content:encoded><![CDATA[<p>VectorChord is a PostgreSQL extension designed for scalable, high-performance, and disk-efficient vector similarity search, and serves as the successor to <a target="_blank" href="https://github.com/tensorchord/pgvecto.rs">pgvecto.rs</a>. In our previous <a target="_blank" href="https://blog.vectorchord.ai/vectorchord-store-400k-vectors-for-1-in-postgresql">blog post</a>, we showed that with just $250 per month, VectorChord achieved 131 QPS with 0.95 precision on 100 million vectors—demonstrating impressive cost-effective performance for large-scale vector search.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733194599461/eb1cf21d-4eb0-4427-8b39-bbb96d125182.png?auto=compress,format&amp;format=webp" alt class="image--center mx-auto" /></p>
<p>Building on that success, this post focuses on high performance by exploring how VectorChord can <strong>reach 10,000 QPS in PostgreSQL</strong>. We’ll walk you through how PostgreSQL’s reliable relational foundation integrates with advanced vector indexing to deliver fast, scalable search capabilities, and provide a production-ready setup along with a detailed architecture guide for managing large-scale vector computations even under heavy loads.</p>
<h2 id="heading-background">Background</h2>
<p>High queries per second (QPS) vector search is essential for real-time applications such as recommendation engines, image recognition, and natural language processing. These applications demand rapid, efficient searches to deliver immediate results, which is critical for maintaining user engagement and operational effectiveness. By leveraging PostgreSQL for vector search, you not only benefit from its mature, robust relational capabilities and advanced indexing but also avoid the overhead of managing a separate, specialized vector database. This unified approach simplifies integration and reduces system complexity while still meeting the performance needs of modern applications.</p>
<p>In our experiments, we evaluated performance on LAION datasets of 5 million and 100 million vectors, targeting a 90% precision level to reflect real-world scenarios. We also explored scaling the system with multiple replicas to boost QPS, demonstrating how PostgreSQL can effectively handle demanding vector search workloads while maintaining high performance.</p>
<h2 id="heading-architecture"><strong>Architecture</strong></h2>
<p>Using VectorChord on a multi-node cluster is more difficult than on a single machine. As with other services, a reliable distributed vector search usually has the following requirements:</p>
<ul>
<li><p><strong>Usability</strong>: Declarative configuration management</p>
</li>
<li><p><strong>Elasticity</strong>: Automatic scaling for nodes based on QPS requirements</p>
</li>
<li><p><strong>Load Balancing</strong>: Requests are handled effectively across different replica instances</p>
</li>
<li><p><strong>High Availability</strong>: Having multiple copies of the database and automatically recovering from a replica in case of failure</p>
</li>
<li><p><strong>Durability:</strong> Secure, persistent storage to ensure data integrity and prevent loss in the event of failures</p>
</li>
</ul>
<p>Therefore, we deploy VectorChord services on AWS EKS with OpenTofu and CloudNativePG to manage the infrastructure that provides the above capabilities. Here are some modules to build the example clusters:</p>
<ul>
<li><p><a target="_blank" href="https://opentofu.org/">OpenTofu</a>: Open source alternative of terraform, used for infrastructure as code (IaC) on AWS EKS</p>
</li>
<li><p><a target="_blank" href="https://cloudnative-pg.io/">CloudNativePG</a>: Manage the full lifecycle of a highly available PostgreSQL database cluster with a primary/standby architecture</p>
</li>
<li><p><a target="_blank" href="https://www.pgbouncer.org/">PGBouncer</a>: Lightweight connection pooler for PostgreSQL, provided by module <a target="_blank" href="https://cloudnative-pg.io/documentation/1.25/connection_pooling/">Pooler</a> of CloudNativePG</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1740621480659/d68f7e47-0dc6-48bc-939d-d87f00be6861.png" alt class="image--center mx-auto" /></p>
<p>Similar to <a target="_blank" href="https://proceedings.neurips.cc/paper_files/paper/2019/file/09853c7fb1d3f8ee67a61b6bf4a7f8e6-Paper.pdf">DiskANN</a>, VectorChord can fully utilize disk resources and still maintain competitive performance when memory is insufficient. Besides, VectorChord can also cache indexes in memory to achieve higher query performance.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">For more on the technical details of VectorChord, please see our previous <a target="_self" href="https://blog.vectorchord.ai/vectorchord-store-400k-vectors-for-1-in-postgresql#heading-vectorchords-solution-disk-friendly-ivfrabitq">post</a>.</div>
</div>
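<p>When memory headroom allows, the index can also be loaded into the buffer cache up front. A sketch, assuming the <code>vchordrq_prewarm</code> helper from the VectorChord documentation and an illustrative index name:</p>
<pre><code class="lang-sql">-- Warm the vchordrq index into memory before serving queries
SELECT vchordrq_prewarm('laion_rabitq'::regclass);
</code></pre>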

<p>To reach the QPS limit, we assume that memory can fully hold the dataset and select the AWS machine type accordingly.</p>
<p>In the following table, we list a machine whose memory can hold everything for each dataset size. As a sanity check, 100 million 768-dimensional float32 vectors alone occupy roughly 768 × 4 B × 100M ≈ 286 GiB before index overhead, hence the 512 GB instance:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>dataset</strong></td><td><strong>dim / size</strong></td><td><strong>machine / memory</strong></td></tr>
</thead>
<tbody>
<tr>
<td>LAION-5m</td><td>768 / 5,000,000</td><td>r7i.xlarge / 32GB</td></tr>
<tr>
<td>LAION-100m</td><td>768 / 100,000,000</td><td>r7i.16xlarge / 512GB</td></tr>
</tbody>
</table>
</div><h2 id="heading-results"><strong>Results</strong></h2>
<p>As outlined in <a target="_blank" href="https://github.com/tensorchord/VectorChord#query-performance-tuning">our documentation</a>, VectorChord accepts two key parameters: <code>nprob</code> and <code>epsilon</code>. Here, <code>nprob</code> defines the number of candidate vectors to scan during a query, while <code>epsilon</code> serves as the threshold for determining whether to rerank results. Both parameters play a critical role in influencing both <strong>QPS</strong> and <strong>recall</strong>, allowing users to fine-tune the balance between search speed and accuracy.</p>
<ul>
<li><p>Lower <code>nprob</code> will result in higher QPS and lower recall.</p>
</li>
<li><p>Lower <code>epsilon</code> will result in higher <strong>but more unstable</strong> QPS and lower recall.</p>
</li>
</ul>
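<p>Both parameters are session settings; a sketch of tuning them before a query, assuming the <code>vchordrq.probes</code> and <code>vchordrq.epsilon</code> GUC names used in the VectorChord documentation (the table name and <code>$1</code> query vector are placeholders):</p>
<pre><code class="lang-sql">SET vchordrq.probes = 20;   -- fewer probes: faster queries, lower recall
SET vchordrq.epsilon = 1.0; -- lower epsilon: faster but less stable, lower recall
SELECT id FROM laion_corpus ORDER BY emb &lt;-&gt; $1 LIMIT 10;
</code></pre>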
<p>If these parameters are fixed, the precision of each experiment will remain consistent. However, the QPS will still vary depending on the CPU and I/O performance of the underlying hardware. To ensure the precision stays above 0.9, we configured the parameters based on earlier experiments conducted with smaller datasets.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>dataset</strong></td><td><strong>nprob / epsilon</strong></td><td><strong>recall</strong></td></tr>
</thead>
<tbody>
<tr>
<td>LAION-5m</td><td>20 / 1.0</td><td>0.9324</td></tr>
<tr>
<td>LAION-100m</td><td>20 / 1.0</td><td>0.9042</td></tr>
</tbody>
</table>
</div><p>We deployed an <strong>r7i.4xlarge</strong> instance as the client within the same Availability Zone to query the VectorChord clusters. The clusters were tested with multiple workers ranging from <strong>16 to 128</strong>, simulating the concurrency levels typically applied to the database in real-world scenarios. This setup allows us to evaluate how VectorChord performs under varying workloads and concurrency conditions.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">With 16 vCPUs, <code>r7i.4xlarge</code> can effectively reduce CPU contention from the client side when the concurrency requests reach up to 128.</div>
</div>

<p>In line with common pgbouncer practices, we employed two types of instances: a primary instance for data ingestion and multiple replica instances for read-only workloads. Pgbouncer served as the load balancer, efficiently distributing queries across these backend instances. For the benchmark, we directed our query endpoint to the <code>read-only (ro)</code> pooler, while the <code>read-write (rw)</code> primary instance remained idle during the test. This setup ensures optimal resource utilization and scalability for read-heavy workloads.</p>
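<p>One quick way to confirm that the <code>ro</code> pooler really routes to replicas is to ask the backend whether it is in recovery:</p>
<pre><code class="lang-sql">-- Returns true on a streaming replica, false on the primary
SELECT pg_is_in_recovery();
</code></pre>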
<p>For the <strong>5 million dataset</strong>, we scaled the number of instances from <strong>5 to 15</strong> to achieve <strong>10,000 QPS</strong>. In contrast, for the <strong>100 million dataset</strong>, only <strong>2 instances</strong> were sufficient to meet the performance standard. The results of these experiments are illustrated in the accompanying figure, demonstrating the scalability and efficiency of VectorChord across different dataset sizes.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1740537380494/5e9a1e66-1cc6-4dc7-b26d-a68c83891289.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-cost"><strong>Cost</strong></h2>
<p>In summary, this is the minimum hardware cost if you need 10,000 QPS on such datasets:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>dataset</strong></td><td><strong>minimal hardware for 10000+ QPS</strong></td><td><strong>number of vcpu</strong></td><td><strong>estimate cost per month</strong></td></tr>
</thead>
<tbody>
<tr>
<td>LAION-5m</td><td>r7i.xlarge × 7</td><td>4 × 7 = 28 vcpu</td><td>$0.2646 × 7 × 720 = $1334</td></tr>
<tr>
<td>LAION-100m</td><td>r7i.16xlarge × 2</td><td>64 × 2 = 128 vcpu</td><td>$4.2336 × 2 × 720 = $6096</td></tr>
</tbody>
</table>
</div><div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">The cost of the primary instance is excluded from this calculation, as it can be shared and utilized with other PostgreSQL connection poolers, further optimizing resource allocation and reducing overall expenses.</div>
</div>

<p>This experiment empowers you to design and scale your own cluster to achieve the ideal level of VectorChord search performance, whether in the public cloud or a private data center. Even in scenarios where performance is the top priority, VectorChord delivers exceptionally competitive pricing, making it a cost-effective solution for high-performance vector search needs.</p>
<h2 id="heading-summary"><strong>Summary</strong></h2>
<p>Whether you’re launching a startup on a budget-friendly VPS or scaling a large cluster for uncompromised performance, VectorChord opens up unlimited possibilities. While 10,000 QPS is the focus of this post, it’s far from the limit of what VectorChord can achieve.</p>
<p>With its powerful indexing algorithm, VectorChord not only surpasses standard PostgreSQL solutions like pgvector by efficiently utilizing memory and high-speed SSDs, but also outperforms specialized vector databases in handling massive data volumes and high QPS demands. At the same cost, it delivers superior performance while eliminating the complexity of maintaining a separate vector search system.</p>
<p>If you’re eager to explore large-scale vector search solutions, join our <a target="_blank" href="https://discord.gg/KqswhpVgdU">Discord community</a> or reach out to us at <a target="_blank" href="mailto:support@tensorchord.ai">support@tensorchord.ai</a>. Let’s unlock the full potential of your data together!</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://github.com/tensorchord/VectorChord">https://github.com/tensorchord/VectorChord</a></div>
]]></content:encoded></item><item><title><![CDATA[VectorChord-BM25: Revolutionize PostgreSQL Search with BM25 Ranking — 3x Faster Than ElasticSearch]]></title><description><![CDATA[We’re excited to share something special with you: VectorChord-BM25, a new extension designed to make PostgreSQL’s full-text search even better. Whether you’re building a small app or managing a large-scale system, this tool brings advanced BM25 scor...]]></description><link>https://blog.vectorchord.ai/vectorchord-bm25-revolutionize-postgresql-search-with-bm25-ranking-3x-faster-than-elasticsearch</link><guid isPermaLink="true">https://blog.vectorchord.ai/vectorchord-bm25-revolutionize-postgresql-search-with-bm25-ranking-3x-faster-than-elasticsearch</guid><category><![CDATA[PostgreSQL]]></category><category><![CDATA[elasticsearch]]></category><category><![CDATA[full text search]]></category><category><![CDATA[llm]]></category><category><![CDATA[vector database]]></category><category><![CDATA[VectorSearch]]></category><dc:creator><![CDATA[Jinjing Zhou]]></dc:creator><pubDate>Mon, 24 Feb 2025 10:46:29 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1740191163205/63ff35b6-fc86-49af-90ef-49afc9dd723b.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We’re excited to share something special with you: <strong>VectorChord-BM25</strong>, a new extension designed to make PostgreSQL’s full-text search even better. Whether you’re building a small app or managing a large-scale system, this tool brings advanced BM25 scoring and ranking right into PostgreSQL, making your searches smarter and faster.</p>
<h3 id="heading-whats-new"><strong>What’s New?</strong></h3>
<ul>
<li><p><strong>BM25 Scoring &amp; Ranking</strong>: Get more precise and relevant search results with BM25, helping you find what matters most.</p>
</li>
<li><p><strong>Optimized Indexing</strong>: Thanks to the Block WeakAnd algorithm, searches are quicker and more efficient, even with large datasets.</p>
</li>
<li><p><strong>Enhanced Tokenization</strong>: Improved stemming and stop word handling mean better accuracy and smoother performance.</p>
</li>
</ul>
<p>We built VectorChord-BM25 to be simple, powerful, and fully integrated with PostgreSQL—because we believe great tools should make your life easier, not more complicated. We hope you’ll give it a try and let us know what you think.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://github.com/tensorchord/VectorChord-BM25">https://github.com/tensorchord/VectorChord-BM25</a></div>
<p> </p>
<h2 id="heading-bm25">BM25</h2>
<p>Before we get to the exciting news, let’s take a quick look at <strong>BM25</strong>, the algorithm that powers modern search engines. BM25 is a probabilistic ranking function that determines how relevant a document is to a search query.</p>
<p>The BM25 formula might look a bit intimidating at first, but let’s break it down step by step to make it easy to understand. Here it is:</p>
<p>$$\text{score}(Q, D) = \sum_{q \in Q} \text{IDF}(q) \cdot \frac{f(q, D) \, (k_1 + 1)}{f(q, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$$</p><p><strong>Term Frequency (TF)</strong>: The term f(q,D) represents how often the query term <em>q</em> appears in document <em>D</em>. The more a term appears in a document, the higher its relevance score. For example, if "AI" appears 5 times in Document A and 2 times in Document B, Document A gets a higher score for this term.</p>
<p><strong>Inverse Document Frequency (IDF)</strong>: The term IDF(q) measures how rare or common the term <em>q</em> is across all documents. Rare terms (e.g., "NVIDIA") are given more weight than common terms (e.g., "revenue"). This ensures that unique terms have a greater impact on the relevance score.</p>
<p><strong>Document Length Normalization</strong>: The term ∣<em>D</em>∣ represents the length of document <em>D</em>, while avgdl is the average length of all documents in the collection. This part of the formula adjusts the score to account for document length, ensuring shorter documents aren’t unfairly penalized. For instance, a concise report won’t be overshadowed by a lengthy one that only briefly mentions the query term.</p>
<p><strong>Tuning Parameters</strong>: The parameters <em>k</em><sub>1</sub> and <em>b</em> allow the formula to be fine-tuned. <em>k</em><sub>1</sub> controls the extent to which term frequency impacts the score, while <em>b</em> balances the effect of document length normalization. These parameters can be adjusted to optimize results for specific datasets.</p>
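<p>To see how the terms interact, here is a quick worked example with our own illustrative numbers, using the common defaults <em>k</em><sub>1</sub> = 1.2 and <em>b</em> = 0.75. Suppose a query term has IDF = 2.0, appears 5 times in a document, and the document is exactly average length, so |D|/avgdl = 1:</p>
<p>$$\text{score} = 2.0 \cdot \frac{5 \cdot (1.2 + 1)}{5 + 1.2 \cdot \left(1 - 0.75 + 0.75 \cdot 1\right)} = 2.0 \cdot \frac{11}{6.2} \approx 3.55$$</p>
<p>Note the saturation built into the formula: raising the term frequency from 5 to 50 only lifts the score to about 4.30, because the score can never exceed IDF · (<em>k</em><sub>1</sub> + 1) = 4.4 no matter how often the term repeats.</p>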
<p>Consider searching a database of financial reports for "Tesla stock performance." A report that mentions "Tesla" ten times will score higher than one that mentions it only twice. However, the term "stock" might appear in many reports, so it’s given less weight than "Tesla," which is more specific. Furthermore, a short, concise report about Tesla’s stock performance won’t be overshadowed by a lengthy report that only briefly mentions Tesla.</p>
<h2 id="heading-existing-solution-in-postgres">Existing Solution in Postgres</h2>
<p>Now that we’ve covered the basics, let’s take a look at the existing solutions for full-text search in PostgreSQL.</p>
<p>PostgreSQL provides built-in support for full-text search through the combination of the <code>tsvector</code> data type and GIN (Generalized Inverted Index) indexes. Here’s how it works: text is first converted into a <code>tsvector</code>, which tokenizes the content into lexemes—standardized word forms that make searching more efficient. A GIN index is then created on the <code>tsvector</code> column, significantly speeding up query performance and enabling fast retrieval of relevant documents, even when dealing with large text fields. This integration of text processing and advanced indexing makes PostgreSQL a powerful and scalable solution for applications that require efficient full-text search.</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Create a table with a text column</span>
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> documents (
    <span class="hljs-keyword">id</span> <span class="hljs-built_in">SERIAL</span> PRIMARY <span class="hljs-keyword">KEY</span>,
    <span class="hljs-keyword">content</span> <span class="hljs-built_in">TEXT</span>,
    content_vector tsvector
);

<span class="hljs-comment">-- Insert sample data with tsvector conversion</span>
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> documents (<span class="hljs-keyword">content</span>, content_vector) <span class="hljs-keyword">VALUES</span>
(<span class="hljs-string">'PostgreSQL is a powerful, open-source database system.'</span>, to_tsvector(...)),
(<span class="hljs-string">'Full-text search in PostgreSQL is efficient and scalable.'</span>, to_tsvector(...)),
(<span class="hljs-string">'BM25 is a ranking function used by search engines.'</span>, to_tsvector(...));

<span class="hljs-comment">-- Create a GIN index on the tsvector column</span>
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">INDEX</span> idx_content_vector <span class="hljs-keyword">ON</span> documents <span class="hljs-keyword">USING</span> GIN (content_vector);

<span class="hljs-comment">-- Query using tsvector, rank results with ts_rank, and leverage the GIN index</span>
<span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">id</span>, <span class="hljs-keyword">content</span>, ts_rank(content_vector, to_tsquery(<span class="hljs-string">'english'</span>, <span class="hljs-string">'PostgreSQL &amp; search'</span>)) <span class="hljs-keyword">AS</span> <span class="hljs-keyword">rank</span>
<span class="hljs-keyword">FROM</span> documents
<span class="hljs-keyword">WHERE</span> content_vector @@ to_tsquery(<span class="hljs-string">'english'</span>, <span class="hljs-string">'PostgreSQL &amp; search'</span>)
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> <span class="hljs-keyword">rank</span> <span class="hljs-keyword">DESC</span>;
</code></pre>
<p>However, PostgreSQL has a limitation: it lacks modern relevance scoring mechanisms like BM25. Instead, it retrieves every matching document and relies on <code>ts_rank</code> to score them afterwards, and because the GIN index cannot return rows in rank order, all matches must be scored and sorted before the best ones surface. This makes it hard to quickly identify the most relevant results, especially on large datasets.</p>
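<p>You can observe this cost directly in the query plan. The sketch below reuses the <code>documents</code> table and GIN index from the example above; the exact plan shape varies with PostgreSQL version and data size, but the ranking always shows up as a separate sort over every matching row rather than an index-ordered scan:</p>
<pre><code class="lang-sql">-- The GIN index can locate matching rows, but it cannot return them in
-- ts_rank order, so the planner must sort all matches before applying any limit.
EXPLAIN
SELECT id, ts_rank(content_vector, to_tsquery('english', 'PostgreSQL &amp; search')) AS rank
FROM documents
WHERE content_vector @@ to_tsquery('english', 'PostgreSQL &amp; search')
ORDER BY rank DESC;
</code></pre>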
<p>Another solution is ParadeDB, which pushes full-text search queries down to Tantivy. It supports BM25 scoring and complex query patterns like negative terms, aiming to be a complete replacement for ElasticSearch. However, it uses its own syntax for filtering and querying, and it delegates filtering operations to Tantivy instead of relying on Postgres directly. Its implementation requires several hooks into Postgres’s query planning and storage, which can lead to compatibility issues.</p>
<p>In contrast, VectorChord-BM25 takes a different approach. It focuses exclusively on bringing BM25 ranking to PostgreSQL in a lightweight and native way. We implemented the BM25 ranking algorithm and the Block WeakAnd technique from scratch, building it as a custom operator and index (similar to <code>pgvector</code>) to accelerate queries. Designed to be intuitive and efficient, VectorChord-BM25 <strong>provides a seamless API for enhanced full-text search and ranking, all while staying fully integrated with PostgreSQL’s ecosystem</strong>.</p>
<h2 id="heading-vectorchord-bm25">VectorChord-BM25</h2>
<p>Our implementation introduces a novel approach by developing the BM25 index and search algorithm from the ground up, while seamlessly integrating with PostgreSQL’s existing development interfaces to ensure maximum compatibility.</p>
<p>Inspired by the PISA engine, Tantivy, and Lucene, we incorporated the BlockMax WeakAnd algorithm to enable efficient score-based filtering and ranking. We also employ bitpacking to compress document IDs and reimplemented a tokenizer that builds its vocabulary from user data, bringing our behavior more closely in line with ElasticSearch’s.</p>
<p>The table below compares the <strong>Top1000 Queries Per Second (QPS)</strong>—a metric that measures how many queries a system can process per second while retrieving the top 1000 results for each query—between our implementation and ElasticSearch across various datasets from <a target="_blank" href="https://github.com/xhluca/bm25-benchmarks">bm25-benchmarks</a>:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1740189817247/5a81c020-1aa2-452d-8e8d-b9653c5b3489.png" alt class="image--center mx-auto" /></p>
<p>On average, our implementation achieves <strong>3 times higher QPS</strong> compared to ElasticSearch across the tested datasets, showcasing its efficiency and scalability. However, speed alone isn’t sufficient—we also prioritize accuracy. To ensure relevance, we evaluated <strong>NDCG@10 (Normalized Discounted Cumulative Gain at 10)</strong>, a key metric for assessing ranking quality.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text"><strong>NDCG@10</strong> (Normalized Discounted Cumulative Gain at rank 10) is a metric that assesses the relevance of documents within the top 10 positions of a ranked list. It emphasizes not only the relevance of each document but also its position, ensuring higher-ranked relevant documents contribute more to the overall score.</div>
</div>
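<p>For reference, the standard formulation behind the metric (where <em>rel<sub>i</sub></em> is the graded relevance of the result at position <em>i</em>, and IDCG@k is the DCG of the ideal ordering):</p>
<p>$$\text{DCG@}k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}, \qquad \text{NDCG@}k = \frac{\text{DCG@}k}{\text{IDCG@}k}$$</p>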

<p>The table below compares the NDCG@10 scores between VectorChord-BM25 and ElasticSearch across various datasets:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1740190423528/10f75122-34fc-4fa2-bb66-3e26e09516ec.png" alt class="image--center mx-auto" /></p>
<p>We have dedicated substantial effort to align VectorChord-BM25 with ElasticSearch’s behavior, ensuring a fair and precise comparison. As demonstrated in the table, our implementation achieves NDCG@10 scores that are comparable across most datasets, with certain cases even surpassing ElasticSearch (e.g., trec-covid and scifact).</p>
<p>We’ll share the technical details of our alignment efforts in a later section, including how we addressed tokenization, scoring, and other critical components to achieve these results. Before that, let’s explore how to use VectorChord-BM25 in PostgreSQL.</p>
<h2 id="heading-quick-start">Quick start</h2>
<p>To get started with VectorChord-BM25, we’ve put together a detailed guide in our <a target="_blank" href="https://github.com/tensorchord/VectorChord-bm25/">GitHub README</a> that walks you through the installation and configuration process. Below, you’ll find a complete example showing how to use VectorChord-BM25 for BM25 full-text search in Postgres. Each SQL snippet comes with a clear explanation of what it does and why it’s useful.</p>
<p>The extension consists of three main components:</p>
<ul>
<li><p><strong>Tokenizer</strong>: Converts text into a bm25vector, which is similar to a sparse vector that stores vocabulary IDs and their frequencies.</p>
</li>
<li><p><strong>bm25vector</strong>: Represents the tokenized text in a format suitable for BM25 scoring.</p>
</li>
<li><p><strong>bm25vector Index</strong>: Speeds up the search and ranking process, making it more efficient.</p>
</li>
</ul>
<p>If you’d like to tokenize some text, you can use the <code>tokenize</code> function. It takes two arguments: the text you want to tokenize and the name of the tokenizer.</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- tokenize text with bert tokenizer</span>
<span class="hljs-keyword">SELECT</span> tokenize(<span class="hljs-string">'A quick brown fox jumps over the lazy dog.'</span>, <span class="hljs-string">'Bert'</span>);
<span class="hljs-comment">-- Output: {2474:1, 2829:1, 3899:1, 4248:1, 4419:1, 5376:1, 5831:1}</span>
<span class="hljs-comment">-- The output is a bm25vector, 2474:1 means the word with id 2474 appears once in the text.</span>
</code></pre>
<p>One unique aspect of the BM25 score is that it relies on a global document frequency. This means the score of a word in a document is influenced by how frequently that word appears across all documents in the set. To calculate the BM25 score between a bm25vector and a query, you’ll first need a document set. Once that’s in place, you can use the <code>&lt;&amp;&gt;</code> operator to perform the calculation.</p>
<p>Here is an example step by step. First, create a table and insert some documents:</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Setup the document table</span>
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> documents (
    <span class="hljs-keyword">id</span> <span class="hljs-built_in">SERIAL</span> PRIMARY <span class="hljs-keyword">KEY</span>,
    passage <span class="hljs-built_in">TEXT</span>,
    embedding bm25vector
);

<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> documents (passage) <span class="hljs-keyword">VALUES</span>
(<span class="hljs-string">'PostgreSQL is a powerful, open-source object-relational database system. It has over 15 years of active development.'</span>),
(<span class="hljs-string">'Full-text search is a technique for searching in plain-text documents or textual database fields. PostgreSQL supports this with tsvector.'</span>),
...
(<span class="hljs-string">'Effective search ranking algorithms, such as BM25, improve search results by understanding relevance.'</span>);
</code></pre>
<p>Then tokenize the passages:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">UPDATE</span> documents <span class="hljs-keyword">SET</span> embedding = tokenize(passage, <span class="hljs-string">'Bert'</span>);
</code></pre>
<p>Create the index on the bm25vector column so that we can collect the global document frequency.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">INDEX</span> documents_embedding_bm25 <span class="hljs-keyword">ON</span> documents <span class="hljs-keyword">USING</span> bm25 (embedding bm25_ops);
</code></pre>
<p>Now we can compute the BM25 score between the query and the vectors. It’s worth noting that the BM25 score is negated by design: a lower (more negative) score indicates a more relevant document. This way, the default ascending <code>ORDER BY</code> returns the most relevant documents first.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">id</span>, passage, embedding &lt;&amp;&gt; to_bm25query(<span class="hljs-string">'documents_embedding_bm25'</span>, <span class="hljs-string">'PostgreSQL'</span>, <span class="hljs-string">'Bert'</span>) <span class="hljs-keyword">AS</span> <span class="hljs-keyword">rank</span>
<span class="hljs-keyword">FROM</span> documents
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> <span class="hljs-keyword">rank</span>
<span class="hljs-keyword">LIMIT</span> <span class="hljs-number">10</span>;
</code></pre>
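<p>Unlike the <code>ts_rank</code> pattern shown earlier, the bm25 index is designed to serve top-k queries like the one above directly, so the <code>ORDER BY … LIMIT</code> can be satisfied by an index scan without sorting every row. As a quick sanity check (on very small tables the planner may still prefer a sequential scan):</p>
<pre><code class="lang-sql">-- Look for an index scan on documents_embedding_bm25 in the plan,
-- rather than a Sort node above a sequential scan.
EXPLAIN
SELECT id, passage, embedding &lt;&amp;&gt; to_bm25query('documents_embedding_bm25', 'PostgreSQL', 'Bert') AS rank
FROM documents
ORDER BY rank
LIMIT 10;
</code></pre>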
<h3 id="heading-other-tokenizers">Other Tokenizers</h3>
<p>In addition to BERT, VectorChord-BM25 also supports <code>Tocken</code> and <code>Unicode</code> Tokenizers. <code>Tocken</code> is a Unicode tokenizer pre-trained on the <code>wiki-103-raw</code> dataset with a minimum frequency (<code>min_freq</code>) of 10. Since it’s a pre-trained tokenizer, you can use it similarly to BERT.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> tokenize(<span class="hljs-string">'A quick brown fox jumps over the lazy dog.'</span>, <span class="hljs-string">'Tocken'</span>);
</code></pre>
<p><code>Unicode</code> is a tokenizer that builds a vocabulary list from your data, similar to the standard behavior of Elasticsearch and other full-text search engines. To use it, create a tokenizer specific to your table with <code>create_unicode_tokenizer_and_trigger(vocab_list_name, table_name, source_text_column, tokenized_vec_column)</code>.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> documents (<span class="hljs-keyword">id</span> <span class="hljs-built_in">SERIAL</span>, <span class="hljs-built_in">text</span> <span class="hljs-built_in">TEXT</span>, embedding bm25vector);
<span class="hljs-keyword">SELECT</span> create_unicode_tokenizer_and_trigger(<span class="hljs-string">'test_token'</span>, <span class="hljs-string">'documents'</span>, <span class="hljs-string">'text'</span>, <span class="hljs-string">'embedding'</span>);

<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> documents (<span class="hljs-built_in">text</span>) <span class="hljs-keyword">VALUES</span> (<span class="hljs-string">'PostgreSQL is a powerful, open-source object-relational database system.'</span>);
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">INDEX</span> documents_embedding_bm25 <span class="hljs-keyword">ON</span> documents <span class="hljs-keyword">USING</span> bm25 (embedding bm25_ops);
<span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">id</span>, <span class="hljs-built_in">text</span>, embedding &lt;&amp;&gt; to_bm25query(<span class="hljs-string">'documents_embedding_bm25'</span>, <span class="hljs-string">'PostgreSQL'</span>, <span class="hljs-string">'test_token'</span>) <span class="hljs-keyword">AS</span> <span class="hljs-keyword">rank</span>
    <span class="hljs-keyword">FROM</span> documents
    <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> <span class="hljs-keyword">rank</span>
    <span class="hljs-keyword">LIMIT</span> <span class="hljs-number">10</span>;
</code></pre>
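<p>Because <code>create_unicode_tokenizer_and_trigger</code> installs a trigger on the table, rows inserted afterwards have their <code>embedding</code> column populated automatically; notice that the <code>INSERT</code> above never calls <code>tokenize</code>. A small illustration with our own sample sentence:</p>
<pre><code class="lang-sql">-- The trigger tokenizes new rows on insert; no manual tokenize() call is needed.
INSERT INTO documents (text) VALUES ('BM25 brings relevance ranking to PostgreSQL.');
SELECT id, embedding FROM documents ORDER BY id DESC LIMIT 1;
</code></pre>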
<p>Please check out the <a target="_blank" href="https://github.com/tensorchord/VectorChord-bm25/blob/main/tokenizer.md">tokenizer documentation</a> if you want to know more about it.</p>
<h2 id="heading-faster-than-elasticsearch-really">Faster than ElasticSearch (Really?)</h2>
<p><img src="https://i.imgflip.com/6b4xz9.jpg" alt="math Memes &amp; GIFs - Imgflip" class="image--center mx-auto" /></p>
<p>To ensure our new plugin provides meaningful and relevant results—not just faster performance—we rigorously evaluated VectorChord-BM25 using <a target="_blank" href="https://github.com/xhluca/bm25-benchmarks">bm25-benchmarks</a>. We leveraged the widely recognized BEIR benchmark for information retrieval, focusing on two key metrics: <strong>QPS (Queries Per Second)</strong> to measure speed and <strong>NDCG@10 (Normalized Discounted Cumulative Gain at 10)</strong> to assess relevance and ranking accuracy.</p>
<p>In our initial version, we were thrilled to see our system achieve <strong>3-5 times higher QPS</strong> compared to ElasticSearch. However, we also noticed that on certain datasets, our NDCG@10 scores were significantly lower than ElasticSearch’s. After a thorough analysis, we realized these gaps weren’t caused by our indexing implementation but rather by differences in tokenization approaches.</p>
<p>To address this, we invested considerable effort to align our tokenization process with ElasticSearch’s, ensuring a more accurate and fair comparison.</p>
<h3 id="heading-issue-1-stopword-list">Issue 1: Stopword List</h3>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">A <strong>stopword</strong> is a common word filtered out during text processing because it adds little meaning—for example, words like "the," "is," and "at."</div>
</div>

<p>ElasticSearch defaults to Lucene’s Standard Tokenizer, which relies on a relatively short stopword list, whereas our initial implementation used NLTK’s much more comprehensive one. As a result, ElasticSearch indexes and scans terms that we were filtering out entirely: its inverted lists are longer and take more time to traverse, which inflated our apparent performance advantage.</p>
<h3 id="heading-issue-2-stemming">Issue 2: Stemming</h3>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">A <strong>stemmer</strong> reduces words to their base form, "running" and "ran" both become "run."</div>
</div>

<p>We initially used the <strong>snowball stemmer</strong> from the rust-stemmer library to handle word variations, but we observed discrepancies compared to ElasticSearch’s implementation. Upon investigation, we discovered that rust-stemmer’s version of snowball was outdated. Following the guidelines from the official snowball repository, we regenerated the stemmer files using the latest version.</p>
<p>When we aligned both the <strong>stopword list</strong> and <strong>stemmer</strong> between our system and ElasticSearch, our performance advantage decreased from <strong>300% to 40%</strong>. Even so, on three datasets—nq, fever, and climate-fever—a noticeable performance gap persisted. A deeper comparison revealed a subtle but critical detail: in the bm25-benchmark, ElasticSearch preprocesses data differently from other frameworks, which contributed to the remaining discrepancy.</p>
<h3 id="heading-issue-3-data-preprocessing">Issue 3: Data Preprocessing</h3>
<p>In the BEIR dataset, each document includes both a title and a text field. While most frameworks concatenate these fields into a single string for indexing, ElasticSearch accepts JSON input and indexes the title and text separately. During querying, ElasticSearch performs a multi_match operation, searching both fields independently and combining their scores (using the higher score plus 0.5 times the lower score). This approach yields significantly better NDCG@10 results but requires searching two separate indexes, which can substantially impact performance.</p>
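<p>For instance, if a query scores 4.0 against a document’s title field and 2.0 against its text field, ElasticSearch’s combined relevance under this scheme is 4.0 + 0.5 × 2.0 = 5.0.</p>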
<p>To ensure a fair comparison, we re-ran our ElasticSearch tests by concatenating the title and text fields. With this adjustment, VectorChord-BM25 was able to match ElasticSearch’s results. Interestingly, ElasticSearch’s <strong>QPS (geometric mean)</strong> increased from <strong>135 to 341</strong> when using concatenated fields, making it <strong>25% faster</strong> than VectorChord-BM25’s <strong>271.91 QPS</strong> in Top10 query tests. However, in Top1000 tests, VectorChord-BM25 achieved a QPS of <strong>112</strong> compared to ElasticSearch’s <strong>49</strong>—making our implementation <strong>2.26 times faster</strong>.</p>
<p>This experience highlights the challenges of conducting fair performance comparisons between different BM25 full-text search implementations. The <strong>tokenizer</strong> is an exceptionally complex and influential component, and ElasticSearch’s intricate default settings add another layer of complexity to the evaluation process.</p>
<h2 id="heading-future">Future</h2>
<p>We’re still in the early stages of this project, and we’ve already pinpointed several areas for performance optimization. Tokenization is an inherently complex process—even for English, we face numerous decisions and trade-offs. Our next step is to <strong>fully decouple the tokenization process</strong>, transforming it into an independent and extensible extension. This will enable us to support multiple languages, allow users to customize tokenization for better results, and even incorporate advanced features like synonym handling.</p>
<p>Our ultimate goal is to empower users to perform <strong>high-quality, relevance-based full-text searches</strong> on PostgreSQL with ease. Combined with <a target="_blank" href="https://github.com/tensorchord/VectorChord/"><strong>VectorChord</strong></a>, we aim to deliver a <strong>first-class data infrastructure for RAG (Retrieval-Augmented Generation)</strong>. After all, choosing PostgreSQL is always a solid decision!</p>
<p>If you have any questions about vector search or full-text search in PostgreSQL, feel free to reach out to us! You can connect with us at:</p>
<ul>
<li><p>Discord: <a target="_blank" href="https://discord.gg/KqswhpVgdU">https://discord.gg/KqswhpVgdU</a></p>
</li>
<li><p>Email: <a target="_blank" href="mailto:vectorchord-inquery@tensorchord.ai">vectorchord-inquery@tensorchord.ai</a></p>
</li>
</ul>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://github.com/tensorchord/VectorChord-bm25">https://github.com/tensorchord/VectorChord-bm25</a></div>
<p> </p>
<h2 id="heading-more-benchmarks">More Benchmarks</h2>
<p>In addition to the Top1000 benchmarks, we have also conducted extensive evaluations for Top10 results. These benchmarks provide further insight into the performance of our implementation across different datasets. If you’re interested in exploring the results in detail, the full data is shown below.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1740190947932/67305460-724d-4dc5-aa7f-606a6ceb13cb.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1740190965658/ca72272a-49af-43f8-87b7-16d14b383e8a.png" alt class="image--center mx-auto" /></p>
]]></content:encoded></item></channel></rss>