elasticsearch

History

Jim Ferenczi c7a482a462 Remove vectors from `_source` transparently (#130382)	2025-07-07 10:34:37 +01:00
src/yamlRestTest	Remove vectors from `_source` transparently (#130382)	2025-07-07 10:34:37 +01:00
build.gradle	Remove vectors from `_source` transparently (#130382)	2025-07-07 10:34:37 +01:00

Remove vectors from `_source` transparently (#130382)

## Summary

This PR introduces a new **hybrid mode for the `_source` field** that stores the original source **without dense vector fields**. The goal is to reduce storage overhead and improve performance, especially as vector sizes grow. The setting also affects whether vectors are returned in **search and get APIs**, which matters even for synthetic source, since reconstructing vectors from doc values can be expensive.

## Background

Today, Elasticsearch supports two modes for `_source`:

* **Stored**: Original JSON is persisted as-is.
* **Synthetic**: `_source` is reconstructed from doc values at read time.

However, dense vector fields have become problematic:

* They **don’t compress well**, unlike text.
* They are **already stored in doc values**, so storing them again in `_source` is wasteful.
* Their `_source` representation is often **overly precise** (double precision), which isn’t needed for search/indexing.
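
To make the storage argument above concrete, here is a minimal sketch that compares the size of a vector serialized as JSON text with the same vector packed as 32-bit floats. The 1536-dimension figure and the random values are assumptions chosen for illustration; this does not reproduce Elasticsearch's actual on-disk format.

```python
import json
import random
import struct

# Hypothetical 1536-dimensional embedding with double-precision components.
random.seed(0)
vector = [random.uniform(-1.0, 1.0) for _ in range(1536)]

# Size of the vector as it would appear as text inside a JSON document.
json_bytes = len(json.dumps(vector).encode("utf-8"))

# Size of the same vector packed as 32-bit floats (4 bytes per component).
float32_bytes = len(struct.pack(f"<{len(vector)}f", *vector))

print(f"JSON text:      {json_bytes} bytes")
print(f"float32 packed: {float32_bytes} bytes")
print(f"ratio:          {json_bytes / float32_bytes:.1f}x")
```

The JSON form spends many characters per component on digits that a 32-bit representation cannot hold anyway, which is the sense in which the `_source` copy is "overly precise".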

Switching to fully synthetic source is an option, but when a document has many metadata fields, reading the original `_source` (minus vectors) in one piece is often faster and more practical than reconstructing each field from its own storage.

## What This PR Adds

We’re introducing a **hybrid source mode**:

* Keeps the original `_source`, **minus any `dense_vector` fields**.
* Built on top of the **synthetic source infrastructure**, reusing parts of it.
* Controlled via a **single index-level setting**.

### Key Behavior

* When enabled, `dense_vector` fields are **excluded from `_source` at index time**.
* The setting **also controls whether vectors are returned in search and get APIs**:

  * This matters even for **synthetic source**, as **rebuilding vectors is costly**.
* You can override behavior at query time using the `exclude_vectors` option.
* The setting is:

  * **Disabled by default**
  * **Protected by a feature flag**
  * Intended to be **enabled by default for new indices** in a follow-up
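
The behavior described above can be modeled with a small, self-contained sketch. This is a toy model, not the Elasticsearch implementation: `index_document`, `fetch_source`, and `to_float32` are hypothetical helpers, only top-level fields are handled, and the assumption that vectors are kept as 32-bit floats outside `_source` is made for illustration. Only the `exclude_vectors` name comes from the description.

```python
import struct


def to_float32(value: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def index_document(doc: dict, vector_fields: set) -> tuple:
    """Split a document into the stored source and a separate vector store.

    Hypothetical model: vector fields are dropped from the stored source
    and kept on the side as 32-bit floats.
    """
    stored_source = {k: v for k, v in doc.items() if k not in vector_fields}
    vectors = {
        k: [to_float32(x) for x in v]
        for k, v in doc.items()
        if k in vector_fields
    }
    return stored_source, vectors


def fetch_source(stored_source: dict, vectors: dict, exclude_vectors: bool = True) -> dict:
    """Return the source, adding vectors back only when explicitly requested."""
    if exclude_vectors:
        return dict(stored_source)
    return {**stored_source, **vectors}


doc = {"title": "hello", "embedding": [0.1, 0.25, 0.5]}
stored, vectors = index_document(doc, {"embedding"})

print(fetch_source(stored, vectors))                         # vectors omitted
print(fetch_source(stored, vectors, exclude_vectors=False))  # vectors rehydrated
```

In this model the rehydrated `0.1` differs slightly from the value that was indexed, because it has been rounded to 32 bits, while `0.25` and `0.5` survive exactly. That is the kind of precision loss the hybrid mode trades for smaller storage.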

## Motivation

This hybrid option is designed for use cases where users:

* Want faster reads than full synthetic offers.
* Don’t want the storage cost of large vectors in `_source`.
* Are okay with **some loss of precision** when vectors are rehydrated.

By making this setting the default for newly created indices in a follow-up, we can help users avoid surprises from the hidden cost of storing and returning high-dimensional vectors.

## Benchmark Results

Benchmarking this PR against `main` using the `openai` rally track shows substantial improvements at the cost of a loss of precision when retrieving the original vectors:

| Metric                                     | Main (Baseline) | This PR (Contender) | Change          | % Change    |
| :----------------------------------------- | :-------------- | :------------------ | :-------------- | :---------- |
| **Indexing throughput (mean)**             | 1690.77 docs/s  | 2704.57 docs/s      | +1013.79 docs/s | **+59.96%** |
| **Indexing time**                          | 120.25 min      | 74.32 min           | –45.93 min      | **–38.20%** |
| **Merge time**                             | 132.56 min      | 69.28 min           | –63.28 min      | **–47.74%** |
| **Merge throttle time**                    | 100.99 min      | 36.30 min           | –64.69 min      | **–64.06%** |
| **Flush time**                             | 2.71 min        | 1.48 min            | –1.23 min       | **–45.29%** |
| **Refresh count**                          | 60              | 42                  | –18             | **–30.00%** |
| **Dataset / Store size**                   | 52.29 GB        | 19.30 GB            | –32.99 GB       | **–63.09%** |
| **Young Gen GC time**                      | 30.64 s         | 22.17 s             | –8.47 s         | **–27.65%** |
| **Search throughput (k=10, multi-client)** | 613 ops/s       | 677 ops/s           | +64 ops/s       | **+10.42%** |
| **Search latency (p99, k=10)**             | 29.5 ms         | 26.5 ms             | –3.0 ms         | **–10.43%** |
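
For readers who want to check how the change columns relate to the two measurement columns, the following sketch recomputes a few rows as `(contender - baseline) / baseline`. The `percent_change` helper is illustrative, and the inputs are the rounded values shown in the table, so rows that were presumably derived from unrounded measurements may differ in the last digits.

```python
def percent_change(baseline: float, contender: float) -> float:
    """Relative change of the contender against the baseline, in percent."""
    return (contender - baseline) / baseline * 100.0


# (baseline, contender) pairs copied from the table above.
rows = {
    "Indexing time (min)": (120.25, 74.32),
    "Merge time (min)": (132.56, 69.28),
    "Dataset / Store size (GB)": (52.29, 19.30),
}

for name, (baseline, contender) in rows.items():
    delta = contender - baseline
    print(f"{name}: {delta:+.2f} ({percent_change(baseline, contender):+.2f}%)")
```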

## Miscellaneous

Reindexing is not covered in this PR. Since it's one of the main use cases for returning vectors, the plan is for reindex to **force the inclusion of** vectors by default. This will be addressed in a follow-up, as this PR is already quite large.
