elasticsearch/test/test-clusters
Jim Ferenczi c7a482a462
Remove vectors from `_source` transparently (#130382)
## Summary

This PR introduces a new **hybrid mode for the `_source` field** that stores the original source **without dense vector fields**. The goal is to reduce storage overhead and improve performance, especially as vector sizes grow. The index setting that enables this mode also controls whether vectors are returned by the **search and get APIs**, which matters even for synthetic source, since reconstructing vectors from doc values can be expensive.

## Background

Today, Elasticsearch supports two modes for `_source`:

* **Stored**: Original JSON is persisted as-is.
* **Synthetic**: `_source` is reconstructed from doc values at read time.

However, dense vector fields have become problematic:

* They **don’t compress well**, unlike text.
* They are **already stored in doc values**, so storing them again in `_source` is wasteful.
* Their `_source` representation is often **overly precise** (double precision), which isn’t needed for search/indexing.

While switching to full synthetic source is an option, retrieving the original `_source` (minus vectors) in one read is often faster and more practical than reconstructing each field from its own doc values, especially when documents carry many metadata fields.

## What This PR Adds

We’re introducing a **hybrid source mode**:

* Keeps the original `_source`, **minus any `dense_vector` fields**.
* Built on top of the **synthetic source infrastructure**, reusing parts of it.
* Controlled via a **single index-level setting**.

### Key Behavior

* When enabled, `dense_vector` fields are **excluded from `_source` at index time**.
* The setting **also controls whether vectors are returned in search and get APIs**:

  * This matters even for **synthetic source**, as **rebuilding vectors is costly**.
* You can override behavior at query time using the `exclude_vectors` option.
* The setting is:

  * **Disabled by default**
  * **Protected by a feature flag**
  * Intended to be **enabled by default for new indices** in a follow-up
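
The behavior described above can be modeled with a small sketch. This is not Elasticsearch code: the function names, the `index_excludes_vectors` flag, and the mapping layout are assumptions made for illustration, and the only name taken from this PR is the `exclude_vectors` request option. The sketch only models which fields are visible; in the real system, returning vectors means rehydrating them from doc values.

```python
from copy import deepcopy
from typing import Optional


def strip_dense_vectors(source: dict, properties: dict) -> dict:
    """Return a copy of `source` without fields mapped as dense_vector.

    `properties` mimics the "properties" section of an index mapping
    (hypothetical layout, used here only for illustration).
    """
    result = {}
    for name, value in source.items():
        field = properties.get(name, {})
        if field.get("type") == "dense_vector":
            continue  # vectors are dropped from the stored source
        if isinstance(value, dict) and "properties" in field:
            # Recurse into object fields so nested vectors are dropped too.
            result[name] = strip_dense_vectors(value, field["properties"])
        else:
            result[name] = deepcopy(value)
    return result


def returned_source(
    source: dict,
    properties: dict,
    index_excludes_vectors: bool,
    exclude_vectors: Optional[bool] = None,
) -> dict:
    """Model the read side: a per-request override wins over the index setting."""
    exclude = index_excludes_vectors if exclude_vectors is None else exclude_vectors
    return strip_dense_vectors(source, properties) if exclude else deepcopy(source)


properties = {
    "title": {"type": "text"},
    "embedding": {"type": "dense_vector", "dims": 3},
    "meta": {
        "properties": {
            "lang": {"type": "keyword"},
            "vec": {"type": "dense_vector", "dims": 2},
        }
    },
}
doc = {
    "title": "hello",
    "embedding": [0.1, 0.2, 0.3],
    "meta": {"lang": "en", "vec": [1.0, 2.0]},
}

stored = strip_dense_vectors(doc, properties)
with_vectors = returned_source(doc, properties, True, exclude_vectors=False)
```

Here `stored` contains only `title` and `meta.lang`, while `with_vectors` shows the request-level override bringing the vectors back even though the index-level flag excludes them.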

## Motivation

This hybrid option is designed for use cases where users:

* Want faster reads than full synthetic offers.
* Don’t want the storage cost of large vectors in `_source`.
* Are okay with **some loss of precision** when vectors are rehydrated.
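
To make the precision trade-off concrete, here is a minimal sketch that assumes vectors are kept as 32-bit floats in doc values while the original JSON carried 64-bit doubles. The helper `to_float32` is hypothetical and only simulates that round trip with the standard library.

```python
import struct


def to_float32(x: float) -> float:
    """Round-trip a 64-bit Python float through a 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


original = [0.1, 0.25, -0.333333333333]
rehydrated = [to_float32(v) for v in original]

# Per-component error introduced by the narrower representation.
errors = [abs(a - b) for a, b in zip(original, rehydrated)]
```

Values that are exactly representable in 32 bits, such as `0.25`, come back unchanged; others, such as `0.1`, come back with a small error, which is the kind of loss a user of the hybrid mode has to accept.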

By enabling this setting by default for newly created indices in a follow-up, we can help users avoid surprises from the hidden cost of storing and returning high-dimensional vectors.

## Benchmark Results

Benchmarking this PR against `main` using the `openai` rally track shows substantial improvements at the cost of a loss of precision when retrieving the original vectors:

| Metric                                     | Main (Baseline) | This PR (Contender) | Change          | % Change    |
| :----------------------------------------- | :-------------- | :------------------ | :-------------- | :---------- |
| **Indexing throughput (mean)**             | 1690.77 docs/s  | 2704.57 docs/s      | +1013.79 docs/s | **+59.96%** |
| **Indexing time**                          | 120.25 min      | 74.32 min           | –45.93 min      | **–38.20%** |
| **Merge time**                             | 132.56 min      | 69.28 min           | –63.28 min      | **–47.74%** |
| **Merge throttle time**                    | 100.99 min      | 36.30 min           | –64.69 min      | **–64.06%** |
| **Flush time**                             | 2.71 min        | 1.48 min            | –1.23 min       | **–45.29%** |
| **Refresh count**                          | 60              | 42                  | –18             | **–30.00%** |
| **Dataset / Store size**                   | 52.29 GB        | 19.30 GB            | –32.99 GB       | **–63.09%** |
| **Young Gen GC time**                      | 30.64 s         | 22.17 s             | –8.47 s         | **–27.65%** |
| **Search throughput (k=10, multi-client)** | 613 ops/s       | 677 ops/s           | +64 ops/s       | **+10.42%** |
| **Search latency (p99, k=10)**             | 29.5 ms         | 26.5 ms             | –3.0 ms         | **–10.43%** |
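
For readers who want to check the table, the `% Change` column is the relative change from baseline to contender. A minimal sketch follows; the helper name `pct_change` is ours. Recomputing from the rounded figures shown above can differ from the table in the last digits for some rows, presumably because the table was derived from unrounded measurements.

```python
def pct_change(baseline: float, contender: float) -> float:
    """Relative change from baseline to contender, in percent (2 decimals)."""
    return round((contender - baseline) / baseline * 100, 2)


store_size = pct_change(52.29, 19.30)       # GB
indexing_tput = pct_change(1690.77, 2704.57)  # docs/s
merge_time = pct_change(132.56, 69.28)      # min
```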

## Miscellaneous

Reindexing is not covered in this PR. Since it's one of the main use cases for returning vectors, the plan is for reindex to **force the inclusion of** vectors by default. This will be addressed in a follow-up, as this PR is already quite large.
2025-07-07 10:34:37 +01:00
src Remove vectors from `_source` transparently (#130382) 2025-07-07 10:34:37 +01:00
build.gradle [Tests] Fix copying files for test cluster (#124628) 2025-03-12 16:09:55 +01:00