elasticsearch/qa
Jim Ferenczi c7a482a462
Remove vectors from `_source` transparently (#130382)
## Summary

This PR introduces a new **hybrid mode for the `_source` field** that stores the original source **without dense vector fields**. The goal is to reduce storage overhead and improve performance, especially as vector sizes grow. The setting also affects whether vectors are returned in **search and get APIs**, which matters even for synthetic source, since reconstructing vectors from doc values can be expensive.

## Background

Today, Elasticsearch supports two modes for `_source`:

* **Stored**: Original JSON is persisted as-is.
* **Synthetic**: `_source` is reconstructed from doc values at read time.

However, dense vector fields have become problematic:

* They **don’t compress well**, unlike text.
* They are **already stored in doc values**, so storing them again in `_source` is wasteful.
* Their `_source` representation is often **overly precise** (double precision), which isn’t needed for search/indexing.
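
To make the precision point concrete, here is a minimal sketch of what happens when a double-precision value from the original JSON is narrowed to a 32-bit float and read back. It assumes the indexed vectors use 32-bit float elements; the helper name `to_float32` is made up for this illustration and is not part of Elasticsearch.

```python
import struct


def to_float32(value: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float and widen it back."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


# Values as they might appear in the original JSON (parsed as doubles).
original = [0.1, 0.123456789012345, -0.987654321098765]

# Values as they would come back after a round trip through 32-bit storage.
rehydrated = [to_float32(x) for x in original]

# The difference is tiny, but the rehydrated values are not bit-identical.
max_abs_error = max(abs(a - b) for a, b in zip(original, rehydrated))
print(max_abs_error)
```

Values that are exactly representable in 32 bits (such as `0.5`) survive unchanged; most decimal fractions do not.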

While switching to full synthetic source is an option, retrieving the original `_source` (minus vectors) in one read is often faster and more practical than pulling each field from its own storage when the number of metadata fields is high.

## What This PR Adds

We’re introducing a **hybrid source mode**:

* Keeps the original `_source`, **minus any `dense_vector` fields**.
* Built on top of the **synthetic source infrastructure**, reusing parts of it.
* Controlled via a **single index-level setting**.
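
The idea behind the hybrid mode can be sketched as a filter applied to the document before it is stored. This is only a conceptual illustration in Python, not the PR's implementation: the function `strip_vector_fields` and the way vector paths are passed in are assumptions made for the example.

```python
from typing import Any


def strip_vector_fields(
    source: dict[str, Any], vector_paths: set[str], prefix: str = ""
) -> dict[str, Any]:
    """Return a copy of `source` without the fields listed in `vector_paths`.

    `vector_paths` holds dotted field paths (for example "meta.vec") that the
    mapping declares as dense_vector. Hypothetical helper for illustration.
    """
    stripped: dict[str, Any] = {}
    for key, value in source.items():
        path = f"{prefix}{key}"
        if path in vector_paths:
            continue  # the vector stays available elsewhere in the index
        if isinstance(value, dict):
            value = strip_vector_fields(value, vector_paths, prefix=f"{path}.")
        stripped[key] = value
    return stripped


doc = {
    "title": "hello",
    "embedding": [0.1, 0.2, 0.3],
    "meta": {"lang": "en", "vec": [1.0, 2.0]},
}
stored_source = strip_vector_fields(doc, {"embedding", "meta.vec"})
print(stored_source)
```

Everything except the vector fields is kept byte-for-byte as the user sent it, which is what distinguishes this mode from full synthetic source.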

### Key Behavior

* When enabled, `dense_vector` fields are **excluded from `_source` at index time**.
* The setting **also controls whether vectors are returned in search and get APIs**:

  * This matters even for **synthetic source**, as **rebuilding vectors is costly**.
* You can override behavior at query time using the `exclude_vectors` option.
* The setting is:

  * **Disabled by default**
  * **Protected by a feature flag**
  * Intended to be **enabled by default for new indices** in a follow-up
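
As a rough sketch of how the behavior above might be used, the following builds the JSON bodies for creating an index and for a search that overrides the default. The PR description does not name the index setting or say where `exclude_vectors` goes in the request; the setting name `index.mapping.exclude_source_vectors` and the placement under `_source` are assumptions here, as are the field names and helper functions.

```python
import json

# ASSUMPTION: the setting name is not given in the PR description.
ASSUMED_SETTING = "index.mapping.exclude_source_vectors"


def create_index_body(dims: int, exclude_vectors: bool) -> dict:
    """Body for an index-creation request with one dense_vector field."""
    return {
        "settings": {ASSUMED_SETTING: exclude_vectors},
        "mappings": {
            "properties": {
                "title": {"type": "keyword"},
                "embedding": {"type": "dense_vector", "dims": dims},
            }
        },
    }


def search_body(include_vectors: bool) -> dict:
    """Body for a search that overrides the index default per request.

    ASSUMPTION: `exclude_vectors` is placed under `_source`.
    """
    return {
        "query": {"match_all": {}},
        "_source": {"exclude_vectors": not include_vectors},
    }


print(json.dumps(create_index_body(dims=3, exclude_vectors=True), indent=2))
print(json.dumps(search_body(include_vectors=True), indent=2))
```

Because the setting is behind a feature flag and off by default, a request like this would only take effect on a build where the flag is enabled.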

## Motivation

This hybrid option is designed for use cases where users:

* Want faster reads than full synthetic offers.
* Don’t want the storage cost of large vectors in `_source`.
* Are okay with **some loss of precision** when vectors are rehydrated.

By making this setting the default for newly created indices in a follow-up, we can help users avoid surprises from the hidden cost of storing and returning high-dimensional vectors.

## Benchmark Results

Benchmarking this PR against `main` using the `openai` rally track shows substantial improvements at the cost of a loss of precision when retrieving the original vectors:

| Metric                                     | Main (Baseline) | This PR (Contender) | Change          | % Change    |
| :----------------------------------------- | :-------------- | :------------------ | :-------------- | :---------- |
| **Indexing throughput (mean)**             | 1690.77 docs/s  | 2704.57 docs/s      | +1013.79 docs/s | **+59.96%** |
| **Indexing time**                          | 120.25 min      | 74.32 min           | –45.93 min      | **–38.20%** |
| **Merge time**                             | 132.56 min      | 69.28 min           | –63.28 min      | **–47.74%** |
| **Merge throttle time**                    | 100.99 min      | 36.30 min           | –64.69 min      | **–64.06%** |
| **Flush time**                             | 2.71 min        | 1.48 min            | –1.23 min       | **–45.29%** |
| **Refresh count**                          | 60              | 42                  | –18             | **–30.00%** |
| **Dataset / Store size**                   | 52.29 GB        | 19.30 GB            | –32.99 GB       | **–63.09%** |
| **Young Gen GC time**                      | 30.64 s         | 22.17 s             | –8.47 s         | **–27.65%** |
| **Search throughput (k=10, multi-client)** | 613 ops/s       | 677 ops/s           | +64 ops/s       | **+10.42%** |
| **Search latency (p99, k=10)**             | 29.5 ms         | 26.5 ms             | –3.0 ms         | **–10.43%** |
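
For readers who want to check the table, the `% Change` column is the usual relative change against the baseline. A small sketch, using a few rows from the table (the function name is just for this example):

```python
def percent_change(baseline: float, contender: float) -> float:
    """Relative change of `contender` against `baseline`, in percent."""
    return (contender - baseline) / baseline * 100.0


# (baseline, contender) pairs copied from the table above.
rows = {
    "Indexing time (min)": (120.25, 74.32),
    "Merge time (min)": (132.56, 69.28),
    "Store size (GB)": (52.29, 19.30),
}

for name, (baseline, contender) in rows.items():
    print(f"{name}: {percent_change(baseline, contender):+.2f}%")
```

Some rows in the table may differ in the last digit when recomputed from the displayed values, presumably because the percentages were derived from unrounded measurements.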

## Miscellaneous

Reindexing is not covered in this PR. Since it's one of the main use cases for returning vectors, the plan is for reindex to **force the inclusion of** vectors by default. This will be addressed in a follow-up, as this PR is already quite large.
2025-07-07 10:34:37 +01:00

| Name | Last commit | Date |
| :--- | :---------- | :--- |
| ccs-common-rest | Remove vectors from `_source` transparently (#130382) | 2025-07-07 10:34:37 +01:00 |
| ccs-rolling-upgrade-remote-cluster | Add ability to redirect ingestion failures on data streams to a failure store (#126973) | 2025-04-18 16:33:03 -04:00 |
| ccs-unavailable-clusters | [Build] Require reason for usesDefaultDistribution (#124707) | 2025-03-17 08:25:39 +01:00 |
| custom-rest-controller | Add AGPLv3 as a supported license | 2024-09-13 15:29:46 -07:00 |
| evil-tests | Remove doPrivileged uses from server (#127781) | 2025-05-07 07:24:53 -07:00 |
| full-cluster-restart | Mute testSnapshotRestore in bcUpgradeTest (#129767) | 2025-06-20 19:04:09 +01:00 |
| logging-config | Remove security manager policy files (#127727) | 2025-05-06 19:37:46 +02:00 |
| logging-spi | Update Gradle wrapper to 8.12 (#118683) | 2024-12-30 15:34:24 +01:00 |
| lucene-index-compatibility | Increase timeout for index migration in FullClusterRestartSystemIndexCompatibilityIT (#127710) | 2025-05-05 19:37:31 +02:00 |
| mixed-cluster | Include mapper extras yaml tests into mixed cluster qa module. (#130023) | 2025-06-26 10:48:33 +02:00 |
| multi-cluster-search | [Build] Remove deprecated BuildParams (#116984) | 2024-11-22 16:30:57 +01:00 |
| no-bootstrap-tests | Cleanup missing use of StandardCharsets (#125424) | 2025-03-21 20:10:15 +01:00 |
| packaging | Restructure docker files for docker distributions (#127960) | 2025-05-19 19:47:34 +02:00 |
| remote-clusters | Update Gradle wrapper to 8.12 (#118683) | 2024-12-30 15:34:24 +01:00 |
| repository-multi-version | [Build] Remove deprecated BuildParams (#116984) | 2024-11-22 16:30:57 +01:00 |
| restricted-loggers | Add AGPLv3 as a supported license | 2024-09-13 15:29:46 -07:00 |
| rolling-upgrade | [Build] Extract logsdb rolling-upgrade tests (#129673) | 2025-06-19 22:04:36 +02:00 |
| rolling-upgrade-legacy | [Gradle] Make rolling upgrade tests configuration cache compatible (#119577) | 2025-01-16 23:23:04 +11:00 |
| smoke-test-http | Always log data node failures (#127420) | 2025-04-29 09:40:31 -04:00 |
| smoke-test-ingest-disabled | Migrate legacy QA projects to new test clusters framework (#125545) | 2025-03-26 10:05:56 -07:00 |
| smoke-test-ingest-with-all-dependencies | Migrate legacy QA projects to new test clusters framework (#125545) | 2025-03-26 10:05:56 -07:00 |
| smoke-test-multinode | Remove vectors from `_source` transparently (#130382) | 2025-07-07 10:34:37 +01:00 |
| smoke-test-plugins | Migrate legacy QA projects to new test clusters framework (#125545) | 2025-03-26 10:05:56 -07:00 |
| stable-api | Validate that stable plugins do not break compatibility (#92776) | 2023-01-18 06:48:48 -05:00 |
| system-indices | [main] Move system indices migration to migrate plugin (#125437) | 2025-04-04 18:49:38 +01:00 |
| unconfigured-node-name | Remove security manager policy files (#127727) | 2025-05-06 19:37:46 +02:00 |
| vector | Adding num_searchers to KnnIndexTester to simulate multiple callers (#130492) | 2025-07-03 09:28:51 -04:00 |
| verify-version-constants | Re-enable VerifyVersionConstantsIT (#125605) | 2025-03-25 12:16:53 -07:00 |
| build.gradle | Do not create unused testCluster (#77581) | 2021-09-23 03:45:59 -04:00 |