Conditionally force sequential reading in LuceneSyntheticSourceChangesSnapshot (#128473)

Change `LuceneSyntheticSourceChangesSnapshot` to force sequential stored field reading when `index.codec` is `best_compression`.

In CCR benchmarks I see that we relatively often spend a lot of time decompressing the same stored field block over and over again when the doc ids are not dense. When a seqno range is requested, the corresponding doc id list is likely to contain gaps. However, most doc ids are monotonically increasing, so not reading sequentially harms performance. The reason we currently don't load sequentially is the logic in `StoredFieldLoader#hasSequentialDocs(...)`, which requires all requested doc ids to be consecutive (no gaps allowed). In the case of `LuceneSyntheticSourceChangesSnapshot` with stored field best compression that is too conservative. In practice, we end up decompressing a stored field block for each doc id whose source we need to synthesize for recovery.
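
To illustrate how strict a "no gaps allowed" check is, here is a minimal sketch. The class `DenseDocsCheck` and the method `isDense` are hypothetical names for illustration, not the Elasticsearch implementation, and the sketch assumes the doc ids are sorted ascending without duplicates.

```java
public final class DenseDocsCheck {

    // Hypothetical stand-in for a strict "no gaps" check.
    // Assumes docIds is sorted ascending and contains no duplicates:
    // then the ids are consecutive exactly when last - first == length - 1.
    public static boolean isDense(int[] docIds) {
        return docIds.length > 0
            && docIds[docIds.length - 1] - docIds[0] == docIds.length - 1;
    }

    public static void main(String[] args) {
        // Consecutive doc ids pass the check.
        System.out.println(isDense(new int[] {7, 8, 9, 10}));  // prints "true"
        // A single missing doc id (9) is enough to fail it.
        System.out.println(isDense(new int[] {7, 8, 10, 11})); // prints "false"
    }
}
```

A single gap anywhere in the requested range makes such a check fail, even if every other doc id is adjacent to its neighbour.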

I think it makes sense to do sequential reading in this case, given that it is very likely that the requested doc ids will contain many monotonically increasing ranges. Note that the requested doc ids are always sorted in ascending order (this happens in `LuceneSyntheticSourceChangesSnapshot#transformScoreDocsToRecords(...)`).
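
A rough way to see the expected effect is to count block decompressions under a simplified model. The sketch below is not Lucene or Elasticsearch code: the class and method names are hypothetical, it assumes fixed blocks of 2048 docs (the upper bound mentioned for zstd best compression), and it models the non-sequential path as one decompression per requested doc id.

```java
public final class BlockDecompressionSketch {

    // Assumption: every block holds exactly 2048 docs (the stated upper bound).
    static final int DOCS_PER_BLOCK = 2048;

    // Simplified model of the non-sequential path: one decompression per doc id.
    public static int decompressionsWithoutReuse(int[] sortedDocIds) {
        return sortedDocIds.length;
    }

    // Simplified model of the sequential path: the current block stays
    // decompressed, so a new decompression happens only when the block changes.
    public static int decompressionsWithReuse(int[] sortedDocIds) {
        int count = 0;
        int currentBlock = -1;
        for (int docId : sortedDocIds) {
            int block = docId / DOCS_PER_BLOCK;
            if (block != currentBlock) {
                currentBlock = block;
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Every second doc id from 0 to 4094: sorted ascending, but full of gaps.
        int[] docIds = new int[2048];
        for (int i = 0; i < docIds.length; i++) {
            docIds[i] = i * 2;
        }
        System.out.println(decompressionsWithoutReuse(docIds)); // prints "2048"
        System.out.println(decompressionsWithReuse(docIds));    // prints "2"
    }
}
```

Under these assumptions, a sorted but gappy doc id list touches only two blocks, so reusing the current block cuts the work from one decompression per doc to one per block.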
Martijn van Groningen 2025-05-27 13:44:12 +02:00 committed by GitHub
parent 3bc6a4368a
commit 6a4a285284
2 changed files with 10 additions and 1 deletions

@ -0,0 +1,5 @@
pr: 128473
summary: Conditionally force sequential reading in `LuceneSyntheticSourceChangesSnapshot`
area: Logs
type: enhancement
issues: []

@ -17,6 +17,7 @@ import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.util.ArrayUtil;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.IndexVersion;
+import org.elasticsearch.index.codec.CodecService;
import org.elasticsearch.index.fieldvisitor.LeafStoredFieldLoader;
import org.elasticsearch.index.fieldvisitor.StoredFieldLoader;
import org.elasticsearch.index.mapper.MapperService;
@ -85,7 +86,10 @@ public class LuceneSyntheticSourceChangesSnapshot extends SearchBasedChangesSnap
this.maxMemorySizeInBytes = maxMemorySizeInBytes > 0 ? maxMemorySizeInBytes : 1;
this.sourceLoader = mapperService.mappingLookup().newSourceLoader(null, SourceFieldMetrics.NOOP);
Set<String> storedFields = sourceLoader.requiredStoredFields();
-this.storedFieldLoader = StoredFieldLoader.create(false, storedFields);
+String defaultCodec = EngineConfig.INDEX_CODEC_SETTING.get(mapperService.getIndexSettings().getSettings());
+// zstd best compression stores upto 2048 docs in a block, so it is likely that in this case docs are co-located in same block:
+boolean forceSequentialReader = CodecService.BEST_COMPRESSION_CODEC.equals(defaultCodec);
+this.storedFieldLoader = StoredFieldLoader.create(false, storedFields, forceSequentialReader);
this.lastSeenSeqNo = fromSeqNo - 1;
}