Fork TDigest library (#96086)
* Initial import for TDigest forking.
* Fix MedianTest. More work needed for TDigestPercentile*Tests and the TDigestTest (and the rest of the tests) in the tdigest lib to pass.
* Fix Dist.
* Fix AVLTreeDigest.quantile to match Dist for uniform centroids.
* Update docs/changelog/96086.yaml
* Fix `MergingDigest.quantile` to match `Dist` on uniform distribution.
* Add merging to TDigestState.hashCode and .equals. Remove wrong asserts from tests and MergingDigest.
* Fix style violations for tdigest library.
* Fix typo.
* Fix more style violations.
* Fix more style violations.
* Fix remaining style violations in tdigest library.
* Update results in docs based on the forked tdigest.
* Fix YAML tests in aggs module.
* Fix YAML tests in x-pack/plugin.
* Skip failing V7 compat tests in modules/aggregations.
* Fix TDigest library unit tests. Remove redundant serializing interfaces from the library.
* Remove YAML test versions for older releases. These tests don't address compatibility issues in mixed cluster tests, as the latter contain a mix of older and newer nodes, so the output depends on which node is picked as a data node, since the forked TDigest library is not backwards compatible (it produces slightly different results).
* Fix test failures in docs and mixed cluster.
* Reduce buffer sizes in MergingDigest to avoid OOM.
* Exclude more failing V7 compatibility tests.
* Update results for JdbcCsvSpecIT tests.
* Update results for JdbcDocCsvSpecIT tests.
* Revert unrelated change.
* More test fixes.
* Use version skips instead of blacklisting in mixed cluster tests.
* Switch TDigestState back to AVLTreeDigest.
* Update docs and tests with AVLTreeDigest output.
* Update flaky test.
* Remove dead code, esp. around tracking of incoming data.
* Update docs/changelog/96086.yaml
* Delete docs/changelog/96086.yaml
* Remove explicit compression calls. This was added to prevent concurrency tests from failing, but it leads to reduced precision. Submit this to see if the concurrency tests are still failing.
* Revert "Remove explicit compression calls." This reverts commit 5352c96f65.
* Remove explicit compression calls to MedianAbsoluteDeviation input.
* Add unit tests for AVL and merging digest accuracy.
* Fix spotless violations.
* Delete redundant tests and benchmarks.
* Fix spotless violation.
* Use the old implementation of AVLTreeDigest. The latest library version is 50% slower and less accurate, as verified by ComparisonTests.
* Update docs with latest percentile results.
* Update docs with latest percentile results.
* Remove repeated compression calls.
* Update more percentile results.
* Use approximate percentile values in integration tests. This helps with mixed cluster tests, where some of the tests were blocked.
* Fix expected percentile value in test.
* Revert in-place node updates in AVL tree. Update quantile calculations between centroids and min/max values to match v.3.2.
* Add SortingDigest and HybridDigest. The SortingDigest tracks all samples in an ArrayList that gets sorted for quantile calculations. This approach provides perfectly accurate results and is the most efficient implementation for up to millions of samples, at the cost of a bloated memory footprint. The HybridDigest uses a SortingDigest for small sample populations, then switches to a MergingDigest. This approach combines the best performance and accuracy for small sample counts with very good performance and acceptable accuracy for effectively unbounded sample counts.
* Remove deps to the 3.2 library.
* Remove unused licenses for tdigest.
* Revert changes for SortingDigest and HybridDigest. These will be submitted in a follow-up PR for enabling MergingDigest.
* Remove unused Histogram classes and unit tests. Delete dead and commented-out code, make the remaining tests run reasonably fast. Remove unused annotations, esp. SuppressWarnings.
* Remove Comparison class, not used.
* Small fixes.
* Add javadoc and tests.
* Remove special logic for singletons in the boundaries. While this helps with the case where the digest contains only singletons (perfect accuracy), it has a major problem (non-monotonic quantile function) when the first singleton is followed by a non-singleton centroid. It's preferable to revert to the old version from 3.2; inaccuracies in a singleton-only digest should be mitigated by using a sorted array for small sample counts.
* Revert changes to expected values in tests. This is due to restoring quantile functions to match head.
* Revert changes to expected values in tests. This is due to restoring quantile functions to match head.
* Tentatively restore percentile rank expected results.
* Use cdf version from 3.2. Update Dist.cdf to use interpolation, use the same cdf version in AVLTreeDigest and MergingDigest.
* Revert "Tentatively restore percentile rank expected results." This reverts commit 7718dbba59.
* Revert remaining changes compared to main.
* Revert excluded V7 compat tests.
* Exclude V7 compat tests still failing.
* Exclude V7 compat tests still failing.
* Restore bySize function in TDigest and subclasses.
parent 4543bfbc0e
commit 67211be81d
@@ -0,0 +1,79 @@
/*
 * Licensed to Elasticsearch B.V. under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch B.V. licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 *
 * This project is based on a modification of https://github.com/tdunning/t-digest which is licensed under the Apache 2.0 License.
 */

package org.elasticsearch.benchmark.tdigest;

import org.elasticsearch.tdigest.Sort;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;
import org.openjdk.jmh.annotations.Warmup;

import java.util.Arrays;
import java.util.Random;
import java.util.concurrent.TimeUnit;

/** Explores the performance of Sort on pathological input data. */
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 10, time = 3, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 20, time = 2, timeUnit = TimeUnit.SECONDS)
@Fork(1)
@Threads(1)
@State(Scope.Thread)
public class SortBench {
    private final int size = 100000;
    private final double[] values = new double[size];

    @Param({ "0", "1", "-1" })
    public int sortDirection;

    @Setup
    public void setup() {
        Random prng = new Random(999983);
        for (int i = 0; i < size; i++) {
            values[i] = prng.nextDouble();
        }
        if (sortDirection > 0) {
            Arrays.sort(values);
        } else if (sortDirection < 0) {
            Arrays.sort(values);
            Sort.reverse(values, 0, values.length);
        }
    }

    @Benchmark
    public void quicksort() {
        int[] order = new int[size];
        for (int i = 0; i < size; i++) {
            order[i] = i;
        }
        Sort.sort(order, values, null, values.length);
    }
}
@@ -0,0 +1,131 @@
/*
 * Licensed to Elasticsearch B.V. under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch B.V. licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 *
 * This project is based on a modification of https://github.com/tdunning/t-digest which is licensed under the Apache 2.0 License.
 */

package org.elasticsearch.benchmark.tdigest;

import org.elasticsearch.tdigest.AVLTreeDigest;
import org.elasticsearch.tdigest.MergingDigest;
import org.elasticsearch.tdigest.TDigest;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;
import org.openjdk.jmh.annotations.Warmup;
import org.openjdk.jmh.profile.GCProfiler;
import org.openjdk.jmh.profile.StackProfiler;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 3, time = 3, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@Fork(1)
@Threads(1)
@State(Scope.Thread)
public class TDigestBench {

    public enum TDigestFactory {
        MERGE {
            @Override
            TDigest create(double compression) {
                return new MergingDigest(compression, (int) (10 * compression));
            }
        },
        AVL_TREE {
            @Override
            TDigest create(double compression) {
                return new AVLTreeDigest(compression);
            }
        };

        abstract TDigest create(double compression);
    }

    @Param({ "100", "300" })
    double compression;

    @Param({ "MERGE", "AVL_TREE" })
    TDigestFactory tdigestFactory;

    @Param({ "NORMAL", "GAUSSIAN" })
    String distribution;

    Random random;
    TDigest tdigest;

    double[] data = new double[1000000];

    @Setup
    public void setUp() {
        random = ThreadLocalRandom.current();
        tdigest = tdigestFactory.create(compression);

        Supplier<Double> nextRandom = () -> distribution.equals("GAUSSIAN") ? random.nextGaussian() : random.nextDouble();
        for (int i = 0; i < 10000; ++i) {
            tdigest.add(nextRandom.get());
        }

        for (int i = 0; i < data.length; ++i) {
            data[i] = nextRandom.get();
        }
    }

    @State(Scope.Thread)
    public static class ThreadState {
        int index = 0;
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    public void add(ThreadState state) {
        if (state.index >= data.length) {
            state.index = 0;
        }
        tdigest.add(data[state.index++]);
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder().include(".*" + TDigestBench.class.getSimpleName() + ".*")
            .warmupIterations(5)
            .measurementIterations(5)
            .addProfiler(GCProfiler.class)
            .addProfiler(StackProfiler.class)
            .build();

        new Runner(opt).run();
    }
}
@@ -61,6 +61,7 @@ public class InternalDistributionModuleCheckTaskProvider {
         "org.elasticsearch.preallocate",
         "org.elasticsearch.securesm",
         "org.elasticsearch.server",
+        "org.elasticsearch.tdigest",
         "org.elasticsearch.xcontent"
     );

@@ -75,7 +76,7 @@ public class InternalDistributionModuleCheckTaskProvider {

     private static final Function<ModuleReference, String> toName = mref -> mref.descriptor().name();

-    private InternalDistributionModuleCheckTaskProvider() {};
+    private InternalDistributionModuleCheckTaskProvider() {}

     /** Registers the checkModules tasks, which contains all checks relevant to ES Java Modules. */
     static TaskProvider<Task> registerCheckModulesTask(Project project, TaskProvider<Copy> checkExtraction) {
@@ -53,16 +53,16 @@ The response will look like this:
   "aggregations": {
     "load_time_ranks": {
       "values": {
-        "500.0": 90.01,
-        "600.0": 100.0
+        "500.0": 55.0,
+        "600.0": 64.0
       }
     }
   }
 }
 --------------------------------------------------
 // TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]
-// TESTRESPONSE[s/"500.0": 90.01/"500.0": 55.00000000000001/]
-// TESTRESPONSE[s/"600.0": 100.0/"600.0": 64.0/]
+// TESTRESPONSE[s/"500.0": 55.0/"500.0": 55.00000000000001/]
+// TESTRESPONSE[s/"600.0": 64.0/"600.0": 64.0/]
 
 From this information you can determine you are hitting the 99% load time target but not quite
 hitting the 95% load time target
@@ -101,11 +101,11 @@ Response:
     "values": [
       {
         "key": 500.0,
-        "value": 90.01
+        "value": 55.0
       },
       {
         "key": 600.0,
-        "value": 100.0
+        "value": 64.0
       }
     ]
 }
@@ -113,8 +113,8 @@ Response:
 }
 --------------------------------------------------
 // TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]
-// TESTRESPONSE[s/"value": 90.01/"value": 55.00000000000001/]
-// TESTRESPONSE[s/"value": 100.0/"value": 64.0/]
+// TESTRESPONSE[s/"value": 55.0/"value": 55.00000000000001/]
+// TESTRESPONSE[s/"value": 64.0/"value": 64.0/]
 
 ==== Script
 
@@ -1,4 +1,4 @@
Apache License
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/

@@ -0,0 +1,28 @@
Elastic-t-digest

Copyright 2023 Elasticsearch B.V.

--
This project is based on a modification of https://github.com/tdunning/t-digest which is licensed under the Apache 2.0 License.

Licensed to Elasticsearch B.V. under one or more contributor
license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright
ownership. Elasticsearch B.V. licenses this file to you under
the Apache License, Version 2.0 (the "License"); you may
not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.

--
The code for the t-digest was originally authored by Ted Dunning

Adrien Grand contributed the heart of the AVLTreeDigest (https://github.com/jpountz)
@@ -0,0 +1,41 @@
import org.elasticsearch.gradle.internal.conventions.precommit.LicenseHeadersTask

/*
 * Licensed to Elasticsearch B.V. under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch B.V. licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */
apply plugin: 'elasticsearch.build'
apply plugin: 'elasticsearch.publish'

dependencies {
    testImplementation(project(":test:framework")) {
        exclude group: 'org.elasticsearch', module: 'elasticsearch-tdigest'
    }
    testImplementation 'org.junit.jupiter:junit-jupiter:5.8.1'
}

tasks.named('forbiddenApisMain').configure {
    // t-digest does not depend on core, so only jdk signatures should be checked
    replaceSignatureFiles 'jdk-signatures'
}

ext.projectLicenses.set(['The Apache Software License, Version 2.0': 'http://www.apache.org/licenses/LICENSE-2.0'])
licenseFile.set(rootProject.file('licenses/APACHE-LICENSE-2.0.txt'))

tasks.withType(LicenseHeadersTask.class).configureEach {
    approvedLicenses = ['Apache', 'Generated', 'Vendored']
}
@@ -0,0 +1,22 @@
/*
 * Licensed to Elasticsearch B.V. under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch B.V. licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */

module org.elasticsearch.tdigest {
    exports org.elasticsearch.tdigest;
}
@@ -0,0 +1,265 @@
/*
 * Licensed to Elasticsearch B.V. under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch B.V. licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 *
 * This project is based on a modification of https://github.com/tdunning/t-digest which is licensed under the Apache 2.0 License.
 */

package org.elasticsearch.tdigest;

import java.util.AbstractCollection;
import java.util.Arrays;
import java.util.Iterator;

/**
 * A tree of t-digest centroids.
 */
final class AVLGroupTree extends AbstractCollection<Centroid> {
    /* For insertions into the tree */
    private double centroid;
    private int count;
    private double[] centroids;
    private int[] counts;
    private int[] aggregatedCounts;
    private final IntAVLTree tree;

    AVLGroupTree() {
        tree = new IntAVLTree() {

            @Override
            protected void resize(int newCapacity) {
                super.resize(newCapacity);
                centroids = Arrays.copyOf(centroids, newCapacity);
                counts = Arrays.copyOf(counts, newCapacity);
                aggregatedCounts = Arrays.copyOf(aggregatedCounts, newCapacity);
            }

            @Override
            protected void merge(int node) {
                // two nodes are never considered equal
                throw new UnsupportedOperationException();
            }

            @Override
            protected void copy(int node) {
                centroids[node] = centroid;
                counts[node] = count;
            }

            @Override
            protected int compare(int node) {
                if (centroid < centroids[node]) {
                    return -1;
                } else {
                    // upon equality, the newly added node is considered greater
                    return 1;
                }
            }

            @Override
            protected void fixAggregates(int node) {
                super.fixAggregates(node);
                aggregatedCounts[node] = counts[node] + aggregatedCounts[left(node)] + aggregatedCounts[right(node)];
            }

        };
        centroids = new double[tree.capacity()];
        counts = new int[tree.capacity()];
        aggregatedCounts = new int[tree.capacity()];
    }

    /**
     * Return the number of centroids in the tree.
     */
    public int size() {
        return tree.size();
    }

    /**
     * Return the previous node.
     */
    public int prev(int node) {
        return tree.prev(node);
    }

    /**
     * Return the next node.
     */
    public int next(int node) {
        return tree.next(node);
    }

    /**
     * Return the mean for the provided node.
     */
    public double mean(int node) {
        return centroids[node];
    }

    /**
     * Return the count for the provided node.
     */
    public int count(int node) {
        return counts[node];
    }

    /**
     * Add the provided centroid to the tree.
     */
    public void add(double centroid, int count) {
        this.centroid = centroid;
        this.count = count;
        tree.add();
    }

    @Override
    public boolean add(Centroid centroid) {
        add(centroid.mean(), centroid.count());
        return true;
    }

    /**
     * Update values associated with a node, readjusting the tree if necessary.
     */
    public void update(int node, double centroid, int count) {
        // have to do full scale update
        this.centroid = centroid;
        this.count = count;
        tree.update(node);
    }

    /**
     * Return the last node whose centroid is less than <code>centroid</code>.
     */
    public int floor(double centroid) {
        int floor = IntAVLTree.NIL;
        for (int node = tree.root(); node != IntAVLTree.NIL;) {
            final int cmp = Double.compare(centroid, mean(node));
            if (cmp <= 0) {
                node = tree.left(node);
            } else {
                floor = node;
                node = tree.right(node);
            }
        }
        return floor;
    }

    /**
     * Return the last node so that the sum of counts of nodes that are before
     * it is less than or equal to <code>sum</code>.
     */
    public int floorSum(long sum) {
        int floor = IntAVLTree.NIL;
        for (int node = tree.root(); node != IntAVLTree.NIL;) {
            final int left = tree.left(node);
            final long leftCount = aggregatedCounts[left];
            if (leftCount <= sum) {
                floor = node;
                sum -= leftCount + count(node);
                node = tree.right(node);
            } else {
                node = tree.left(node);
            }
        }
        return floor;
    }

    /**
     * Return the least node in the tree.
     */
    public int first() {
        return tree.first(tree.root());
    }

    /**
     * Return the greatest node in the tree.
     */
    public int last() {
        return tree.last(tree.root());
    }

    /**
     * Compute the number of elements and sum of counts for every entry that
     * is strictly before <code>node</code>.
     */
    public long headSum(int node) {
        final int left = tree.left(node);
        long sum = aggregatedCounts[left];
        for (int n = node, p = tree.parent(node); p != IntAVLTree.NIL; n = p, p = tree.parent(n)) {
            if (n == tree.right(p)) {
                final int leftP = tree.left(p);
                sum += counts[p] + aggregatedCounts[leftP];
            }
        }
        return sum;
    }

    @Override
    public Iterator<Centroid> iterator() {
        return iterator(first());
    }

    private Iterator<Centroid> iterator(final int startNode) {
        return new Iterator<>() {

            int nextNode = startNode;

            @Override
            public boolean hasNext() {
                return nextNode != IntAVLTree.NIL;
            }

            @Override
            public Centroid next() {
                final Centroid next = new Centroid(mean(nextNode), count(nextNode));
                nextNode = tree.next(nextNode);
                return next;
            }

            @Override
            public void remove() {
                throw new UnsupportedOperationException("Read-only iterator");
            }

        };
    }

    /**
     * Return the total count of points that have been added to the tree.
     */
    public int sum() {
        return aggregatedCounts[tree.root()];
    }

    void checkBalance() {
        tree.checkBalance(tree.root());
    }

    void checkAggregates() {
        checkAggregates(tree.root());
    }

    private void checkAggregates(int node) {
        assert aggregatedCounts[node] == counts[node] + aggregatedCounts[tree.left(node)] + aggregatedCounts[tree.right(node)];
        if (node != IntAVLTree.NIL) {
            checkAggregates(tree.left(node));
            checkAggregates(tree.right(node));
        }
    }

}
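
A minimal usage sketch (not part of this change) of the AVLGroupTree API shown above. The class and its constructor are package-private, so code like this would have to live in the org.elasticsearch.tdigest package; the wrapper class name and the values are hypothetical and purely illustrative.

package org.elasticsearch.tdigest;

// Hypothetical, illustration only: exercise the AVLGroupTree API from the same package.
class AVLGroupTreeExample {
    public static void main(String[] args) {
        AVLGroupTree tree = new AVLGroupTree();
        tree.add(1.0, 1); // centroid with mean 1.0 and weight 1
        tree.add(2.0, 3); // centroid with mean 2.0 and weight 3
        tree.add(5.0, 2); // centroid with mean 5.0 and weight 2

        int node = tree.floor(4.0);               // last node whose mean is less than 4.0 -> the 2.0 centroid
        System.out.println(tree.mean(node));      // 2.0
        System.out.println(tree.count(node));     // 3
        System.out.println(tree.headSum(node));   // total weight strictly before that node: 1
        System.out.println(tree.sum());           // total weight in the tree: 6
    }
}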
@@ -0,0 +1,365 @@
/*
 * Licensed to Elasticsearch B.V. under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch B.V. licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 *
 * This project is based on a modification of https://github.com/tdunning/t-digest which is licensed under the Apache 2.0 License.
 */

package org.elasticsearch.tdigest;

import java.util.Collection;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

import static org.elasticsearch.tdigest.IntAVLTree.NIL;

public class AVLTreeDigest extends AbstractTDigest {
    final Random gen = new Random();
    private final double compression;
    private AVLGroupTree summary;

    private long count = 0; // package private for testing

    // Indicates if a sample has been added after the last compression.
    private boolean needsCompression;

    /**
     * A histogram structure that will record a sketch of a distribution.
     *
     * @param compression How should accuracy be traded for size? A value of N here will give quantile errors
     *                    almost always less than 3/N with considerably smaller errors expected for extreme
     *                    quantiles. Conversely, you should expect to track about 5 N centroids for this
     *                    accuracy.
     */
    public AVLTreeDigest(double compression) {
        this.compression = compression;
        summary = new AVLGroupTree();
    }

    /**
     * Sets the seed for the RNG.
     * In cases where a predictable tree should be created, this function may be used to make the
     * randomness in this AVLTree become more deterministic.
     *
     * @param seed The random seed to use for RNG purposes
     */
    public void setRandomSeed(long seed) {
        gen.setSeed(seed);
    }

    @Override
    public int centroidCount() {
        return summary.size();
    }

    @Override
    public void add(List<? extends TDigest> others) {
        for (TDigest other : others) {
            setMinMax(Math.min(min, other.getMin()), Math.max(max, other.getMax()));
            for (Centroid centroid : other.centroids()) {
                add(centroid.mean(), centroid.count());
            }
        }
    }

    @Override
    public void add(double x, int w) {
        checkValue(x);
        needsCompression = true;

        if (x < min) {
            min = x;
        }
        if (x > max) {
            max = x;
        }
        int start = summary.floor(x);
        if (start == NIL) {
            start = summary.first();
        }

        if (start == NIL) { // empty summary
            assert summary.size() == 0;
            summary.add(x, w);
            count = w;
        } else {
            double minDistance = Double.MAX_VALUE;
            int lastNeighbor = NIL;
            for (int neighbor = start; neighbor != NIL; neighbor = summary.next(neighbor)) {
                double z = Math.abs(summary.mean(neighbor) - x);
                if (z < minDistance) {
                    start = neighbor;
                    minDistance = z;
                } else if (z > minDistance) {
                    // as soon as z increases, we have passed the nearest neighbor and can quit
                    lastNeighbor = neighbor;
                    break;
                }
            }

            int closest = NIL;
            double n = 0;
            long sum = summary.headSum(start);
            for (int neighbor = start; neighbor != lastNeighbor; neighbor = summary.next(neighbor)) {
                assert minDistance == Math.abs(summary.mean(neighbor) - x);
                double q = count == 1 ? 0.5 : (sum + (summary.count(neighbor) - 1) / 2.0) / (count - 1);
                double k = 4 * count * q * (1 - q) / compression;

                // this slightly clever selection method improves accuracy with lots of repeated points
                // what it does is sample uniformly from all clusters that have room
                if (summary.count(neighbor) + w <= k) {
                    n++;
                    if (gen.nextDouble() < 1 / n) {
                        closest = neighbor;
                    }
                }
                sum += summary.count(neighbor);
            }

            if (closest == NIL) {
                summary.add(x, w);
            } else {
                // if the nearest point was not unique, then we may not be modifying the first copy
                // which means that ordering can change
                double centroid = summary.mean(closest);
                int count = summary.count(closest);
                centroid = weightedAverage(centroid, count, x, w);
                count += w;
                summary.update(closest, centroid, count);
            }
            count += w;

            if (summary.size() > 20 * compression) {
                // may happen in case of sequential points
                compress();
            }
        }
    }

    @Override
    public void compress() {
        if (needsCompression == false) {
            return;
        }
        needsCompression = false;

        AVLGroupTree centroids = summary;
        this.summary = new AVLGroupTree();

        final int[] nodes = new int[centroids.size()];
        nodes[0] = centroids.first();
        for (int i = 1; i < nodes.length; ++i) {
            nodes[i] = centroids.next(nodes[i - 1]);
            assert nodes[i] != IntAVLTree.NIL;
        }
        assert centroids.next(nodes[nodes.length - 1]) == IntAVLTree.NIL;

        for (int i = centroids.size() - 1; i > 0; --i) {
            final int other = gen.nextInt(i + 1);
            final int tmp = nodes[other];
            nodes[other] = nodes[i];
            nodes[i] = tmp;
        }

        for (int node : nodes) {
            add(centroids.mean(node), centroids.count(node));
        }
    }

    /**
     * Returns the number of samples represented in this histogram. If you want to know how many
     * centroids are being used, try centroids().size().
     *
     * @return the number of samples that have been added.
     */
    @Override
    public long size() {
        return count;
    }

    /**
     * @param x the value at which the CDF should be evaluated
     * @return the approximate fraction of all samples that were less than or equal to x.
     */
    @Override
    public double cdf(double x) {
        AVLGroupTree values = summary;
        if (values.size() == 0) {
            return Double.NaN;
        }
        if (values.size() == 1) {
            if (x < values.mean(values.first())) return 0;
            if (x > values.mean(values.first())) return 1;
            return 0.5;
        } else {
            if (x < min) {
                return 0;
            }
            if (Double.compare(x, min) == 0) {
                // we have one or more centroids == x, treat them as one
                // dw will accumulate the weight of all of the centroids at x
                double dw = 0;
                for (Centroid value : values) {
                    if (Double.compare(value.mean(), x) != 0) {
                        break;
                    }
                    dw += value.count();
                }
                return dw / 2.0 / size();
            }

            if (x > max) {
                return 1;
            }
            if (Double.compare(x, max) == 0) {
                int ix = values.last();
                double dw = 0;
                while (ix != NIL && Double.compare(values.mean(ix), x) == 0) {
                    dw += values.count(ix);
                    ix = values.prev(ix);
                }
                long n = size();
                return (n - dw / 2.0) / n;
            }

            // we scan a across the centroids
            Iterator<Centroid> it = values.iterator();
            Centroid a = it.next();

            // b is the look-ahead to the next centroid
            Centroid b = it.next();

            // initially, we set left width equal to right width
            double left = (b.mean() - a.mean()) / 2;
            double right = left;

            // scan to next to last element
            double r = 0;
            while (it.hasNext()) {
                if (x < a.mean() + right) {
                    double value = (r + a.count() * interpolate(x, a.mean() - left, a.mean() + right)) / count;
                    return Math.max(value, 0.0);
                }

                r += a.count();
                a = b;
                left = right;
                b = it.next();
                right = (b.mean() - a.mean()) / 2;
            }

            // for the last element, assume right width is same as left
            if (x < a.mean() + right) {
                return (r + a.count() * interpolate(x, a.mean() - right, a.mean() + right)) / count;
            }
            return 1;
        }
    }

    /**
     * @param q The quantile desired. Can be in the range [0,1].
     * @return The minimum value x such that we think that the proportion of samples that are ≤ x is q.
     */
    @Override
    public double quantile(double q) {
        if (q < 0 || q > 1) {
            throw new IllegalArgumentException("q should be in [0,1], got " + q);
        }

        AVLGroupTree values = summary;
        if (values.size() == 0) {
            // no centroids means no data, no way to get a quantile
            return Double.NaN;
        } else if (values.size() == 1) {
            // with one data point, all quantiles lead to Rome
            return values.iterator().next().mean();
        }

        // if values were stored in a sorted array, index would be the offset we are interested in
        final double index = q * count;

        // deal with min and max as a special case singletons
        if (index <= 0) {
            return min;
        }

        if (index >= count) {
            return max;
        }

        int currentNode = values.first();
        int currentWeight = values.count(currentNode);

        // Total mass to the left of the center of the current node.
        double weightSoFar = currentWeight / 2.0;

        if (index <= weightSoFar && weightSoFar > 1) {
            // Interpolate between min and first mean, if there's no singleton on the left boundary.
            return weightedAverage(min, weightSoFar - index, values.mean(currentNode), index);
        }

        for (int i = 0; i < values.size() - 1; i++) {
            int nextNode = values.next(currentNode);
            int nextWeight = values.count(nextNode);
            // this is the mass between current center and next center
            double dw = (currentWeight + nextWeight) / 2.0;

            if (index < weightSoFar + dw) {
                // index is bracketed between centroids i and i+1
                assert dw >= 1;

                double w1 = index - weightSoFar;
                double w2 = weightSoFar + dw - index;
                return weightedAverage(values.mean(currentNode), w2, values.mean(nextNode), w1);
            }
            weightSoFar += dw;
            currentNode = nextNode;
            currentWeight = nextWeight;
        }

        // Index is close or after the last centroid.
        assert currentWeight >= 1;
        assert index - weightSoFar < count - currentWeight / 2.0;
        assert count - weightSoFar >= 0.5;

        // Interpolate between the last mean and the max.
        double w1 = index - weightSoFar;
        double w2 = currentWeight / 2.0 - w1;
        return weightedAverage(values.mean(currentNode), w2, max, w1);
    }

    @Override
    public Collection<Centroid> centroids() {
        return Collections.unmodifiableCollection(summary);
    }

    @Override
    public double compression() {
        return compression;
    }

    /**
     * Returns an upper bound on the number of bytes that will be required to represent this histogram.
     */
    @Override
    public int byteSize() {
        compress();
        return 64 + summary.size() * 13;
    }
}
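
A brief usage sketch (not part of the diff) of the public AVLTreeDigest API shown above. The class name of the example and all numbers are hypothetical; the quantile and cdf results are approximations whose exact values depend on the interpolation performed in quantile() and cdf().

import org.elasticsearch.tdigest.AVLTreeDigest;

// Hypothetical, illustration only: build a digest from uniform samples and query it.
public class AVLTreeDigestExample {
    public static void main(String[] args) {
        AVLTreeDigest digest = new AVLTreeDigest(100); // compression = 100
        digest.setRandomSeed(42);                      // make centroid selection deterministic
        java.util.Random random = new java.util.Random(42);
        for (int i = 0; i < 100_000; i++) {
            digest.add(random.nextDouble(), 1);        // each sample added with weight 1
        }
        System.out.println(digest.size());             // 100000 samples represented
        System.out.println(digest.centroidCount());    // far fewer centroids than samples
        System.out.println(digest.quantile(0.5));      // approximately 0.5 for uniform data
        System.out.println(digest.cdf(0.9));           // approximately 0.9 for uniform data
    }
}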
@@ -0,0 +1,69 @@
/*
 * Licensed to Elasticsearch B.V. under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch B.V. licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 *
 * This project is based on a modification of https://github.com/tdunning/t-digest which is licensed under the Apache 2.0 License.
 */

package org.elasticsearch.tdigest;

public abstract class AbstractTDigest extends TDigest {
    /**
     * Same as {@link #weightedAverageSorted(double, double, double, double)} but flips
     * the order of the variables if <code>x2</code> is less than
     * <code>x1</code>.
     */
    static double weightedAverage(double x1, double w1, double x2, double w2) {
        if (x1 <= x2) {
            return weightedAverageSorted(x1, w1, x2, w2);
        } else {
            return weightedAverageSorted(x2, w2, x1, w1);
        }
    }

    /**
     * Compute the weighted average between <code>x1</code> with a weight of
     * <code>w1</code> and <code>x2</code> with a weight of <code>w2</code>.
     * This expects <code>x1</code> to be less than or equal to <code>x2</code>
     * and is guaranteed to return a number in <code>[x1, x2]</code>. An
     * explicit check is required since this isn't guaranteed with floating-point
     * numbers.
     */
    private static double weightedAverageSorted(double x1, double w1, double x2, double w2) {
        assert x1 <= x2;
        final double x = (x1 * w1 + x2 * w2) / (w1 + w2);
        return Math.max(x1, Math.min(x, x2));
    }

    /**
     * Interpolate a given value between low and high reference values.
     * @param x value to interpolate from
     * @param x0 low reference value
     * @param x1 high reference value
     * @return interpolated value
     */
    static double interpolate(double x, double x0, double x1) {
        return (x - x0) / (x1 - x0);
    }

    @Override
    public void add(TDigest other) {
        for (Centroid centroid : other.centroids()) {
            add(centroid.mean(), centroid.count());
        }
    }
}
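
For illustration only, a small same-package sketch (hypothetical class name, assuming package-private access from org.elasticsearch.tdigest) of the arithmetic performed by the two helpers above.

package org.elasticsearch.tdigest;

// Hypothetical, illustration only: concrete values for the helper arithmetic.
class AbstractTDigestHelpersExample {
    public static void main(String[] args) {
        // weightedAverage(1.0, 1, 3.0, 3) = (1.0 * 1 + 3.0 * 3) / (1 + 3) = 2.5,
        // and the result is clamped into [1.0, 3.0].
        System.out.println(AbstractTDigest.weightedAverage(1.0, 1, 3.0, 3)); // 2.5

        // interpolate(x, x0, x1) is the relative position of x between x0 and x1.
        System.out.println(AbstractTDigest.interpolate(2.0, 1.0, 3.0)); // 0.5
    }
}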
@@ -0,0 +1,106 @@
/*
 * Licensed to Elasticsearch B.V. under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch B.V. licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 *
 * This project is based on a modification of https://github.com/tdunning/t-digest which is licensed under the Apache 2.0 License.
 */

package org.elasticsearch.tdigest;

import java.util.concurrent.atomic.AtomicInteger;

/**
 * A single centroid which represents a number of data points.
 */
public class Centroid implements Comparable<Centroid> {
    private static final AtomicInteger uniqueCount = new AtomicInteger(1);

    private double centroid = 0;
    private int count = 0;

    // The ID is transient because it must be unique within a given JVM. A new
    // ID should be generated from uniqueCount when a Centroid is deserialized.
    private transient int id;

    private Centroid() {
        id = uniqueCount.getAndIncrement();
    }

    public Centroid(double x) {
        this();
        start(x, 1, uniqueCount.getAndIncrement());
    }

    public Centroid(double x, int w) {
        this();
        start(x, w, uniqueCount.getAndIncrement());
    }

    public Centroid(double x, int w, int id) {
        this();
        start(x, w, id);
    }

    private void start(double x, int w, int id) {
        this.id = id;
        add(x, w);
    }

    public void add(double x, int w) {
        count += w;
        centroid += w * (x - centroid) / count;
    }

    public double mean() {
        return centroid;
    }

    public int count() {
        return count;
    }

    public int id() {
        return id;
    }

    @Override
    public String toString() {
        return "Centroid{" + "centroid=" + centroid + ", count=" + count + '}';
    }

    @Override
    public int hashCode() {
        return id;
    }

    @Override
    public boolean equals(Object o) {
        if (o instanceof Centroid == false) {
            return false;
        }
        return id == ((Centroid) o).id;
    }

    @Override
    public int compareTo(Centroid o) {
        int r = Double.compare(centroid, o.centroid);
        if (r == 0) {
            r = id - o.id;
        }
        return r;
    }
}
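
A short example (hypothetical class name, not part of this change) of how a Centroid maintains a running weighted mean, based solely on the add() implementation above.

import org.elasticsearch.tdigest.Centroid;

// Hypothetical, illustration only: a Centroid folds points into a weighted mean.
public class CentroidExample {
    public static void main(String[] args) {
        Centroid c = new Centroid(5.0, 2);   // mean 5.0, weight 2
        c.add(7.0, 1);                       // fold in one point at 7.0
        System.out.println(c.mean());        // (5.0 * 2 + 7.0) / 3 ≈ 5.6667
        System.out.println(c.count());       // 3
    }
}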
@@ -0,0 +1,98 @@
/*
 * Licensed to Elasticsearch B.V. under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch B.V. licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 *
 * This project is based on a modification of https://github.com/tdunning/t-digest which is licensed under the Apache 2.0 License.
 */

package org.elasticsearch.tdigest;

import java.util.List;
import java.util.function.Function;

/**
 * Reference implementations for cdf and quantile if we have all data sorted.
 */
public class Dist {

    private static double cdf(final double x, final int length, Function<Integer, Double> elementGetter) {
        if (Double.compare(x, elementGetter.apply(0)) < 0) {
            return 0;
        }

        double n1 = 0.5;
        int n2 = 0;
        for (int i = 1; i < length; i++) {
            double value = elementGetter.apply(i);
            int compareResult = Double.compare(value, x);
            if (compareResult > 0) {
                if (Double.compare(n2, 0) > 0) {
                    return (n1 + 0.5 * n2) / length;
                }
                double previousValue = elementGetter.apply(i - 1);
                double factor = (x - previousValue) / (value - previousValue);
                return (n1 + factor) / length;
            }
            if (compareResult < 0) {
                n1++;
            } else {
                n2++;
            }
        }
        return (length - 0.5 * n2) / length;
    }

    public static double cdf(final double x, double[] data) {
        return cdf(x, data.length, (i) -> data[i]);
    }

    public static double cdf(final double x, List<Double> data) {
        return cdf(x, data.size(), data::get);
    }

    private static double quantile(final double q, final int length, Function<Integer, Double> elementGetter) {
        if (length == 0) {
            return Double.NaN;
        }
        double index = q * (length - 1);
        int low_index = (int) Math.floor(index);
        int high_index = low_index + 1;
        double weight = index - low_index;

        if (index <= 0) {
            low_index = 0;
            high_index = 0;
            weight = 0;
        }
        if (index >= length - 1) {
            low_index = length - 1;
            high_index = length - 1;
            weight = 0;
        }
        double low_value = elementGetter.apply(low_index);
        double high_value = elementGetter.apply(high_index);
        return low_value + weight * (high_value - low_value);
    }

    public static double quantile(final double q, double[] data) {
        return quantile(q, data.length, (i) -> data[i]);
    }

    public static double quantile(final double q, List<Double> data) {
        return quantile(q, data.size(), data::get);
    }
}
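
A minimal sketch (hypothetical class name, not part of this change) showing the reference behaviour of Dist on a small sorted array; the expected values follow directly from the interpolation code above.

import org.elasticsearch.tdigest.Dist;

// Hypothetical, illustration only: Dist gives exact answers on fully sorted data,
// which the tests use as a reference for the approximate digests.
public class DistExample {
    public static void main(String[] args) {
        double[] sorted = { 1, 2, 3, 4 };
        System.out.println(Dist.quantile(0.5, sorted)); // 2.5, midway between 2 and 3
        System.out.println(Dist.cdf(2.5, sorted));      // 0.5, half the mass lies below 2.5
    }
}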
@@ -0,0 +1,586 @@
/*
 * Licensed to Elasticsearch B.V. under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch B.V. licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 *
 * This project is based on a modification of https://github.com/tdunning/t-digest which is licensed under the Apache 2.0 License.
 */

package org.elasticsearch.tdigest;

import java.util.Arrays;

/**
 * An AVL-tree structure stored in parallel arrays.
 * This class only stores the tree structure, so you need to extend it if you
 * want to add data to the nodes, typically by using arrays and node
 * identifiers as indices.
 */
abstract class IntAVLTree {

    /**
     * We use <code>0</code> instead of <code>-1</code> so that left(NIL) works without
     * condition.
     */
    protected static final int NIL = 0;

    /** Grow a size by 1/8. */
    static int oversize(int size) {
        return size + (size >>> 3);
    }

    private final NodeAllocator nodeAllocator;
    private int root;
    private int[] parent;
    private int[] left;
    private int[] right;
    private byte[] depth;

    IntAVLTree(int initialCapacity) {
        nodeAllocator = new NodeAllocator();
        root = NIL;
        parent = new int[initialCapacity];
        left = new int[initialCapacity];
        right = new int[initialCapacity];
        depth = new byte[initialCapacity];
    }

    IntAVLTree() {
        this(16);
    }

    /**
     * Return the current root of the tree.
     */
    public int root() {
        return root;
    }

    /**
     * Return the current capacity, which is the number of nodes that this tree
     * can hold.
     */
    public int capacity() {
        return parent.length;
    }

    /**
     * Resize internal storage in order to be able to store data for nodes up to
     * <code>newCapacity</code> (excluded).
     */
    protected void resize(int newCapacity) {
        parent = Arrays.copyOf(parent, newCapacity);
        left = Arrays.copyOf(left, newCapacity);
        right = Arrays.copyOf(right, newCapacity);
        depth = Arrays.copyOf(depth, newCapacity);
    }

    /**
     * Return the size of this tree.
     */
    public int size() {
        return nodeAllocator.size();
    }

    /**
     * Return the parent of the provided node.
     */
    public int parent(int node) {
        return parent[node];
    }

    /**
     * Return the left child of the provided node.
     */
    public int left(int node) {
        return left[node];
    }

    /**
     * Return the right child of the provided node.
     */
    public int right(int node) {
        return right[node];
    }

    /**
     * Return the depth of nodes that are stored below <code>node</code>, including itself.
     */
    public int depth(int node) {
        return depth[node];
    }

    /**
     * Return the least node under <code>node</code>.
     */
    public int first(int node) {
        if (node == NIL) {
            return NIL;
        }
        while (true) {
            final int left = left(node);
            if (left == NIL) {
                break;
            }
            node = left;
        }
        return node;
    }

    /**
     * Return the largest node under <code>node</code>.
     */
    public int last(int node) {
        while (true) {
            final int right = right(node);
            if (right == NIL) {
                break;
            }
            node = right;
        }
        return node;
    }

    /**
     * Return the least node that is strictly greater than <code>node</code>.
     */
    public final int next(int node) {
        final int right = right(node);
        if (right != NIL) {
            return first(right);
        } else {
            int parent = parent(node);
            while (parent != NIL && node == right(parent)) {
                node = parent;
                parent = parent(parent);
            }
            return parent;
        }
    }

    /**
     * Return the highest node that is strictly less than <code>node</code>.
     */
    public final int prev(int node) {
        final int left = left(node);
        if (left != NIL) {
            return last(left);
        } else {
            int parent = parent(node);
            while (parent != NIL && node == left(parent)) {
                node = parent;
                parent = parent(parent);
            }
            return parent;
        }
    }

    /**
     * Compare data against data which is stored in <code>node</code>.
     */
    protected abstract int compare(int node);

    /**
     * Copy data into <code>node</code>.
     */
    protected abstract void copy(int node);

    /**
     * Merge data into <code>node</code>.
     */
    protected abstract void merge(int node);

    /**
     * Add current data to the tree and return <code>true</code> if a new node was added
     * to the tree or <code>false</code> if the node was merged into an existing node.
     */
    public boolean add() {
        if (root == NIL) {
            root = nodeAllocator.newNode();
            copy(root);
            fixAggregates(root);
            return true;
        } else {
            int node = root;
            assert parent(root) == NIL;
            int parent;
            int cmp;
            do {
                cmp = compare(node);
                if (cmp < 0) {
                    parent = node;
                    node = left(node);
                } else if (cmp > 0) {
                    parent = node;
                    node = right(node);
                } else {
                    merge(node);
                    return false;
                }
            } while (node != NIL);

            node = nodeAllocator.newNode();
            if (node >= capacity()) {
                resize(oversize(node + 1));
            }
            copy(node);
            parent(node, parent);
            if (cmp < 0) {
                left(parent, node);
            } else {
                right(parent, node);
            }

            rebalance(node);

            return true;
        }
    }

    /**
     * Find a node in this tree.
     */
    public int find() {
        for (int node = root; node != NIL;) {
            final int cmp = compare(node);
            if (cmp < 0) {
                node = left(node);
            } else if (cmp > 0) {
                node = right(node);
            } else {
                return node;
            }
        }
        return NIL;
    }

    /**
     * Update <code>node</code> with the current data.
     */
    public void update(int node) {
        final int prev = prev(node);
        final int next = next(node);
        if ((prev == NIL || compare(prev) > 0) && (next == NIL || compare(next) < 0)) {
            // Update can be done in-place
            copy(node);
            for (int n = node; n != NIL; n = parent(n)) {
                fixAggregates(n);
            }
        } else {
            // TODO: it should be possible to find the new node position without
            // starting from scratch
            remove(node);
            add();
        }
    }

    /**
     * Remove the specified node from the tree.
     */
    public void remove(int node) {
        if (node == NIL) {
            throw new IllegalArgumentException();
        }
        if (left(node) != NIL && right(node) != NIL) {
            // inner node
            final int next = next(node);
            assert next != NIL;
            swap(node, next);
        }
        assert left(node) == NIL || right(node) == NIL;

        final int parent = parent(node);
        int child = left(node);
        if (child == NIL) {
            child = right(node);
        }

        if (child == NIL) {
            // no children
            if (node == root) {
                assert size() == 1 : size();
                root = NIL;
            } else {
                if (node == left(parent)) {
                    left(parent, NIL);
                } else {
                    assert node == right(parent);
                    right(parent, NIL);
                }
            }
        } else {
            // one single child
            if (node == root) {
                assert size() == 2;
                root = child;
            } else if (node == left(parent)) {
                left(parent, child);
            } else {
                assert node == right(parent);
                right(parent, child);
            }
            parent(child, parent);
        }

        release(node);
        rebalance(parent);
    }

    private void release(int node) {
        left(node, NIL);
        right(node, NIL);
        parent(node, NIL);
        nodeAllocator.release(node);
    }

    private void swap(int node1, int node2) {
        final int parent1 = parent(node1);
        final int parent2 = parent(node2);
        if (parent1 != NIL) {
            if (node1 == left(parent1)) {
                left(parent1, node2);
            } else {
                assert node1 == right(parent1);
                right(parent1, node2);
            }
        } else {
            assert root == node1;
            root = node2;
        }
        if (parent2 != NIL) {
            if (node2 == left(parent2)) {
                left(parent2, node1);
            } else {
                assert node2 == right(parent2);
                right(parent2, node1);
            }
        } else {
            assert root == node2;
            root = node1;
        }
        parent(node1, parent2);
        parent(node2, parent1);

        final int left1 = left(node1);
        final int left2 = left(node2);
        left(node1, left2);
        if (left2 != NIL) {
            parent(left2, node1);
        }
        left(node2, left1);
        if (left1 != NIL) {
            parent(left1, node2);
        }

        final int right1 = right(node1);
        final int right2 = right(node2);
        right(node1, right2);
        if (right2 != NIL) {
            parent(right2, node1);
        }
        right(node2, right1);
        if (right1 != NIL) {
            parent(right1, node2);
        }

        final int depth1 = depth(node1);
        final int depth2 = depth(node2);
        depth(node1, depth2);
        depth(node2, depth1);
    }

    private int balanceFactor(int node) {
        return depth(left(node)) - depth(right(node));
    }

    private void rebalance(int node) {
        for (int n = node; n != NIL;) {
            final int p = parent(n);

            fixAggregates(n);

            switch (balanceFactor(n)) {
                case -2:
                    final int right = right(n);
                    if (balanceFactor(right) == 1) {
                        rotateRight(right);
                    }
                    rotateLeft(n);
                    break;
                case 2:
                    final int left = left(n);
                    if (balanceFactor(left) == -1) {
                        rotateLeft(left);
                    }
                    rotateRight(n);
|
||||
break;
|
||||
case -1:
|
||||
case 0:
|
||||
case 1:
|
||||
break; // ok
|
||||
default:
|
||||
throw new AssertionError();
|
||||
}
|
||||
|
||||
n = p;
|
||||
}
|
||||
}
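// Illustrative walk-through (not part of the original patch): with balanceFactor defined as
// depth(left) - depth(right), inserting keys in the order 3, 2, 1 leaves the node holding 3
// with balanceFactor == 2 and its left child (holding 2) with balanceFactor == 1, so the switch
// above takes the plain left-left case and calls rotateRight on the node holding 3, making 2
// the subtree root. Inserting 3, 1, 2 instead gives the left child a balanceFactor of -1, the
// left-right case, so rotateLeft runs on that child first and rotateRight on the node follows.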
|
||||
|
||||
protected void fixAggregates(int node) {
|
||||
depth(node, 1 + Math.max(depth(left(node)), depth(right(node))));
|
||||
}
|
||||
|
||||
/** Rotate left the subtree under <code>n</code> */
|
||||
private void rotateLeft(int n) {
|
||||
final int r = right(n);
|
||||
final int lr = left(r);
|
||||
right(n, lr);
|
||||
if (lr != NIL) {
|
||||
parent(lr, n);
|
||||
}
|
||||
final int p = parent(n);
|
||||
parent(r, p);
|
||||
if (p == NIL) {
|
||||
root = r;
|
||||
} else if (left(p) == n) {
|
||||
left(p, r);
|
||||
} else {
|
||||
assert right(p) == n;
|
||||
right(p, r);
|
||||
}
|
||||
left(r, n);
|
||||
parent(n, r);
|
||||
fixAggregates(n);
|
||||
fixAggregates(parent(n));
|
||||
}
|
||||
|
||||
/** Rotate right the subtree under <code>n</code> */
|
||||
private void rotateRight(int n) {
|
||||
final int l = left(n);
|
||||
final int rl = right(l);
|
||||
left(n, rl);
|
||||
if (rl != NIL) {
|
||||
parent(rl, n);
|
||||
}
|
||||
final int p = parent(n);
|
||||
parent(l, p);
|
||||
if (p == NIL) {
|
||||
root = l;
|
||||
} else if (right(p) == n) {
|
||||
right(p, l);
|
||||
} else {
|
||||
assert left(p) == n;
|
||||
left(p, l);
|
||||
}
|
||||
right(l, n);
|
||||
parent(n, l);
|
||||
fixAggregates(n);
|
||||
fixAggregates(parent(n));
|
||||
}
|
||||
|
||||
private void parent(int node, int parent) {
|
||||
assert node != NIL;
|
||||
this.parent[node] = parent;
|
||||
}
|
||||
|
||||
private void left(int node, int left) {
|
||||
assert node != NIL;
|
||||
this.left[node] = left;
|
||||
}
|
||||
|
||||
private void right(int node, int right) {
|
||||
assert node != NIL;
|
||||
this.right[node] = right;
|
||||
}
|
||||
|
||||
private void depth(int node, int depth) {
|
||||
assert node != NIL;
|
||||
assert depth >= 0 && depth <= Byte.MAX_VALUE;
|
||||
this.depth[node] = (byte) depth;
|
||||
}
|
||||
|
||||
void checkBalance(int node) {
|
||||
if (node == NIL) {
|
||||
assert depth(node) == 0;
|
||||
} else {
|
||||
assert depth(node) == 1 + Math.max(depth(left(node)), depth(right(node)));
|
||||
assert Math.abs(depth(left(node)) - depth(right(node))) <= 1;
|
||||
checkBalance(left(node));
|
||||
checkBalance(right(node));
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* A stack of int values.
|
||||
*/
|
||||
private static class IntStack {
|
||||
|
||||
private int[] stack;
|
||||
private int size;
|
||||
|
||||
IntStack() {
|
||||
stack = new int[0];
|
||||
size = 0;
|
||||
}
|
||||
|
||||
int size() {
|
||||
return size;
|
||||
}
|
||||
|
||||
int pop() {
|
||||
return stack[--size];
|
||||
}
|
||||
|
||||
void push(int v) {
|
||||
if (size >= stack.length) {
|
||||
final int newLength = oversize(size + 1);
|
||||
stack = Arrays.copyOf(stack, newLength);
|
||||
}
|
||||
stack[size++] = v;
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
private static class NodeAllocator {
|
||||
|
||||
private int nextNode;
|
||||
private final IntStack releasedNodes;
|
||||
|
||||
NodeAllocator() {
|
||||
nextNode = NIL + 1;
|
||||
releasedNodes = new IntStack();
|
||||
}
|
||||
|
||||
int newNode() {
|
||||
if (releasedNodes.size() > 0) {
|
||||
return releasedNodes.pop();
|
||||
} else {
|
||||
return nextNode++;
|
||||
}
|
||||
}
|
||||
|
||||
void release(int node) {
|
||||
assert node < nextNode;
|
||||
releasedNodes.push(node);
|
||||
}
|
||||
|
||||
int size() {
|
||||
return nextNode - releasedNodes.size() - 1;
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
}
|
|
@@ -0,0 +1,620 @@
|
|||
/*
|
||||
* Licensed to Elasticsearch B.V. under one or more contributor
|
||||
* license agreements. See the NOTICE file distributed with
|
||||
* this work for additional information regarding copyright
|
||||
* ownership. Elasticsearch B.V. licenses this file to you under
|
||||
* the Apache License, Version 2.0 (the "License"); you may
|
||||
* not use this file except in compliance with the License.
|
||||
* You may obtain a copy of the License at
|
||||
*
|
||||
* http://www.apache.org/licenses/LICENSE-2.0
|
||||
*
|
||||
* Unless required by applicable law or agreed to in writing,
|
||||
* software distributed under the License is distributed on an
|
||||
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||
* KIND, either express or implied. See the License for the
|
||||
* specific language governing permissions and limitations
|
||||
* under the License.
|
||||
*
|
||||
* This project is based on a modification of https://github.com/tdunning/t-digest which is licensed under the Apache 2.0 License.
|
||||
*/
|
||||
|
||||
package org.elasticsearch.tdigest;
|
||||
|
||||
import java.util.AbstractCollection;
|
||||
import java.util.Collection;
|
||||
import java.util.Iterator;
|
||||
import java.util.List;
|
||||
|
||||
/**
|
||||
* Maintains a t-digest by collecting new points in a buffer that is then sorted occasionally and merged
|
||||
* into a sorted array that contains previously computed centroids.
|
||||
* <p>
|
||||
* This can be very fast because the cost of sorting and merging is amortized over several insertions. If
* we keep N centroids total and the input buffer is k long, then the amortized cost is something like
|
||||
* <p>
|
||||
* N/k + log k
|
||||
* <p>
|
||||
* These costs even out when N/k = log k. Balancing costs is often a good place to start in optimizing an
|
||||
* algorithm. For different values of compression factor, the following table shows estimated asymptotic
|
||||
* values of N and suggested values of k:
|
||||
* <table>
|
||||
* <thead>
|
||||
* <tr><td>Compression</td><td>N</td><td>k</td></tr>
|
||||
* </thead>
|
||||
* <tbody>
|
||||
* <tr><td>50</td><td>78</td><td>25</td></tr>
|
||||
* <tr><td>100</td><td>157</td><td>42</td></tr>
|
||||
* <tr><td>200</td><td>314</td><td>73</td></tr>
|
||||
* </tbody>
|
||||
* <caption>Sizing considerations for t-digest</caption>
|
||||
* </table>
|
||||
* <p>
|
||||
* The virtues of this kind of t-digest implementation include:
|
||||
* <ul>
|
||||
* <li>No allocation is required after initialization</li>
|
||||
* <li>The data structure automatically compresses existing centroids when possible</li>
|
||||
* <li>No Java object overhead is incurred for centroids since data is kept in primitive arrays</li>
|
||||
* </ul>
|
||||
* <p>
|
||||
* The current implementation takes the liberty of using ping-pong buffers for implementing the merge, resulting
* in a substantial memory penalty, but the complexity of an in-place merge was not considered worthwhile
* since, even with the overhead, the memory cost is less than 40 bytes per centroid, which is much less than half
* of what the AVLTreeDigest uses, and no dynamic allocation is required at all.
|
||||
*/
|
||||
public class MergingDigest extends AbstractTDigest {
|
||||
private int mergeCount = 0;
|
||||
|
||||
private final double publicCompression;
|
||||
private final double compression;
|
||||
|
||||
// points to the first unused centroid
|
||||
private int lastUsedCell;
|
||||
|
||||
// sum_i weight[i] See also unmergedWeight
|
||||
private double totalWeight = 0;
|
||||
|
||||
// number of points that have been added to each merged centroid
|
||||
private final double[] weight;
|
||||
// mean of points added to each merged centroid
|
||||
private final double[] mean;
|
||||
|
||||
// sum_i tempWeight[i]
|
||||
private double unmergedWeight = 0;
|
||||
|
||||
// this is the index of the next temporary centroid
|
||||
// this is a more Java-like convention than lastUsedCell uses
|
||||
private int tempUsed = 0;
|
||||
private final double[] tempWeight;
|
||||
private final double[] tempMean;
|
||||
|
||||
// array used for sorting the temp centroids. This is a field
|
||||
// to avoid allocations during operation
|
||||
private final int[] order;
|
||||
|
||||
// if true, alternate upward and downward merge passes
|
||||
public boolean useAlternatingSort = true;
|
||||
// if true, use higher working value of compression during construction, then reduce on presentation
|
||||
public boolean useTwoLevelCompression = true;
|
||||
|
||||
// this forces centroid merging based on size limit rather than
// based on accumulated k-index. This can be much faster since the
// scale functions are more expensive to evaluate than the corresponding
// weight limits.
|
||||
public static boolean useWeightLimit = true;
|
||||
|
||||
/**
|
||||
* Allocates a buffer merging t-digest. This is the normally used constructor that
|
||||
* allocates default sized internal arrays. Other versions are available, but should
|
||||
* only be used for special cases.
|
||||
*
|
||||
* @param compression The compression factor
|
||||
*/
|
||||
public MergingDigest(double compression) {
|
||||
this(compression, -1);
|
||||
}
|
||||
|
||||
/**
|
||||
* If you know the size of the temporary buffer for incoming points, you can use this entry point.
|
||||
*
|
||||
* @param compression Compression factor for t-digest. Same as 1/\delta in the paper.
|
||||
* @param bufferSize How many samples to retain before merging.
|
||||
*/
|
||||
public MergingDigest(double compression, int bufferSize) {
|
||||
// we can guarantee that we only need ceiling(compression).
|
||||
this(compression, bufferSize, -1);
|
||||
}
|
||||
|
||||
/**
|
||||
* Fully specified constructor. Normally only used for deserializing a buffer t-digest.
|
||||
*
|
||||
* @param compression Compression factor
|
||||
* @param bufferSize Number of temporary centroids
|
||||
* @param size Size of main buffer
|
||||
*/
|
||||
public MergingDigest(double compression, int bufferSize, int size) {
|
||||
// ensure compression >= 10
|
||||
// default size = 2 * ceil(compression)
|
||||
// default bufferSize = 5 * size
|
||||
// scale = max(2, bufferSize / size - 1)
|
||||
// compression, publicCompression = sqrt(scale-1)*compression, compression
|
||||
// ensure size > 2 * compression + weightLimitFudge
|
||||
// ensure bufferSize > 2*size
|
||||
|
||||
// force reasonable value. Anything less than 10 doesn't make much sense because
|
||||
// too few centroids are retained
|
||||
if (compression < 10) {
|
||||
compression = 10;
|
||||
}
|
||||
|
||||
// the weight limit is too conservative about sizes and can require a bit of extra room
|
||||
double sizeFudge = 0;
|
||||
if (useWeightLimit) {
|
||||
sizeFudge = 10;
|
||||
}
|
||||
|
||||
// default size
|
||||
size = (int) Math.max(compression + sizeFudge, size);
|
||||
|
||||
// default buffer size has enough capacity
|
||||
if (bufferSize < 5 * size) {
|
||||
// TODO update with current numbers
|
||||
// having a big buffer is good for speed
|
||||
// experiments show bufferSize = 1 gives half the performance of bufferSize=10
|
||||
// bufferSize = 2 gives 40% worse performance than 10
|
||||
// but bufferSize = 5 only costs about 5-10%
|
||||
//
|
||||
// compression factor time(us)
|
||||
// 50 1 0.275799
|
||||
// 50 2 0.151368
|
||||
// 50 5 0.108856
|
||||
// 50 10 0.102530
|
||||
// 100 1 0.215121
|
||||
// 100 2 0.142743
|
||||
// 100 5 0.112278
|
||||
// 100 10 0.107753
|
||||
// 200 1 0.210972
|
||||
// 200 2 0.148613
|
||||
// 200 5 0.118220
|
||||
// 200 10 0.112970
|
||||
// 500 1 0.219469
|
||||
// 500 2 0.158364
|
||||
// 500 5 0.127552
|
||||
// 500 10 0.121505
|
||||
bufferSize = 5 * size;
|
||||
}
|
||||
|
||||
// scale is the ratio of extra buffer to the final size
|
||||
// we have to account for the fact that we copy all live centroids into the incoming space
|
||||
double scale = Math.max(1, bufferSize / size - 1);
|
||||
if (useTwoLevelCompression == false) {
|
||||
scale = 1;
|
||||
}
|
||||
|
||||
// publicCompression is how many centroids the user asked for
|
||||
// compression is how many we actually keep
|
||||
this.publicCompression = compression;
|
||||
this.compression = Math.sqrt(scale) * publicCompression;
|
||||
|
||||
// changing the compression could cause buffers to be too small, readjust if so
|
||||
if (size < this.compression + sizeFudge) {
|
||||
size = (int) Math.ceil(this.compression + sizeFudge);
|
||||
}
|
||||
|
||||
// ensure enough space in buffer (possibly again)
|
||||
if (bufferSize <= 2 * size) {
|
||||
bufferSize = 2 * size;
|
||||
}
|
||||
|
||||
weight = new double[size];
|
||||
mean = new double[size];
|
||||
|
||||
tempWeight = new double[bufferSize];
|
||||
tempMean = new double[bufferSize];
|
||||
order = new int[bufferSize];
|
||||
|
||||
lastUsedCell = 0;
|
||||
}
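// Worked example (illustrative, not part of the original patch) of the sizing logic above for
// new MergingDigest(100), i.e. bufferSize == -1 and size == -1 on entry: useWeightLimit is true,
// so sizeFudge == 10 and size becomes max(100 + 10, -1) == 110; bufferSize < 5 * 110 so it
// becomes 550; scale == max(1, 550 / 110 - 1) == 4 by integer division, so the working
// compression is sqrt(4) * 100 == 200; that pushes size up to ceil(200 + 10) == 210 while
// bufferSize stays at 550 (550 > 2 * 210). The result is 210-entry mean/weight arrays plus
// 550-entry temp buffers, roughly 14 KB according to byteSize() further down.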
|
||||
|
||||
@Override
|
||||
public void add(double x, int w) {
|
||||
checkValue(x);
|
||||
if (tempUsed >= tempWeight.length - lastUsedCell - 1) {
|
||||
mergeNewValues();
|
||||
}
|
||||
int where = tempUsed++;
|
||||
tempWeight[where] = w;
|
||||
tempMean[where] = x;
|
||||
unmergedWeight += w;
|
||||
if (x < min) {
|
||||
min = x;
|
||||
}
|
||||
if (x > max) {
|
||||
max = x;
|
||||
}
|
||||
}
|
||||
|
||||
private void add(double[] m, double[] w, int count) {
|
||||
if (m.length != w.length) {
|
||||
throw new IllegalArgumentException("Arrays not same length");
|
||||
}
|
||||
if (m.length < count + lastUsedCell) {
|
||||
// make room to add existing centroids
|
||||
double[] m1 = new double[count + lastUsedCell];
|
||||
System.arraycopy(m, 0, m1, 0, count);
|
||||
m = m1;
|
||||
double[] w1 = new double[count + lastUsedCell];
|
||||
System.arraycopy(w, 0, w1, 0, count);
|
||||
w = w1;
|
||||
}
|
||||
double total = 0;
|
||||
for (int i = 0; i < count; i++) {
|
||||
total += w[i];
|
||||
}
|
||||
merge(m, w, count, null, total, false, compression);
|
||||
}
|
||||
|
||||
@Override
|
||||
public void add(List<? extends TDigest> others) {
|
||||
if (others.size() == 0) {
|
||||
return;
|
||||
}
|
||||
int size = 0;
|
||||
for (TDigest other : others) {
|
||||
other.compress();
|
||||
size += other.centroidCount();
|
||||
}
|
||||
|
||||
double[] m = new double[size];
|
||||
double[] w = new double[size];
|
||||
int offset = 0;
|
||||
for (TDigest other : others) {
|
||||
if (other instanceof MergingDigest md) {
|
||||
System.arraycopy(md.mean, 0, m, offset, md.lastUsedCell);
|
||||
System.arraycopy(md.weight, 0, w, offset, md.lastUsedCell);
|
||||
offset += md.lastUsedCell;
|
||||
} else {
|
||||
for (Centroid centroid : other.centroids()) {
|
||||
m[offset] = centroid.mean();
|
||||
w[offset] = centroid.count();
|
||||
offset++;
|
||||
}
|
||||
}
|
||||
}
|
||||
add(m, w, size);
|
||||
}
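// Illustrative usage (not part of the original patch, names are hypothetical): merging
// per-shard digests into a global one with the method above.
//
//   MergingDigest global = new MergingDigest(100);
//   MergingDigest shardA = new MergingDigest(100);
//   MergingDigest shardB = new MergingDigest(100);
//   // ... shardA.add(x, 1) and shardB.add(x, 1) as samples arrive ...
//   global.add(java.util.List.of(shardA, shardB)); // compresses each input, then merges all centroids
//   double p95 = global.quantile(0.95);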
|
||||
|
||||
private void mergeNewValues() {
|
||||
mergeNewValues(compression);
|
||||
}
|
||||
|
||||
private void mergeNewValues(double compression) {
|
||||
if (totalWeight == 0 && unmergedWeight == 0) {
|
||||
// seriously nothing to do
|
||||
return;
|
||||
}
|
||||
if (unmergedWeight > 0) {
|
||||
// note that we run the merge in reverse every other merge to avoid left-to-right bias in merging
|
||||
merge(tempMean, tempWeight, tempUsed, order, unmergedWeight, useAlternatingSort & mergeCount % 2 == 1, compression);
|
||||
mergeCount++;
|
||||
tempUsed = 0;
|
||||
unmergedWeight = 0;
|
||||
}
|
||||
}
|
||||
|
||||
private void merge(
|
||||
double[] incomingMean,
|
||||
double[] incomingWeight,
|
||||
int incomingCount,
|
||||
int[] incomingOrder,
|
||||
double unmergedWeight,
|
||||
boolean runBackwards,
|
||||
double compression
|
||||
) {
|
||||
// when our incoming buffer fills up, we combine our existing centroids with the incoming data,
|
||||
// and then reduce the centroids by merging if possible
|
||||
assert lastUsedCell <= 0 || weight[0] == 1;
|
||||
assert lastUsedCell <= 0 || weight[lastUsedCell - 1] == 1;
|
||||
System.arraycopy(mean, 0, incomingMean, incomingCount, lastUsedCell);
|
||||
System.arraycopy(weight, 0, incomingWeight, incomingCount, lastUsedCell);
|
||||
incomingCount += lastUsedCell;
|
||||
|
||||
if (incomingOrder == null) {
|
||||
incomingOrder = new int[incomingCount];
|
||||
}
|
||||
Sort.stableSort(incomingOrder, incomingMean, incomingCount);
|
||||
|
||||
totalWeight += unmergedWeight;
|
||||
|
||||
// option to run backwards is to help investigate bias in errors
|
||||
if (runBackwards) {
|
||||
Sort.reverse(incomingOrder, 0, incomingCount);
|
||||
}
|
||||
|
||||
// start by copying the least incoming value to the normal buffer
|
||||
lastUsedCell = 0;
|
||||
mean[lastUsedCell] = incomingMean[incomingOrder[0]];
|
||||
weight[lastUsedCell] = incomingWeight[incomingOrder[0]];
|
||||
double wSoFar = 0;
|
||||
|
||||
// weight will contain all zeros after this loop
|
||||
|
||||
double normalizer = scale.normalizer(compression, totalWeight);
|
||||
double k1 = scale.k(0, normalizer);
|
||||
double wLimit = totalWeight * scale.q(k1 + 1, normalizer);
|
||||
for (int i = 1; i < incomingCount; i++) {
|
||||
int ix = incomingOrder[i];
|
||||
double proposedWeight = weight[lastUsedCell] + incomingWeight[ix];
|
||||
double projectedW = wSoFar + proposedWeight;
|
||||
boolean addThis;
|
||||
if (useWeightLimit) {
|
||||
double q0 = wSoFar / totalWeight;
|
||||
double q2 = (wSoFar + proposedWeight) / totalWeight;
|
||||
addThis = proposedWeight <= totalWeight * Math.min(scale.max(q0, normalizer), scale.max(q2, normalizer));
|
||||
} else {
|
||||
addThis = projectedW <= wLimit;
|
||||
}
|
||||
if (i == 1 || i == incomingCount - 1) {
|
||||
// force first and last centroids to never merge
|
||||
addThis = false;
|
||||
}
|
||||
|
||||
if (addThis) {
|
||||
// next point will fit
|
||||
// so merge into existing centroid
|
||||
weight[lastUsedCell] += incomingWeight[ix];
|
||||
mean[lastUsedCell] = mean[lastUsedCell] + (incomingMean[ix] - mean[lastUsedCell]) * incomingWeight[ix]
|
||||
/ weight[lastUsedCell];
|
||||
incomingWeight[ix] = 0;
|
||||
} else {
|
||||
// didn't fit ... move to next output, copy out first centroid
|
||||
wSoFar += weight[lastUsedCell];
|
||||
if (useWeightLimit == false) {
|
||||
k1 = scale.k(wSoFar / totalWeight, normalizer);
|
||||
wLimit = totalWeight * scale.q(k1 + 1, normalizer);
|
||||
}
|
||||
|
||||
lastUsedCell++;
|
||||
mean[lastUsedCell] = incomingMean[ix];
|
||||
weight[lastUsedCell] = incomingWeight[ix];
|
||||
incomingWeight[ix] = 0;
|
||||
}
|
||||
}
|
||||
// points to next empty cell
|
||||
lastUsedCell++;
|
||||
|
||||
// sanity check
|
||||
double sum = 0;
|
||||
for (int i = 0; i < lastUsedCell; i++) {
|
||||
sum += weight[i];
|
||||
}
|
||||
assert sum == totalWeight;
|
||||
if (runBackwards) {
|
||||
Sort.reverse(mean, 0, lastUsedCell);
|
||||
Sort.reverse(weight, 0, lastUsedCell);
|
||||
}
|
||||
if (totalWeight > 0) {
|
||||
min = Math.min(min, mean[0]);
|
||||
max = Math.max(max, mean[lastUsedCell - 1]);
|
||||
}
|
||||
}
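// Worked example (illustrative, not part of the original patch), assuming the K_2 scale
// function is in use: with the working compression at 200 and totalWeight == 1000,
// Z == 4 * ln(1000 / 200) + 24 ≈ 30.4 and normalizer == 200 / 30.4 ≈ 6.6, so
// scale.max(q, normalizer) == q * (1 - q) / normalizer. Near the median (q ≈ 0.5) a centroid
// can grow to about 1000 * 0.25 / 6.6 ≈ 38 samples before addThis flips to false, while near
// a tail (q ≈ 0.01) the limit is about 1000 * 0.0099 / 6.6 ≈ 1.5 samples, which is what keeps
// the extreme centroids close to singletons.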
|
||||
|
||||
/**
|
||||
* Merges any pending inputs and compresses the data down to the public setting.
|
||||
* Note that this typically loses a bit of precision and thus isn't a thing to
|
||||
* be doing all the time. It is best done only when we want to show results to
|
||||
* the outside world.
|
||||
*/
|
||||
@Override
|
||||
public void compress() {
|
||||
mergeNewValues(publicCompression);
|
||||
}
|
||||
|
||||
@Override
|
||||
public long size() {
|
||||
return (long) (totalWeight + unmergedWeight);
|
||||
}
|
||||
|
||||
@Override
|
||||
public double cdf(double x) {
|
||||
checkValue(x);
|
||||
mergeNewValues();
|
||||
|
||||
if (lastUsedCell == 0) {
|
||||
// no data to examine
|
||||
return Double.NaN;
|
||||
}
|
||||
if (lastUsedCell == 1) {
|
||||
if (x < min) return 0;
|
||||
if (x > max) return 1;
|
||||
return 0.5;
|
||||
} else {
|
||||
if (x < min) {
|
||||
return 0;
|
||||
}
|
||||
if (Double.compare(x, min) == 0) {
|
||||
// we have one or more centroids == x, treat them as one
|
||||
// dw will accumulate the weight of all of the centroids at x
|
||||
double dw = 0;
|
||||
for (int i = 0; i < lastUsedCell && Double.compare(mean[i], x) == 0; i++) {
|
||||
dw += weight[i];
|
||||
}
|
||||
return dw / 2.0 / size();
|
||||
}
|
||||
|
||||
if (x > max) {
|
||||
return 1;
|
||||
}
|
||||
if (x == max) {
|
||||
double dw = 0;
|
||||
for (int i = lastUsedCell - 1; i >= 0 && Double.compare(mean[i], x) == 0; i--) {
|
||||
dw += weight[i];
|
||||
}
|
||||
return (size() - dw / 2.0) / size();
|
||||
}
|
||||
|
||||
// initially, we set left width equal to right width
|
||||
double left = (mean[1] - mean[0]) / 2;
|
||||
double weightSoFar = 0;
|
||||
|
||||
for (int i = 0; i < lastUsedCell - 1; i++) {
|
||||
double right = (mean[i + 1] - mean[i]) / 2;
|
||||
if (x < mean[i] + right) {
|
||||
double value = (weightSoFar + weight[i] * interpolate(x, mean[i] - left, mean[i] + right)) / size();
|
||||
return Math.max(value, 0.0);
|
||||
}
|
||||
weightSoFar += weight[i];
|
||||
left = right;
|
||||
}
|
||||
|
||||
// for the last element, assume right width is same as left
|
||||
int lastOffset = lastUsedCell - 1;
|
||||
double right = (mean[lastOffset] - mean[lastOffset - 1]) / 2;
|
||||
if (x < mean[lastOffset] + right) {
|
||||
return (weightSoFar + weight[lastOffset] * interpolate(x, mean[lastOffset] - right, mean[lastOffset] + right)) / size();
|
||||
}
|
||||
return 1;
|
||||
}
|
||||
}
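// Worked example (illustrative, not part of the original patch), assuming interpolate(x, x0, x1)
// is the linear helper (x - x0) / (x1 - x0) from the shared base class: with two merged centroids
// at means {10, 20} and weights {4, 6} (so size() == 10), cdf(12) reaches the loop above with
// left == right == 5, sees 12 < 10 + 5, and returns (0 + 4 * (12 - 5) / (15 - 5)) / 10 == 0.28.
// Values at or below min and at or above max never reach the loop; they are handled by the
// special cases earlier in the method.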
|
||||
|
||||
@Override
|
||||
public double quantile(double q) {
|
||||
if (q < 0 || q > 1) {
|
||||
throw new IllegalArgumentException("q should be in [0,1], got " + q);
|
||||
}
|
||||
mergeNewValues();
|
||||
|
||||
if (lastUsedCell == 0) {
|
||||
// no centroids means no data, no way to get a quantile
|
||||
return Double.NaN;
|
||||
} else if (lastUsedCell == 1) {
|
||||
// with one data point, all quantiles lead to Rome
|
||||
return mean[0];
|
||||
}
|
||||
|
||||
// we know that there are at least two centroids now
|
||||
int n = lastUsedCell;
|
||||
|
||||
// if values were stored in a sorted array, index would be the offset we are interested in
|
||||
final double index = q * totalWeight;
|
||||
|
||||
// beyond the boundaries, we return min or max
|
||||
// usually, the first and last centroids have unit weights so this will make it moot
|
||||
if (index < 0) {
|
||||
return min;
|
||||
}
|
||||
if (index >= totalWeight) {
|
||||
return max;
|
||||
}
|
||||
|
||||
double weightSoFar = weight[0] / 2;
|
||||
|
||||
// if the left centroid has more than one sample, we still know
|
||||
// that one sample occurred at min so we can do some interpolation
|
||||
if (weight[0] > 1 && index < weightSoFar) {
|
||||
// there is a single sample at min so we interpolate with less weight
|
||||
return weightedAverage(min, weightSoFar - index, mean[0], index);
|
||||
}
|
||||
|
||||
// if the right-most centroid has more than one sample, we still know
|
||||
// that one sample occurred at max so we can do some interpolation
|
||||
if (weight[n - 1] > 1 && totalWeight - index <= weight[n - 1] / 2) {
|
||||
return max - (totalWeight - index - 1) / (weight[n - 1] / 2 - 1) * (max - mean[n - 1]);
|
||||
}
|
||||
|
||||
// in between extremes we interpolate between centroids
|
||||
for (int i = 0; i < n - 1; i++) {
|
||||
double dw = (weight[i] + weight[i + 1]) / 2;
|
||||
if (weightSoFar + dw > index) {
|
||||
// centroids i and i+1 bracket our current point
|
||||
double z1 = index - weightSoFar;
|
||||
double z2 = weightSoFar + dw - index;
|
||||
return weightedAverage(mean[i], z2, mean[i + 1], z1);
|
||||
}
|
||||
weightSoFar += dw;
|
||||
}
|
||||
|
||||
assert weight[n - 1] >= 1;
|
||||
assert index >= totalWeight - weight[n - 1];
|
||||
|
||||
// Interpolate between the last mean and the max.
|
||||
double z1 = index - weightSoFar;
|
||||
double z2 = weight[n - 1] / 2.0 - z1;
|
||||
return weightedAverage(mean[n - 1], z1, max, z2);
|
||||
}
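// Worked example (illustrative, not part of the original patch), assuming weightedAverage is the
// weight-proportional mean from the shared base class: with merged centroids at means {1, 2, 3}
// and weights {1, 2, 1}, quantile(0.5) gives index == 2 and weightSoFar == 0.5. Both edge
// centroids are singletons, so neither min/max shortcut applies; the i == 0 step fails the strict
// test (2 > 2 is false), and the i == 1 step interpolates with z1 == 0 and z2 == 1.5, returning
// mean[1] == 2, which is the exact median if each centroid stands for that many samples at its mean.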
|
||||
|
||||
@Override
|
||||
public int centroidCount() {
|
||||
mergeNewValues();
|
||||
return lastUsedCell;
|
||||
}
|
||||
|
||||
@Override
|
||||
public Collection<Centroid> centroids() {
|
||||
mergeNewValues();
|
||||
|
||||
// we don't actually keep centroid structures around so we have to fake it
|
||||
return new AbstractCollection<>() {
|
||||
@Override
|
||||
public Iterator<Centroid> iterator() {
|
||||
return new Iterator<>() {
|
||||
int i = 0;
|
||||
|
||||
@Override
|
||||
public boolean hasNext() {
|
||||
return i < lastUsedCell;
|
||||
}
|
||||
|
||||
@Override
|
||||
public Centroid next() {
|
||||
Centroid rc = new Centroid(mean[i], (int) weight[i]);
|
||||
i++;
|
||||
return rc;
|
||||
}
|
||||
|
||||
@Override
|
||||
public void remove() {
|
||||
throw new UnsupportedOperationException("Default operation");
|
||||
}
|
||||
};
|
||||
}
|
||||
|
||||
@Override
|
||||
public int size() {
|
||||
return lastUsedCell;
|
||||
}
|
||||
};
|
||||
}
|
||||
|
||||
@Override
|
||||
public double compression() {
|
||||
return publicCompression;
|
||||
}
|
||||
|
||||
public ScaleFunction getScaleFunction() {
|
||||
return scale;
|
||||
}
|
||||
|
||||
@Override
|
||||
public void setScaleFunction(ScaleFunction scaleFunction) {
|
||||
super.setScaleFunction(scaleFunction);
|
||||
}
|
||||
|
||||
@Override
|
||||
public int byteSize() {
|
||||
return 48 + 8 * (mean.length + weight.length + tempMean.length + tempWeight.length) + 4 * order.length;
|
||||
}
|
||||
|
||||
@Override
|
||||
public String toString() {
|
||||
return "MergingDigest"
|
||||
+ "-"
|
||||
+ getScaleFunction()
|
||||
+ "-"
|
||||
+ (useWeightLimit ? "weight" : "kSize")
|
||||
+ "-"
|
||||
+ (useAlternatingSort ? "alternating" : "stable")
|
||||
+ "-"
|
||||
+ (useTwoLevelCompression ? "twoLevel" : "oneLevel");
|
||||
}
|
||||
}
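A minimal end-to-end sketch of the class above (illustrative only, not part of the patch; it sticks to the constructors and methods shown in this diff):

    MergingDigest digest = new MergingDigest(100);   // publicCompression == 100
    java.util.Random rand = new java.util.Random(42);
    for (int i = 0; i < 100_000; i++) {
        digest.add(rand.nextDouble(), 1);            // buffered, merged lazily once the temp buffer fills
    }
    digest.compress();                               // fold back down to the public compression
    double median = digest.quantile(0.5);            // close to 0.5 for uniform input
    double belowTenth = digest.cdf(0.1);             // close to 0.1
    int clusters = digest.centroidCount();           // on the order of the compression setting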
|
|
@@ -0,0 +1,673 @@
|
|||
/*
|
||||
* Licensed to Elasticsearch B.V. under one or more contributor
|
||||
* license agreements. See the NOTICE file distributed with
|
||||
* this work for additional information regarding copyright
|
||||
* ownership. Elasticsearch B.V. licenses this file to you under
|
||||
* the Apache License, Version 2.0 (the "License"); you may
|
||||
* not use this file except in compliance with the License.
|
||||
* You may obtain a copy of the License at
|
||||
*
|
||||
* http://www.apache.org/licenses/LICENSE-2.0
|
||||
*
|
||||
* Unless required by applicable law or agreed to in writing,
|
||||
* software distributed under the License is distributed on an
|
||||
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||
* KIND, either express or implied. See the License for the
|
||||
* specific language governing permissions and limitations
|
||||
* under the License.
|
||||
*
|
||||
* This project is based on a modification of https://github.com/tdunning/t-digest which is licensed under the Apache 2.0 License.
|
||||
*/
|
||||
|
||||
package org.elasticsearch.tdigest;
|
||||
|
||||
/**
|
||||
* Encodes the various scale functions for t-digests. These limits trade accuracy near the tails against accuracy near
|
||||
* the median in different ways. For instance, K_0 has uniform cluster sizes and results in constant accuracy (in terms
|
||||
* of q) while K_3 has cluster sizes proportional to min(q,1-q) which results in very much smaller error near the tails
|
||||
* and modestly increased error near the median.
|
||||
* <p>
|
||||
* The base forms (K_0, K_1, K_2 and K_3) all result in t-digests limited to a number of clusters equal to the
|
||||
* compression factor. The K_2_NO_NORM and K_3_NO_NORM versions result in the cluster count increasing roughly with
|
||||
* log(n).
|
||||
*/
|
||||
public enum ScaleFunction {
|
||||
/**
|
||||
* Generates uniform cluster sizes. Used for comparison only.
|
||||
*/
|
||||
K_0 {
|
||||
@Override
|
||||
public double k(double q, double compression, double n) {
|
||||
return compression * q / 2;
|
||||
}
|
||||
|
||||
@Override
|
||||
public double k(double q, double normalizer) {
|
||||
return normalizer * q;
|
||||
}
|
||||
|
||||
@Override
|
||||
public double q(double k, double compression, double n) {
|
||||
return 2 * k / compression;
|
||||
}
|
||||
|
||||
@Override
|
||||
public double q(double k, double normalizer) {
|
||||
return k / normalizer;
|
||||
}
|
||||
|
||||
@Override
|
||||
public double max(double q, double compression, double n) {
|
||||
return 2 / compression;
|
||||
}
|
||||
|
||||
@Override
|
||||
public double max(double q, double normalizer) {
|
||||
return 1 / normalizer;
|
||||
}
|
||||
|
||||
@Override
|
||||
public double normalizer(double compression, double n) {
|
||||
return compression / 2;
|
||||
}
|
||||
},
|
||||
|
||||
/**
|
||||
* Generates cluster sizes proportional to sqrt(q*(1-q)). This gives constant relative accuracy if accuracy is
|
||||
* proportional to squared cluster size. It is expected that K_2 and K_3 will give better practical results.
|
||||
*/
|
||||
K_1 {
|
||||
@Override
|
||||
public double k(final double q, final double compression, double n) {
|
||||
Function f = new Function() {
|
||||
@Override
|
||||
double apply(double q) {
|
||||
return compression * Math.asin(2 * q - 1) / (2 * Math.PI);
|
||||
}
|
||||
};
|
||||
return ScaleFunction.limitCall(f, q, 1e-15, 1 - 1e-15);
|
||||
}
|
||||
|
||||
@Override
|
||||
public double k(final double q, final double normalizer) {
|
||||
Function f = new Function() {
|
||||
@Override
|
||||
double apply(double q) {
|
||||
return normalizer * Math.asin(2 * q - 1);
|
||||
}
|
||||
};
|
||||
return ScaleFunction.limitCall(f, q, 1e-15, 1 - 1e-15);
|
||||
}
|
||||
|
||||
@Override
|
||||
public double q(double k, final double compression, double n) {
|
||||
Function f = new Function() {
|
||||
@Override
|
||||
double apply(double k) {
|
||||
return (Math.sin(k * (2 * Math.PI / compression)) + 1) / 2;
|
||||
}
|
||||
};
|
||||
return ScaleFunction.limitCall(f, k, -compression / 4, compression / 4);
|
||||
}
|
||||
|
||||
@Override
|
||||
public double q(double k, final double normalizer) {
|
||||
Function f = new Function() {
|
||||
@Override
|
||||
double apply(double x) {
|
||||
return (Math.sin(x) + 1) / 2;
|
||||
}
|
||||
};
|
||||
double x = k / normalizer;
|
||||
return ScaleFunction.limitCall(f, x, -Math.PI / 2, Math.PI / 2);
|
||||
}
|
||||
|
||||
@Override
|
||||
public double max(double q, double compression, double n) {
|
||||
if (q <= 0) {
|
||||
return 0;
|
||||
} else if (q >= 1) {
|
||||
return 0;
|
||||
} else {
|
||||
return 2 * Math.sin(Math.PI / compression) * Math.sqrt(q * (1 - q));
|
||||
}
|
||||
}
|
||||
|
||||
@Override
|
||||
public double max(double q, double normalizer) {
|
||||
if (q <= 0) {
|
||||
return 0;
|
||||
} else if (q >= 1) {
|
||||
return 0;
|
||||
} else {
|
||||
return 2 * Math.sin(0.5 / normalizer) * Math.sqrt(q * (1 - q));
|
||||
}
|
||||
}
|
||||
|
||||
@Override
|
||||
public double normalizer(double compression, double n) {
|
||||
return compression / (2 * Math.PI);
|
||||
}
|
||||
},
|
||||
|
||||
/**
|
||||
* Generates cluster sizes proportional to sqrt(q*(1-q)) but avoids computation of asin in the critical path by
|
||||
* using an approximate version.
|
||||
*/
|
||||
K_1_FAST {
|
||||
@Override
|
||||
public double k(double q, final double compression, double n) {
|
||||
Function f = new Function() {
|
||||
@Override
|
||||
double apply(double q) {
|
||||
return compression * fastAsin(2 * q - 1) / (2 * Math.PI);
|
||||
}
|
||||
};
|
||||
return ScaleFunction.limitCall(f, q, 0, 1);
|
||||
}
|
||||
|
||||
@Override
|
||||
public double k(double q, final double normalizer) {
|
||||
Function f = new Function() {
|
||||
@Override
|
||||
double apply(double q) {
|
||||
return normalizer * fastAsin(2 * q - 1);
|
||||
}
|
||||
};
|
||||
return ScaleFunction.limitCall(f, q, 0, 1);
|
||||
}
|
||||
|
||||
@Override
|
||||
public double q(double k, double compression, double n) {
|
||||
return (Math.sin(k * (2 * Math.PI / compression)) + 1) / 2;
|
||||
}
|
||||
|
||||
@Override
|
||||
public double q(double k, double normalizer) {
|
||||
return (Math.sin(k / normalizer) + 1) / 2;
|
||||
}
|
||||
|
||||
@Override
|
||||
public double max(double q, double compression, double n) {
|
||||
if (q <= 0) {
|
||||
return 0;
|
||||
} else if (q >= 1) {
|
||||
return 0;
|
||||
} else {
|
||||
return 2 * Math.sin(Math.PI / compression) * Math.sqrt(q * (1 - q));
|
||||
}
|
||||
}
|
||||
|
||||
@Override
|
||||
public double max(double q, double normalizer) {
|
||||
if (q <= 0) {
|
||||
return 0;
|
||||
} else if (q >= 1) {
|
||||
return 0;
|
||||
} else {
|
||||
return 2 * Math.sin(0.5 / normalizer) * Math.sqrt(q * (1 - q));
|
||||
}
|
||||
}
|
||||
|
||||
@Override
|
||||
public double normalizer(double compression, double n) {
|
||||
return compression / (2 * Math.PI);
|
||||
}
|
||||
},
|
||||
|
||||
/**
|
||||
* Generates cluster sizes proportional to q*(1-q). This makes tail error bounds tighter than for K_1. The use of a
|
||||
* normalizing function results in a strictly bounded number of clusters no matter how many samples.
|
||||
*/
|
||||
K_2 {
|
||||
@Override
|
||||
public double k(double q, final double compression, final double n) {
|
||||
if (n <= 1) {
|
||||
if (q <= 0) {
|
||||
return -10;
|
||||
} else if (q >= 1) {
|
||||
return 10;
|
||||
} else {
|
||||
return 0;
|
||||
}
|
||||
}
|
||||
Function f = new Function() {
|
||||
@Override
|
||||
double apply(double q) {
|
||||
return compression * Math.log(q / (1 - q)) / Z(compression, n);
|
||||
}
|
||||
};
|
||||
return ScaleFunction.limitCall(f, q, 1e-15, 1 - 1e-15);
|
||||
}
|
||||
|
||||
@Override
|
||||
public double k(double q, final double normalizer) {
|
||||
Function f = new Function() {
|
||||
@Override
|
||||
double apply(double q) {
|
||||
return Math.log(q / (1 - q)) * normalizer;
|
||||
}
|
||||
};
|
||||
return ScaleFunction.limitCall(f, q, 1e-15, 1 - 1e-15);
|
||||
}
|
||||
|
||||
@Override
|
||||
public double q(double k, double compression, double n) {
|
||||
double w = Math.exp(k * Z(compression, n) / compression);
|
||||
return w / (1 + w);
|
||||
}
|
||||
|
||||
@Override
|
||||
public double q(double k, double normalizer) {
|
||||
double w = Math.exp(k / normalizer);
|
||||
return w / (1 + w);
|
||||
}
|
||||
|
||||
@Override
|
||||
public double max(double q, double compression, double n) {
|
||||
return Z(compression, n) * q * (1 - q) / compression;
|
||||
}
|
||||
|
||||
@Override
|
||||
public double max(double q, double normalizer) {
|
||||
return q * (1 - q) / normalizer;
|
||||
}
|
||||
|
||||
@Override
|
||||
public double normalizer(double compression, double n) {
|
||||
return compression / Z(compression, n);
|
||||
}
|
||||
|
||||
private double Z(double compression, double n) {
|
||||
return 4 * Math.log(n / compression) + 24;
|
||||
}
|
||||
},
|
||||
|
||||
/**
|
||||
* Generates cluster sizes proportional to min(q, 1-q). This makes tail error bounds tighter than for K_1 or K_2.
|
||||
* The use of a normalizing function results in a strictly bounded number of clusters no matter how many samples.
|
||||
*/
|
||||
K_3 {
|
||||
@Override
|
||||
public double k(double q, final double compression, final double n) {
|
||||
Function f = new Function() {
|
||||
@Override
|
||||
double apply(double q) {
|
||||
if (q <= 0.5) {
|
||||
return compression * Math.log(2 * q) / Z(compression, n);
|
||||
} else {
|
||||
return -k(1 - q, compression, n);
|
||||
}
|
||||
}
|
||||
};
|
||||
return ScaleFunction.limitCall(f, q, 1e-15, 1 - 1e-15);
|
||||
}
|
||||
|
||||
@Override
|
||||
public double k(double q, final double normalizer) {
|
||||
Function f = new Function() {
|
||||
@Override
|
||||
double apply(double q) {
|
||||
if (q <= 0.5) {
|
||||
return Math.log(2 * q) * normalizer;
|
||||
} else {
|
||||
return -k(1 - q, normalizer);
|
||||
}
|
||||
}
|
||||
};
|
||||
return ScaleFunction.limitCall(f, q, 1e-15, 1 - 1e-15);
|
||||
}
|
||||
|
||||
@Override
|
||||
public double q(double k, double compression, double n) {
|
||||
if (k <= 0) {
|
||||
return Math.exp(k * Z(compression, n) / compression) / 2;
|
||||
} else {
|
||||
return 1 - q(-k, compression, n);
|
||||
}
|
||||
}
|
||||
|
||||
@Override
|
||||
public double q(double k, double normalizer) {
|
||||
if (k <= 0) {
|
||||
return Math.exp(k / normalizer) / 2;
|
||||
} else {
|
||||
return 1 - q(-k, normalizer);
|
||||
}
|
||||
}
|
||||
|
||||
@Override
|
||||
public double max(double q, double compression, double n) {
|
||||
return Z(compression, n) * Math.min(q, 1 - q) / compression;
|
||||
}
|
||||
|
||||
@Override
|
||||
public double max(double q, double normalizer) {
|
||||
return Math.min(q, 1 - q) / normalizer;
|
||||
}
|
||||
|
||||
@Override
|
||||
public double normalizer(double compression, double n) {
|
||||
return compression / Z(compression, n);
|
||||
}
|
||||
|
||||
private double Z(double compression, double n) {
|
||||
return 4 * Math.log(n / compression) + 21;
|
||||
}
|
||||
},
|
||||
|
||||
/**
|
||||
* Generates cluster sizes proportional to q*(1-q). This makes the tail error bounds tighter. This version does not
|
||||
* use a normalizer function and thus the number of clusters increases roughly proportional to log(n). That is good
|
||||
* for accuracy, but bad for size and bad for the statically allocated MergingDigest, but can be useful for
|
||||
* tree-based implementations.
|
||||
*/
|
||||
K_2_NO_NORM {
|
||||
@Override
|
||||
public double k(double q, final double compression, double n) {
|
||||
Function f = new Function() {
|
||||
@Override
|
||||
double apply(double q) {
|
||||
return compression * Math.log(q / (1 - q));
|
||||
}
|
||||
};
|
||||
return ScaleFunction.limitCall(f, q, 1e-15, 1 - 1e-15);
|
||||
}
|
||||
|
||||
@Override
|
||||
public double k(double q, final double normalizer) {
|
||||
Function f = new Function() {
|
||||
@Override
|
||||
double apply(double q) {
|
||||
return normalizer * Math.log(q / (1 - q));
|
||||
}
|
||||
};
|
||||
return ScaleFunction.limitCall(f, q, 1e-15, 1 - 1e-15);
|
||||
}
|
||||
|
||||
@Override
|
||||
public double q(double k, double compression, double n) {
|
||||
double w = Math.exp(k / compression);
|
||||
return w / (1 + w);
|
||||
}
|
||||
|
||||
@Override
|
||||
public double q(double k, double normalizer) {
|
||||
double w = Math.exp(k / normalizer);
|
||||
return w / (1 + w);
|
||||
}
|
||||
|
||||
@Override
|
||||
public double max(double q, double compression, double n) {
|
||||
return q * (1 - q) / compression;
|
||||
}
|
||||
|
||||
@Override
|
||||
public double max(double q, double normalizer) {
|
||||
return q * (1 - q) / normalizer;
|
||||
}
|
||||
|
||||
@Override
|
||||
public double normalizer(double compression, double n) {
|
||||
return compression;
|
||||
}
|
||||
},
|
||||
|
||||
/**
|
||||
* Generates cluster sizes proportional to min(q, 1-q). This makes the tail error bounds tighter. This version does
|
||||
* not use a normalizer function and thus the number of clusters increases roughly proportional to log(n). That is
|
||||
* good for accuracy, but bad for size and bad for the statically allocated MergingDigest, but can be useful for
|
||||
* tree-based implementations.
|
||||
*/
|
||||
K_3_NO_NORM {
|
||||
@Override
|
||||
public double k(double q, final double compression, final double n) {
|
||||
Function f = new Function() {
|
||||
@Override
|
||||
double apply(double q) {
|
||||
if (q <= 0.5) {
|
||||
return compression * Math.log(2 * q);
|
||||
} else {
|
||||
return -k(1 - q, compression, n);
|
||||
}
|
||||
}
|
||||
};
|
||||
return limitCall(f, q, 1e-15, 1 - 1e-15);
|
||||
}
|
||||
|
||||
@Override
|
||||
public double k(double q, final double normalizer) {
|
||||
// poor man's lambda, sigh
|
||||
Function f = new Function() {
|
||||
@Override
|
||||
double apply(double q) {
|
||||
if (q <= 0.5) {
|
||||
return normalizer * Math.log(2 * q);
|
||||
} else {
|
||||
return -k(1 - q, normalizer);
|
||||
}
|
||||
}
|
||||
};
|
||||
return limitCall(f, q, 1e-15, 1 - 1e-15);
|
||||
}
|
||||
|
||||
@Override
|
||||
public double q(double k, double compression, double n) {
|
||||
if (k <= 0) {
|
||||
return Math.exp(k / compression) / 2;
|
||||
} else {
|
||||
return 1 - q(-k, compression, n);
|
||||
}
|
||||
}
|
||||
|
||||
@Override
|
||||
public double q(double k, double normalizer) {
|
||||
if (k <= 0) {
|
||||
return Math.exp(k / normalizer) / 2;
|
||||
} else {
|
||||
return 1 - q(-k, normalizer);
|
||||
}
|
||||
}
|
||||
|
||||
@Override
|
||||
public double max(double q, double compression, double n) {
|
||||
return Math.min(q, 1 - q) / compression;
|
||||
}
|
||||
|
||||
@Override
|
||||
public double max(double q, double normalizer) {
|
||||
return Math.min(q, 1 - q) / normalizer;
|
||||
}
|
||||
|
||||
@Override
|
||||
public double normalizer(double compression, double n) {
|
||||
return compression;
|
||||
}
|
||||
}; // max weight is min(q,1-q), should improve tail accuracy even more
|
||||
|
||||
/**
|
||||
* Converts a quantile to the k-scale. The total number of points is also provided so that a normalizing function
|
||||
* can be computed if necessary.
|
||||
*
|
||||
* @param q The quantile
|
||||
* @param compression Also known as delta in literature on the t-digest
|
||||
* @param n The total number of samples
|
||||
* @return The corresponding value of k
|
||||
*/
|
||||
public abstract double k(double q, double compression, double n);
|
||||
|
||||
/**
|
||||
* Converts a quantile to the k-scale. The normalizer value depends on compression and (possibly) number of points
|
||||
* in the digest. See {@link #normalizer(double, double)}.
|
||||
*
|
||||
* @param q The quantile
|
||||
* @param normalizer The normalizer value which depends on compression and (possibly) number of points in the
|
||||
* digest.
|
||||
* @return The corresponding value of k
|
||||
*/
|
||||
public abstract double k(double q, double normalizer);
|
||||
|
||||
/**
|
||||
* Computes q as a function of k. This is often faster than finding k as a function of q for some scales.
|
||||
*
|
||||
* @param k The index value to convert into q scale.
|
||||
* @param compression The compression factor (often written as δ)
|
||||
* @param n The number of samples already in the digest.
|
||||
* @return The value of q that corresponds to k
|
||||
*/
|
||||
public abstract double q(double k, double compression, double n);
|
||||
|
||||
/**
|
||||
* Computes q as a function of k. This is often faster than finding k as a function of q for some scales.
|
||||
*
|
||||
* @param k The index value to convert into q scale.
|
||||
* @param normalizer The normalizer value which depends on compression and (possibly) number of points in the
|
||||
* digest.
|
||||
* @return The value of q that corresponds to k
|
||||
*/
|
||||
public abstract double q(double k, double normalizer);
|
||||
|
||||
/**
|
||||
* Computes the maximum relative size a cluster can have at quantile q. Note that exactly where within the range
|
||||
* spanned by a cluster that q should be isn't clear. That means that this function usually has to be taken at
|
||||
* multiple points and the smallest value used.
|
||||
* <p>
|
||||
* Note that this is the relative size of a cluster. To get the max number of samples in the cluster, multiply this
|
||||
* value times the total number of samples in the digest.
|
||||
*
|
||||
* @param q The quantile
|
||||
* @param compression The compression factor, typically delta in the literature
|
||||
* @param n The number of samples seen so far in the digest
|
||||
* @return The maximum number of samples that can be in the cluster
|
||||
*/
|
||||
public abstract double max(double q, double compression, double n);
|
||||
|
||||
/**
|
||||
* Computes the maximum relative size a cluster can have at quantile q. Note that exactly where within the range
|
||||
* spanned by a cluster that q should be isn't clear. That means that this function usually has to be taken at
|
||||
* multiple points and the smallest value used.
|
||||
* <p>
|
||||
* Note that this is the relative size of a cluster. To get the max number of samples in the cluster, multiply this
|
||||
* value times the total number of samples in the digest.
|
||||
*
|
||||
* @param q The quantile
|
||||
* @param normalizer The normalizer value which depends on compression and (possibly) number of points in the
|
||||
* digest.
|
||||
* @return The maximum number of samples that can be in the cluster
|
||||
*/
|
||||
public abstract double max(double q, double normalizer);
|
||||
|
||||
/**
|
||||
* Computes the normalizer given compression and number of points.
|
||||
* @param compression The compression parameter for the digest
|
||||
* @param n The number of samples seen so far
|
||||
* @return The normalizing factor for the scale function
|
||||
*/
|
||||
public abstract double normalizer(double compression, double n);
|
||||
|
||||
/**
|
||||
* Approximates asin to within about 1e-6. This approximation works by breaking the range from 0 to 1 into 5 regions.
* For all but the region nearest 1, rational polynomial models give a very good approximation of asin, and by
* interpolating as we move from region to region we can guarantee continuity and we happen to get monotonicity as
* well. For the values near 1, we just use Math.asin as our region "approximation".
|
||||
*
|
||||
* @param x sin(theta)
|
||||
* @return theta
|
||||
*/
|
||||
static double fastAsin(double x) {
|
||||
if (x < 0) {
|
||||
return -fastAsin(-x);
|
||||
} else if (x > 1) {
|
||||
return Double.NaN;
|
||||
} else {
|
||||
// Cutoffs for models. Note that the ranges overlap. In the
|
||||
// overlap we do linear interpolation to guarantee the overall
|
||||
// result is "nice"
|
||||
double c0High = 0.1;
|
||||
double c1High = 0.55;
|
||||
double c2Low = 0.5;
|
||||
double c2High = 0.8;
|
||||
double c3Low = 0.75;
|
||||
double c3High = 0.9;
|
||||
double c4Low = 0.87;
|
||||
if (x > c3High) {
|
||||
return Math.asin(x);
|
||||
} else {
|
||||
// the models
|
||||
double[] m0 = { 0.2955302411, 1.2221903614, 0.1488583743, 0.2422015816, -0.3688700895, 0.0733398445 };
|
||||
double[] m1 = { -0.0430991920, 0.9594035750, -0.0362312299, 0.1204623351, 0.0457029620, -0.0026025285 };
|
||||
double[] m2 = { -0.034873933724, 1.054796752703, -0.194127063385, 0.283963735636, 0.023800124916, -0.000872727381 };
|
||||
double[] m3 = { -0.37588391875, 2.61991859025, -2.48835406886, 1.48605387425, 0.00857627492, -0.00015802871 };
|
||||
|
||||
// the parameters for all of the models
|
||||
double[] vars = { 1, x, x * x, x * x * x, 1 / (1 - x), 1 / (1 - x) / (1 - x) };
|
||||
|
||||
// raw grist for interpolation coefficients
|
||||
double x0 = bound((c0High - x) / c0High);
|
||||
double x1 = bound((c1High - x) / (c1High - c2Low));
|
||||
double x2 = bound((c2High - x) / (c2High - c3Low));
|
||||
double x3 = bound((c3High - x) / (c3High - c4Low));
|
||||
|
||||
// interpolation coefficients
|
||||
// noinspection UnnecessaryLocalVariable
|
||||
double mix0 = x0;
|
||||
double mix1 = (1 - x0) * x1;
|
||||
double mix2 = (1 - x1) * x2;
|
||||
double mix3 = (1 - x2) * x3;
|
||||
double mix4 = 1 - x3;
|
||||
|
||||
// now mix all the results together, avoiding extra evaluations
|
||||
double r = 0;
|
||||
if (mix0 > 0) {
|
||||
r += mix0 * eval(m0, vars);
|
||||
}
|
||||
if (mix1 > 0) {
|
||||
r += mix1 * eval(m1, vars);
|
||||
}
|
||||
if (mix2 > 0) {
|
||||
r += mix2 * eval(m2, vars);
|
||||
}
|
||||
if (mix3 > 0) {
|
||||
r += mix3 * eval(m3, vars);
|
||||
}
|
||||
if (mix4 > 0) {
|
||||
// model 4 is just the real deal
|
||||
r += mix4 * Math.asin(x);
|
||||
}
|
||||
return r;
|
||||
}
|
||||
}
|
||||
}
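// Illustrative check (not part of the original patch): the approximation above is easy to
// compare against Math.asin over its domain, e.g.
//
//   double worst = 0;
//   for (double x = 0; x <= 1; x += 1e-5) {
//       worst = Math.max(worst, Math.abs(fastAsin(x) - Math.asin(x)));
//   }
//   // worst is expected to stay around the 1e-6 level quoted in the javadoc above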
|
||||
|
||||
abstract static class Function {
|
||||
abstract double apply(double x);
|
||||
}
|
||||
|
||||
static double limitCall(Function f, double x, double low, double high) {
|
||||
if (x < low) {
|
||||
return f.apply(low);
|
||||
} else if (x > high) {
|
||||
return f.apply(high);
|
||||
} else {
|
||||
return f.apply(x);
|
||||
}
|
||||
}
|
||||
|
||||
private static double eval(double[] model, double[] vars) {
|
||||
double r = 0;
|
||||
for (int i = 0; i < model.length; i++) {
|
||||
r += model[i] * vars[i];
|
||||
}
|
||||
return r;
|
||||
}
|
||||
|
||||
private static double bound(double v) {
|
||||
if (v <= 0) {
|
||||
return 0;
|
||||
} else if (v >= 1) {
|
||||
return 1;
|
||||
} else {
|
||||
return v;
|
||||
}
|
||||
}
|
||||
}
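A small sketch (illustrative, not part of the patch; the numbers are arbitrary) of how the k/q/max trio above is typically exercised together:

    double compression = 200;
    double n = 10_000;
    double norm = ScaleFunction.K_2.normalizer(compression, n);
    double kMedian = ScaleFunction.K_2.k(0.5, norm);    // 0 by symmetry of K_2
    double kTail = ScaleFunction.K_2.k(0.01, norm);     // strongly negative: tails get fine-grained clusters
    double backToQ = ScaleFunction.K_2.q(kTail, norm);  // recovers roughly 0.01, since q and k are inverses
    double maxShare = ScaleFunction.K_2.max(0.5, norm); // largest relative cluster size allowed at the median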
|
|
@@ -0,0 +1,648 @@
|
|||
/*
|
||||
* Licensed to Elasticsearch B.V. under one or more contributor
|
||||
* license agreements. See the NOTICE file distributed with
|
||||
* this work for additional information regarding copyright
|
||||
* ownership. Elasticsearch B.V. licenses this file to you under
|
||||
* the Apache License, Version 2.0 (the "License"); you may
|
||||
* not use this file except in compliance with the License.
|
||||
* You may obtain a copy of the License at
|
||||
*
|
||||
* http://www.apache.org/licenses/LICENSE-2.0
|
||||
*
|
||||
* Unless required by applicable law or agreed to in writing,
|
||||
* software distributed under the License is distributed on an
|
||||
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||
* KIND, either express or implied. See the License for the
|
||||
* specific language governing permissions and limitations
|
||||
* under the License.
|
||||
*
|
||||
* This project is based on a modification of https://github.com/tdunning/t-digest which is licensed under the Apache 2.0 License.
|
||||
*/
|
||||
|
||||
package org.elasticsearch.tdigest;
|
||||
|
||||
import java.util.Arrays;
|
||||
import java.util.Random;
|
||||
|
||||
/**
|
||||
* Static sorting methods
|
||||
*/
|
||||
public class Sort {
|
||||
private static final Random prng = new Random(); // for choosing pivots during quicksort
|
||||
|
||||
/**
|
||||
* Single-key stabilized quick sort on values, using an index array
|
||||
*
|
||||
* @param order Indexes into values
|
||||
* @param values The values to sort.
|
||||
* @param n The number of values to sort
|
||||
*/
|
||||
public static void stableSort(int[] order, double[] values, int n) {
|
||||
for (int i = 0; i < n; i++) {
|
||||
order[i] = i;
|
||||
}
|
||||
stableQuickSort(order, values, 0, n, 64);
|
||||
stableInsertionSort(order, values, 0, n, 64);
|
||||
}
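// Illustrative example (not part of the original patch):
//
//   double[] values = { 3.0, 1.0, 3.0 };
//   int[] order = new int[3];
//   Sort.stableSort(order, values, 3);
//   // order is now { 1, 0, 2 }: values[order[i]] is non-decreasing and the two equal
//   // 3.0 entries keep their original relative order, which is what "stabilized" means here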
|
||||
|
||||
/**
|
||||
* Two-key quick sort on (values, weights) using an index array
|
||||
*
|
||||
* @param order Indexes into values
|
||||
* @param values The values to sort.
|
||||
* @param weights The secondary sort key
|
||||
* @param n The number of values to sort
|
||||
* @return true if the values were already sorted
|
||||
*/
|
||||
public static boolean sort(int[] order, double[] values, double[] weights, int n) {
|
||||
if (weights == null) {
|
||||
weights = Arrays.copyOf(values, values.length);
|
||||
}
|
||||
boolean r = sort(order, values, weights, 0, n);
|
||||
// now adjust all runs with equal value so that bigger weights are nearer
|
||||
// the median
|
||||
double medianWeight = 0;
|
||||
for (int i = 0; i < n; i++) {
|
||||
medianWeight += weights[i];
|
||||
}
|
||||
medianWeight = medianWeight / 2;
|
||||
int i = 0;
|
||||
double soFar = 0;
|
||||
double nextGroup = 0;
|
||||
while (i < n) {
|
||||
int j = i;
|
||||
while (j < n && values[order[j]] == values[order[i]]) {
|
||||
double w = weights[order[j]];
|
||||
nextGroup += w;
|
||||
j++;
|
||||
}
|
||||
if (j > i + 1) {
|
||||
if (soFar >= medianWeight) {
|
||||
// entire group is in last half, reverse the order
|
||||
reverse(order, i, j - i);
|
||||
} else if (nextGroup > medianWeight) {
|
||||
// group straddles the median, but not necessarily evenly
|
||||
// most elements are probably unit weight if there are many
|
||||
double[] scratch = new double[j - i];
|
||||
|
||||
double netAfter = nextGroup + soFar - 2 * medianWeight;
|
||||
// heuristically adjust weights to roughly balance around median
|
||||
double max = weights[order[j - 1]];
|
||||
for (int k = j - i - 1; k >= 0; k--) {
|
||||
double weight = weights[order[i + k]];
|
||||
if (netAfter < 0) {
|
||||
// sort in normal order
|
||||
scratch[k] = weight;
|
||||
netAfter += weight;
|
||||
} else {
|
||||
// sort reversed, but after normal items
|
||||
scratch[k] = 2 * max + 1 - weight;
|
||||
netAfter -= weight;
|
||||
}
|
||||
}
|
||||
// sort these balanced weights
|
||||
int[] sub = new int[j - i];
|
||||
sort(sub, scratch, scratch, 0, j - i);
|
||||
int[] tmp = Arrays.copyOfRange(order, i, j);
|
||||
for (int k = 0; k < j - i; k++) {
|
||||
order[i + k] = tmp[sub[k]];
|
||||
}
|
||||
}
|
||||
}
|
||||
soFar = nextGroup;
|
||||
i = j;
|
||||
}
|
||||
return r;
|
||||
}
|
||||
|
||||
/**
|
||||
* Two-key quick sort on (values, weights) using an index array
|
||||
*
|
||||
* @param order Indexes into values
|
||||
* @param values The values to sort
|
||||
* @param weights The weights that define the secondary ordering
|
||||
* @param start The first element to sort
|
||||
* @param n The number of values to sort
|
||||
* @return True if the values were in order without sorting
|
||||
*/
|
||||
private static boolean sort(int[] order, double[] values, double[] weights, int start, int n) {
|
||||
boolean inOrder = true;
|
||||
for (int i = start; i < start + n; i++) {
|
||||
if (inOrder && i < start + n - 1) {
|
||||
inOrder = values[i] < values[i + 1] || (values[i] == values[i + 1] && weights[i] <= weights[i + 1]);
|
||||
}
|
||||
order[i] = i;
|
||||
}
|
||||
if (inOrder) {
|
||||
return true;
|
||||
}
|
||||
quickSort(order, values, weights, start, start + n, 64);
|
||||
insertionSort(order, values, weights, start, start + n, 64);
|
||||
return false;
|
||||
}
|
||||
|
||||
/**
|
||||
* Standard two-key quick sort on (values, weights) except that sorting is done on an index array
|
||||
* rather than the values themselves
|
||||
*
|
||||
* @param order The pre-allocated index array
|
||||
* @param values The values to sort
|
||||
* @param weights The weights (secondary key)
|
||||
* @param start The beginning of the values to sort
|
||||
* @param end The value after the last value to sort
|
||||
* @param limit The minimum size to recurse down to.
|
||||
*/
|
||||
private static void quickSort(int[] order, double[] values, double[] weights, int start, int end, int limit) {
|
||||
// the while loop implements tail-recursion to avoid excessive stack calls on nasty cases
|
||||
while (end - start > limit) {
|
||||
|
||||
// pivot by a random element
|
||||
int pivotIndex = start + prng.nextInt(end - start);
|
||||
double pivotValue = values[order[pivotIndex]];
|
||||
double pivotWeight = weights[order[pivotIndex]];
|
||||
|
||||
// move pivot to beginning of array
|
||||
swap(order, start, pivotIndex);
|
||||
|
||||
// we use a three-way partition because having many duplicate values is an important case
|
||||
|
||||
int low = start + 1; // low points to first value not known to be equal to pivotValue
|
||||
int high = end; // high points to first value > pivotValue
|
||||
int i = low; // i scans the array
|
||||
while (i < high) {
|
||||
// invariant: (values,weights)[order[k]] == (pivotValue, pivotWeight) for k in [0..low)
|
||||
// invariant: (values,weights)[order[k]] < (pivotValue, pivotWeight) for k in [low..i)
|
||||
// invariant: (values,weights)[order[k]] > (pivotValue, pivotWeight) for k in [high..end)
|
||||
// in-loop: i < high
|
||||
// in-loop: low < high
|
||||
// in-loop: i >= low
|
||||
double vi = values[order[i]];
|
||||
double wi = weights[order[i]];
|
||||
if (vi == pivotValue && wi == pivotWeight) {
|
||||
if (low != i) {
|
||||
swap(order, low, i);
|
||||
} else {
|
||||
i++;
|
||||
}
|
||||
low++;
|
||||
} else if (vi > pivotValue || (vi == pivotValue && wi > pivotWeight)) {
|
||||
high--;
|
||||
swap(order, i, high);
|
||||
} else {
|
||||
// vi < pivotValue || (vi == pivotValue && wi < pivotWeight)
|
||||
i++;
|
||||
}
|
||||
}
|
||||
// invariant: (values,weights)[order[k]] == (pivotValue, pivotWeight) for k in [0..low)
|
||||
// invariant: (values,weights)[order[k]] < (pivotValue, pivotWeight) for k in [low..i)
|
||||
// invariant: (values,weights)[order[k]] > (pivotValue, pivotWeight) for k in [high..end)
|
||||
// assert i == high || low == high therefore, we are done with partition
|
||||
|
||||
// at this point, i==high, from [start,low) are == pivot, [low,high) are < and [high,end) are >
|
||||
// we have to move the values equal to the pivot into the middle. To do this, we swap pivot
|
||||
// values into the top end of the [low,high) range stopping when we run out of destinations
|
||||
// or when we run out of values to copy
|
||||
int from = start;
|
||||
int to = high - 1;
|
||||
for (i = 0; from < low && to >= low; i++) {
|
||||
swap(order, from++, to--);
|
||||
}
|
||||
if (from == low) {
|
||||
// ran out of things to copy. This means that the last destination is the boundary
|
||||
low = to + 1;
|
||||
} else {
|
||||
// ran out of places to copy to. This means that there are uncopied pivots and the
|
||||
// boundary is at the beginning of those
|
||||
low = from;
|
||||
}
|
||||
|
||||
// checkPartition(order, values, pivotValue, start, low, high, end);
|
||||
|
||||
// now recurse, but arrange it so we handle the longer part by tail recursion
// we have to sort the pivot values because they may have different weights
// we can't do that, however, until we know how much weight is in the left and right
|
||||
if (low - start < end - high) {
|
||||
// left side is smaller
|
||||
quickSort(order, values, weights, start, low, limit);
|
||||
|
||||
// this is really a way to do
|
||||
// quickSort(order, values, weights, high, end, limit);
|
||||
start = high;
|
||||
} else {
|
||||
quickSort(order, values, weights, high, end, limit);
|
||||
// this is really a way to do
|
||||
// quickSort(order, values, weights, start, low, limit);
|
||||
end = low;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Stabilized quick sort on an index array. This is a normal quick sort that uses the
|
||||
* original index as a secondary key. Since we are really just sorting an index array
|
||||
* we can do this nearly for free.
|
||||
*
|
||||
* @param order The pre-allocated index array
|
||||
* @param values The values to sort
|
||||
* @param start The beginning of the values to sort
|
||||
* @param end The value after the last value to sort
|
||||
* @param limit The minimum size to recurse down to.
|
||||
*/
|
||||
private static void stableQuickSort(int[] order, double[] values, int start, int end, int limit) {
|
||||
// the while loop implements tail-recursion to avoid excessive stack calls on nasty cases
|
||||
while (end - start > limit) {
|
||||
|
||||
// pivot by a random element
|
||||
int pivotIndex = start + prng.nextInt(end - start);
|
||||
double pivotValue = values[order[pivotIndex]];
|
||||
int pv = order[pivotIndex];
|
||||
|
||||
// move pivot to beginning of array
|
||||
swap(order, start, pivotIndex);
|
||||
|
||||
// we use a three-way partition because having many duplicate values is an important case
|
||||
|
||||
int low = start + 1; // low points to first value not known to be equal to pivotValue
|
||||
int high = end; // high points to first value > pivotValue
|
||||
int i = low; // i scans the array
|
||||
while (i < high) {
|
||||
// invariant: (values[order[k]],order[k]) == (pivotValue, pv) for k in [0..low)
|
||||
// invariant: (values[order[k]],order[k]) < (pivotValue, pv) for k in [low..i)
|
||||
// invariant: (values[order[k]],order[k]) > (pivotValue, pv) for k in [high..end)
|
||||
// in-loop: i < high
|
||||
// in-loop: low < high
|
||||
// in-loop: i >= low
|
||||
double vi = values[order[i]];
|
||||
int pi = order[i];
|
||||
if (vi == pivotValue && pi == pv) {
|
||||
if (low != i) {
|
||||
swap(order, low, i);
|
||||
} else {
|
||||
i++;
|
||||
}
|
||||
low++;
|
||||
} else if (vi > pivotValue || (vi == pivotValue && pi > pv)) {
|
||||
high--;
|
||||
swap(order, i, high);
|
||||
} else {
|
||||
// vi < pivotValue || (vi == pivotValue && pi < pv)
|
||||
i++;
|
||||
}
|
||||
}
|
||||
// invariant: (values[order[k]],order[k]) == (pivotValue, pv) for k in [0..low)
|
||||
// invariant: (values[order[k]],order[k]) < (pivotValue, pv) for k in [low..i)
|
||||
// invariant: (values[order[k]],order[k]) > (pivotValue, pv) for k in [high..end)
|
||||
// assert i == high || low == high therefore, we are done with partition
|
||||
|
||||
// at this point, i==high, from [start,low) are == pivot, [low,high) are < and [high,end) are >
|
||||
// we have to move the values equal to the pivot into the middle. To do this, we swap pivot
|
||||
// values into the top end of the [low,high) range stopping when we run out of destinations
|
||||
// or when we run out of values to copy
|
||||
int from = start;
|
||||
int to = high - 1;
|
||||
for (i = 0; from < low && to >= low; i++) {
|
||||
swap(order, from++, to--);
|
||||
}
|
||||
if (from == low) {
|
||||
// ran out of things to copy. This means that the last destination is the boundary
|
||||
low = to + 1;
|
||||
} else {
|
||||
// ran out of places to copy to. This means that there are uncopied pivots and the
|
||||
// boundary is at the beginning of those
|
||||
low = from;
|
||||
}
|
||||
|
||||
// checkPartition(order, values, pivotValue, start, low, high, end);
|
||||
|
||||
// now recurse, but arrange it so we handle the longer part by tail recursion
// (here ties are broken by the original index, so there are no weights to rebalance)
|
||||
if (low - start < end - high) {
|
||||
// left side is smaller
|
||||
stableQuickSort(order, values, start, low, limit);
|
||||
|
||||
// this is really a way to do
|
||||
// stableQuickSort(order, values, high, end, limit);
|
||||
start = high;
|
||||
} else {
|
||||
stableQuickSort(order, values, high, end, limit);
|
||||
// this is really a way to do
|
||||
// stableQuickSort(order, values, start, low, limit);
|
||||
end = low;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Quick sort in place of several paired arrays. On return,
|
||||
* keys[...] is in order and the values[] arrays will be
|
||||
* reordered as well in the same way.
|
||||
*
|
||||
* @param key Values to sort on
|
||||
* @param values The auxiliary values to sort.
|
||||
*/
|
||||
public static void sort(double[] key, double[]... values) {
|
||||
sort(key, 0, key.length, values);
|
||||
}
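A corresponding sketch for the parallel-array overload (same assumption that the enclosing class is called Sort): the key array is sorted in place and each auxiliary array is permuted identically.

double[] keys = { 3.0, 1.0, 2.0 };
double[] counts = { 30.0, 10.0, 20.0 };
Sort.sort(keys, counts); // assumed class name
// keys   -> { 1.0, 2.0, 3.0 }
// counts -> { 10.0, 20.0, 30.0 }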
|
||||
|
||||
/**
|
||||
* Quick sort using an index array. On return,
|
||||
* values[order[i]] is in order as i goes start..n
|
||||
* @param key Values to sort on
|
||||
* @param start The first element to sort
|
||||
* @param n The number of values to sort
|
||||
* @param values The auxiliary values to sort.
|
||||
*/
|
||||
public static void sort(double[] key, int start, int n, double[]... values) {
|
||||
quickSort(key, values, start, start + n, 8);
|
||||
insertionSort(key, values, start, start + n, 8);
|
||||
}
|
||||
|
||||
/**
|
||||
* Standard quick sort except that sorting rearranges parallel arrays
|
||||
*
|
||||
* @param key Values to sort on
|
||||
* @param values The auxiliary values to sort.
|
||||
* @param start The beginning of the values to sort
|
||||
* @param end The value after the last value to sort
|
||||
* @param limit The minimum size to recurse down to.
|
||||
*/
|
||||
private static void quickSort(double[] key, double[][] values, int start, int end, int limit) {
|
||||
// the while loop implements tail-recursion to avoid excessive stack calls on nasty cases
|
||||
while (end - start > limit) {
|
||||
|
||||
// median of three values for the pivot
|
||||
int a = start;
|
||||
int b = (start + end) / 2;
|
||||
int c = end - 1;
|
||||
|
||||
int pivotIndex;
|
||||
double pivotValue;
|
||||
double va = key[a];
|
||||
double vb = key[b];
|
||||
double vc = key[c];
|
||||
|
||||
if (va > vb) {
|
||||
if (vc > va) {
|
||||
// vc > va > vb
|
||||
pivotIndex = a;
|
||||
pivotValue = va;
|
||||
} else {
|
||||
// va > vb, va >= vc
|
||||
if (vc < vb) {
|
||||
// va > vb > vc
|
||||
pivotIndex = b;
|
||||
pivotValue = vb;
|
||||
} else {
|
||||
// va >= vc >= vb
|
||||
pivotIndex = c;
|
||||
pivotValue = vc;
|
||||
}
|
||||
}
|
||||
} else {
|
||||
// vb >= va
|
||||
if (vc > vb) {
|
||||
// vc > vb >= va
|
||||
pivotIndex = b;
|
||||
pivotValue = vb;
|
||||
} else {
|
||||
// vb >= va, vb >= vc
|
||||
if (vc < va) {
|
||||
// vb >= va > vc
|
||||
pivotIndex = a;
|
||||
pivotValue = va;
|
||||
} else {
|
||||
// vb >= vc >= va
|
||||
pivotIndex = c;
|
||||
pivotValue = vc;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// move pivot to beginning of array
|
||||
swap(start, pivotIndex, key, values);
|
||||
|
||||
// we use a three-way partition because having many duplicate values is an important case
|
||||
|
||||
int low = start + 1; // low points to first value not known to be equal to pivotValue
|
||||
int high = end; // high points to first value > pivotValue
|
||||
int i = low; // i scans the array
|
||||
while (i < high) {
|
||||
// invariant: key[k] == pivotValue for k in [0..low)
// invariant: key[k] < pivotValue for k in [low..i)
// invariant: key[k] > pivotValue for k in [high..end)
|
||||
// in-loop: i < high
|
||||
// in-loop: low < high
|
||||
// in-loop: i >= low
|
||||
double vi = key[i];
|
||||
if (vi == pivotValue) {
|
||||
if (low != i) {
|
||||
swap(low, i, key, values);
|
||||
} else {
|
||||
i++;
|
||||
}
|
||||
low++;
|
||||
} else if (vi > pivotValue) {
|
||||
high--;
|
||||
swap(i, high, key, values);
|
||||
} else {
|
||||
// vi < pivotValue
|
||||
i++;
|
||||
}
|
||||
}
|
||||
// invariant: key[k] == pivotValue for k in [0..low)
// invariant: key[k] < pivotValue for k in [low..i)
// invariant: key[k] > pivotValue for k in [high..end)
|
||||
// assert i == high || low == high therefore, we are done with partition
|
||||
|
||||
// at this point, i==high, from [start,low) are == pivot, [low,high) are < and [high,end) are >
|
||||
// we have to move the values equal to the pivot into the middle. To do this, we swap pivot
|
||||
// values into the top end of the [low,high) range stopping when we run out of destinations
|
||||
// or when we run out of values to copy
|
||||
int from = start;
|
||||
int to = high - 1;
|
||||
for (i = 0; from < low && to >= low; i++) {
|
||||
swap(from++, to--, key, values);
|
||||
}
|
||||
if (from == low) {
|
||||
// ran out of things to copy. This means that the last destination is the boundary
|
||||
low = to + 1;
|
||||
} else {
|
||||
// ran out of places to copy to. This means that there are uncopied pivots and the
|
||||
// boundary is at the beginning of those
|
||||
low = from;
|
||||
}
|
||||
|
||||
// checkPartition(order, values, pivotValue, start, low, high, end);
|
||||
|
||||
// now recurse, but arrange it so we handle the longer part by tail recursion
|
||||
if (low - start < end - high) {
|
||||
quickSort(key, values, start, low, limit);
|
||||
|
||||
// this is really a way to do
|
||||
// quickSort(key, values, high, end, limit);
|
||||
start = high;
|
||||
} else {
|
||||
quickSort(key, values, high, end, limit);
|
||||
// this is really a way to do
|
||||
// quickSort(key, values, start, low, limit);
|
||||
end = low;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Limited range insertion sort. We assume that no element has to move more than limit steps
|
||||
* because quick sort has done its thing. This version works on parallel arrays of keys and values.
|
||||
*
|
||||
* @param key The array of keys
|
||||
* @param values The values we are sorting
|
||||
* @param start The starting point of the sort
|
||||
* @param end The ending point of the sort
|
||||
* @param limit The largest amount of disorder
|
||||
*/
|
||||
private static void insertionSort(double[] key, double[][] values, int start, int end, int limit) {
|
||||
// loop invariant: all values start ... i-1 are ordered
|
||||
for (int i = start + 1; i < end; i++) {
|
||||
double v = key[i];
|
||||
int m = Math.max(i - limit, start);
|
||||
for (int j = i; j >= m; j--) {
|
||||
if (j == m || key[j - 1] <= v) {
|
||||
if (j < i) {
|
||||
System.arraycopy(key, j, key, j + 1, i - j);
|
||||
key[j] = v;
|
||||
for (double[] value : values) {
|
||||
double tmp = value[i];
|
||||
System.arraycopy(value, j, value, j + 1, i - j);
|
||||
value[j] = tmp;
|
||||
}
|
||||
}
|
||||
break;
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
private static void swap(int[] order, int i, int j) {
|
||||
int t = order[i];
|
||||
order[i] = order[j];
|
||||
order[j] = t;
|
||||
}
|
||||
|
||||
private static void swap(int i, int j, double[] key, double[]... values) {
|
||||
double t = key[i];
|
||||
key[i] = key[j];
|
||||
key[j] = t;
|
||||
|
||||
for (int k = 0; k < values.length; k++) {
|
||||
t = values[k][i];
|
||||
values[k][i] = values[k][j];
|
||||
values[k][j] = t;
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Limited range insertion sort with primary and secondary key. We assume that no
|
||||
* element has to move more than limit steps because quick sort has done its thing.
|
||||
*
|
||||
* If weights (the secondary key) is null, then only the primary key is used.
|
||||
*
|
||||
* This sort is inherently stable.
|
||||
*
|
||||
* @param order The permutation index
|
||||
* @param values The values we are sorting
|
||||
* @param weights The secondary key for sorting
|
||||
* @param start Where to start the sort
|
||||
* @param n How many elements to sort
|
||||
* @param limit The largest amount of disorder
|
||||
*/
|
||||
private static void insertionSort(int[] order, double[] values, double[] weights, int start, int n, int limit) {
|
||||
for (int i = start + 1; i < n; i++) {
|
||||
int t = order[i];
|
||||
double v = values[order[i]];
|
||||
double w = weights == null ? 0 : weights[order[i]];
|
||||
int m = Math.max(i - limit, start);
|
||||
// values in [start, i) are ordered
|
||||
// scan backwards to find where to stick t
|
||||
for (int j = i; j >= m; j--) {
|
||||
if (j == 0 || values[order[j - 1]] < v || (values[order[j - 1]] == v && (weights == null || weights[order[j - 1]] <= w))) {
|
||||
if (j < i) {
|
||||
System.arraycopy(order, j, order, j + 1, i - j);
|
||||
order[j] = t;
|
||||
}
|
||||
break;
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Limited range insertion sort with primary key stabilized by the use of the
|
||||
* original position to break ties. We assume that no element has to move more
|
||||
* than limit steps because quick sort has done its thing.
|
||||
*
|
||||
* @param order The permutation index
|
||||
* @param values The values we are sorting
|
||||
* @param start Where to start the sort
|
||||
* @param n How many elements to sort
|
||||
* @param limit The largest amount of disorder
|
||||
*/
|
||||
private static void stableInsertionSort(int[] order, double[] values, int start, int n, int limit) {
|
||||
for (int i = start + 1; i < n; i++) {
|
||||
int t = order[i];
|
||||
double v = values[order[i]];
|
||||
int vi = order[i];
|
||||
int m = Math.max(i - limit, start);
|
||||
// values in [start, i) are ordered
|
||||
// scan backwards to find where to stick t
|
||||
for (int j = i; j >= m; j--) {
|
||||
if (j == 0 || values[order[j - 1]] < v || (values[order[j - 1]] == v && (order[j - 1] <= vi))) {
|
||||
if (j < i) {
|
||||
System.arraycopy(order, j, order, j + 1, i - j);
|
||||
order[j] = t;
|
||||
}
|
||||
break;
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Reverses an array in-place.
|
||||
*
|
||||
* @param order The array to reverse
|
||||
*/
|
||||
public static void reverse(int[] order) {
|
||||
reverse(order, 0, order.length);
|
||||
}
|
||||
|
||||
/**
|
||||
* Reverses part of an array. See {@link #reverse(int[])}
|
||||
*
|
||||
* @param order The array containing the data to reverse.
|
||||
* @param offset Where to start reversing.
|
||||
* @param length How many elements to reverse
|
||||
*/
|
||||
public static void reverse(int[] order, int offset, int length) {
|
||||
for (int i = 0; i < length / 2; i++) {
|
||||
int t = order[offset + i];
|
||||
order[offset + i] = order[offset + length - i - 1];
|
||||
order[offset + length - i - 1] = t;
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Reverses part of an array. See {@link #reverse(int[])}
|
||||
*
|
||||
* @param order The array containing the data to reverse.
|
||||
* @param offset Where to start reversing.
|
||||
* @param length How many elements to reverse
|
||||
*/
|
||||
public static void reverse(double[] order, int offset, int length) {
|
||||
for (int i = 0; i < length / 2; i++) {
|
||||
double t = order[offset + i];
|
||||
order[offset + i] = order[offset + length - i - 1];
|
||||
order[offset + length - i - 1] = t;
|
||||
}
|
||||
}
|
||||
}
|
|
@@ -0,0 +1,181 @@
|
|||
/*
|
||||
* Licensed to Elasticsearch B.V. under one or more contributor
|
||||
* license agreements. See the NOTICE file distributed with
|
||||
* this work for additional information regarding copyright
|
||||
* ownership. Elasticsearch B.V. licenses this file to you under
|
||||
* the Apache License, Version 2.0 (the "License"); you may
|
||||
* not use this file except in compliance with the License.
|
||||
* You may obtain a copy of the License at
|
||||
*
|
||||
* http://www.apache.org/licenses/LICENSE-2.0
|
||||
*
|
||||
* Unless required by applicable law or agreed to in writing,
|
||||
* software distributed under the License is distributed on an
|
||||
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||
* KIND, either express or implied. See the License for the
|
||||
* specific language governing permissions and limitations
|
||||
* under the License.
|
||||
*
|
||||
* This project is based on a modification of https://github.com/tdunning/t-digest which is licensed under the Apache 2.0 License.
|
||||
*/
|
||||
|
||||
package org.elasticsearch.tdigest;
|
||||
|
||||
import java.util.Collection;
|
||||
import java.util.List;
|
||||
import java.util.Locale;
|
||||
|
||||
/**
* Adaptive histogram based on something like streaming k-means crossed with Q-digest.
* The special characteristics of this algorithm are:
* - smaller summaries than Q-digest
* - works on doubles as well as integers.
* - provides part per million accuracy for extreme quantiles and typically <1000 ppm accuracy for middle quantiles
* - fast
* - simple
* - test coverage roughly at 90%
* - easy to adapt for use with map-reduce
*/
|
||||
public abstract class TDigest {
|
||||
protected ScaleFunction scale = ScaleFunction.K_2;
|
||||
double min = Double.POSITIVE_INFINITY;
|
||||
double max = Double.NEGATIVE_INFINITY;
|
||||
|
||||
/**
|
||||
* Creates an {@link MergingDigest}. This is generally the best known implementation right now.
|
||||
*
|
||||
* @param compression The compression parameter. 100 is a common value for normal uses. 1000 is extremely large.
|
||||
* The number of centroids retained will be a smallish (usually less than 10) multiple of this number.
|
||||
* @return the MergingDigest
|
||||
*/
|
||||
public static TDigest createMergingDigest(double compression) {
|
||||
return new MergingDigest(compression);
|
||||
}
|
||||
|
||||
/**
|
||||
* Creates an AVLTreeDigest. AVLTreeDigest is nearly the best known implementation right now.
|
||||
*
|
||||
* @param compression The compression parameter. 100 is a common value for normal uses. 1000 is extremely large.
|
||||
* The number of centroids retained will be a smallish (usually less than 10) multiple of this number.
|
||||
* @return the AvlTreeDigest
|
||||
*/
|
||||
public static TDigest createAvlTreeDigest(double compression) {
|
||||
return new AVLTreeDigest(compression);
|
||||
}
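A minimal usage sketch, using only methods declared in this class; the expected results match the MedianTests added elsewhere in this PR.

TDigest digest = TDigest.createMergingDigest(100);
for (double x : new double[] { 7, 15, 36, 39, 40, 41 }) {
    digest.add(x);
}
double median = digest.quantile(0.5); // 37.5 for this input, matching MedianTests
double rank = digest.cdf(37.5);       // 0.5, i.e. half of the samples lie below 37.5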
|
||||
|
||||
/**
|
||||
* Adds a sample to a histogram.
|
||||
*
|
||||
* @param x The value to add.
|
||||
* @param w The weight of this point.
|
||||
*/
|
||||
public abstract void add(double x, int w);
|
||||
|
||||
/**
|
||||
* Add a single sample to this TDigest.
|
||||
*
|
||||
* @param x The data value to add
|
||||
*/
|
||||
public final void add(double x) {
|
||||
add(x, 1);
|
||||
}
|
||||
|
||||
final void checkValue(double x) {
|
||||
if (Double.isNaN(x) || Double.isInfinite(x)) {
|
||||
throw new IllegalArgumentException("Invalid value: " + x);
|
||||
}
|
||||
}
|
||||
|
||||
public abstract void add(List<? extends TDigest> others);
|
||||
|
||||
/**
|
||||
* Re-examines a t-digest to determine whether some centroids are redundant. If your data are
|
||||
* perversely ordered, this may be a good idea. Even if not, this may save 20% or so in space.
|
||||
*
|
||||
* The cost is roughly the same as adding as many data points as there are centroids. This
|
||||
* is typically < 10 * compression, but could be as high as 100 * compression.
|
||||
*
|
||||
* This is a destructive operation that is not thread-safe.
|
||||
*/
|
||||
public abstract void compress();
|
||||
|
||||
/**
|
||||
* Returns the number of points that have been added to this TDigest.
|
||||
*
|
||||
* @return The sum of the weights on all centroids.
|
||||
*/
|
||||
public abstract long size();
|
||||
|
||||
/**
|
||||
* Returns the fraction of all points added which are ≤ x. Points
|
||||
* that are exactly equal get half credit (i.e. we use the mid-point
|
||||
* rule)
|
||||
*
|
||||
* @param x The cutoff for the cdf.
|
||||
* @return The fraction of all data which is less than or equal to x.
|
||||
*/
|
||||
public abstract double cdf(double x);
|
||||
|
||||
/**
|
||||
* Returns an estimate of a cutoff such that a specified fraction of the data
|
||||
* added to this TDigest would be less than or equal to the cutoff.
|
||||
*
|
||||
* @param q The desired fraction
|
||||
* @return The smallest value x such that cdf(x) ≥ q
|
||||
*/
|
||||
public abstract double quantile(double q);
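As a concrete reference point for the cdf/quantile contract, the Dist helper exercised by the tests in this PR gives the following for the small data set used in MedianTests; a query value that coincides exactly with a sample would receive half credit under the mid-point rule described above.

double[] data = { 7, 15, 36, 39, 40, 41 };
double q50 = Dist.quantile(0.5, data); // 37.5, the midpoint between the 3rd and 4th samples
double rank = Dist.cdf(37.5, data);    // 0.5, since three of the six samples lie below 37.5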
|
||||
|
||||
/**
|
||||
* A {@link Collection} that lets you go through the centroids in ascending order by mean. Centroids
|
||||
* returned will not be re-used, but may or may not share storage with this TDigest.
|
||||
*
|
||||
* @return The centroids in the form of a Collection.
|
||||
*/
|
||||
public abstract Collection<Centroid> centroids();
|
||||
|
||||
/**
|
||||
* Returns the current compression factor.
|
||||
*
|
||||
* @return The compression factor originally used to set up the TDigest.
|
||||
*/
|
||||
public abstract double compression();
|
||||
|
||||
/**
|
||||
* Returns the number of bytes required to encode this TDigest using #asBytes().
|
||||
*
|
||||
* @return The number of bytes required.
|
||||
*/
|
||||
public abstract int byteSize();
|
||||
|
||||
public void setScaleFunction(ScaleFunction scaleFunction) {
|
||||
if (scaleFunction.toString().endsWith("NO_NORM")) {
|
||||
throw new IllegalArgumentException(String.format(Locale.ROOT, "Can't use %s as scale with %s", scaleFunction, this.getClass()));
|
||||
}
|
||||
this.scale = scaleFunction;
|
||||
}
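A brief, hedged sketch of selecting a non-default scale function; ScaleFunction.K_0 is the scale used by testSingleMultiRange elsewhere in this PR, while any scale whose name ends in NO_NORM is rejected by the check above.

TDigest digest = TDigest.createMergingDigest(100);
digest.setScaleFunction(ScaleFunction.K_0); // accepted; K_0 is the uniform/linear scale in the upstream library
// a scale whose name ends in "NO_NORM" would trigger the IllegalArgumentException above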
|
||||
|
||||
/**
|
||||
* Add all of the centroids of another TDigest to this one.
|
||||
*
|
||||
* @param other The other TDigest
|
||||
*/
|
||||
public abstract void add(TDigest other);
|
||||
|
||||
public abstract int centroidCount();
|
||||
|
||||
public double getMin() {
|
||||
return min;
|
||||
}
|
||||
|
||||
public double getMax() {
|
||||
return max;
|
||||
}
|
||||
|
||||
/**
|
||||
* Override the min and max values for testing purposes
|
||||
*/
|
||||
void setMinMax(double min, double max) {
|
||||
this.min = min;
|
||||
this.max = max;
|
||||
}
|
||||
}
|
|
@@ -0,0 +1,29 @@
|
|||
/*
|
||||
* Licensed to Elasticsearch B.V. under one or more contributor
|
||||
* license agreements. See the NOTICE file distributed with
|
||||
* this work for additional information regarding copyright
|
||||
* ownership. Elasticsearch B.V. licenses this file to you under
|
||||
* the Apache License, Version 2.0 (the "License"); you may
|
||||
* not use this file except in compliance with the License.
|
||||
* You may obtain a copy of the License at
|
||||
*
|
||||
* http://www.apache.org/licenses/LICENSE-2.0
|
||||
*
|
||||
* Unless required by applicable law or agreed to in writing,
|
||||
* software distributed under the License is distributed on an
|
||||
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||
* KIND, either express or implied. See the License for the
|
||||
* specific language governing permissions and limitations
|
||||
* under the License.
|
||||
*/
|
||||
|
||||
/**
* <h2>T-Digest library</h2>
* This package contains a fork of the <a href="https://github.com/tdunning/t-digest">T-Digest</a> library that's used for percentile
* calculations.
*
* Forking the library allows addressing bugs and inaccuracies around both the AVL- and the merging-based implementations, which unblocks
* switching from the former to the latter with substantial performance gains. It also unlocks the use of BigArrays and other
* ES-specific functionality to account for resources used in the digest data structures.
*/

package org.elasticsearch.tdigest;
|
|
@@ -0,0 +1,106 @@
|
|||
/*
|
||||
* Licensed to Elasticsearch B.V. under one or more contributor
|
||||
* license agreements. See the NOTICE file distributed with
|
||||
* this work for additional information regarding copyright
|
||||
* ownership. Elasticsearch B.V. licenses this file to you under
|
||||
* the Apache License, Version 2.0 (the "License"); you may
|
||||
* not use this file except in compliance with the License.
|
||||
* You may obtain a copy of the License at
|
||||
*
|
||||
* http://www.apache.org/licenses/LICENSE-2.0
|
||||
*
|
||||
* Unless required by applicable law or agreed to in writing,
|
||||
* software distributed under the License is distributed on an
|
||||
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||
* KIND, either express or implied. See the License for the
|
||||
* specific language governing permissions and limitations
|
||||
* under the License.
|
||||
*
|
||||
* This project is based on a modification of https://github.com/tdunning/t-digest which is licensed under the Apache 2.0 License.
|
||||
*/
|
||||
|
||||
package org.elasticsearch.tdigest;
|
||||
|
||||
import org.elasticsearch.test.ESTestCase;
|
||||
|
||||
public class AVLGroupTreeTests extends ESTestCase {
|
||||
|
||||
public void testSimpleAdds() {
|
||||
AVLGroupTree x = new AVLGroupTree();
|
||||
assertEquals(IntAVLTree.NIL, x.floor(34));
|
||||
assertEquals(IntAVLTree.NIL, x.first());
|
||||
assertEquals(IntAVLTree.NIL, x.last());
|
||||
assertEquals(0, x.size());
|
||||
assertEquals(0, x.sum());
|
||||
|
||||
x.add(new Centroid(1));
|
||||
assertEquals(1, x.sum());
|
||||
Centroid centroid = new Centroid(2);
|
||||
centroid.add(3, 1);
|
||||
centroid.add(4, 1);
|
||||
x.add(centroid);
|
||||
|
||||
assertEquals(2, x.size());
|
||||
assertEquals(4, x.sum());
|
||||
}
|
||||
|
||||
public void testBalancing() {
|
||||
AVLGroupTree x = new AVLGroupTree();
|
||||
for (int i = 0; i < 101; i++) {
|
||||
x.add(new Centroid(i));
|
||||
}
|
||||
|
||||
assertEquals(101, x.size());
|
||||
assertEquals(101, x.sum());
|
||||
|
||||
x.checkBalance();
|
||||
x.checkAggregates();
|
||||
}
|
||||
|
||||
public void testFloor() {
|
||||
// mostly tested in other tests
|
||||
AVLGroupTree x = new AVLGroupTree();
|
||||
for (int i = 0; i < 101; i++) {
|
||||
x.add(new Centroid(i / 2));
|
||||
}
|
||||
|
||||
assertEquals(IntAVLTree.NIL, x.floor(-30));
|
||||
|
||||
for (Centroid centroid : x) {
|
||||
assertEquals(centroid.mean(), x.mean(x.floor(centroid.mean() + 0.1)), 0);
|
||||
}
|
||||
}
|
||||
|
||||
public void testHeadSum() {
|
||||
AVLGroupTree x = new AVLGroupTree();
|
||||
for (int i = 0; i < 1000; ++i) {
|
||||
x.add(randomDouble(), randomIntBetween(1, 10));
|
||||
}
|
||||
long sum = 0;
|
||||
long last = -1;
|
||||
for (int node = x.first(); node != IntAVLTree.NIL; node = x.next(node)) {
|
||||
assertEquals(sum, x.headSum(node));
|
||||
sum += x.count(node);
|
||||
last = x.count(node);
|
||||
}
|
||||
assertEquals(last, x.count(x.last()));
|
||||
}
|
||||
|
||||
public void testFloorSum() {
|
||||
AVLGroupTree x = new AVLGroupTree();
|
||||
int total = 0;
|
||||
for (int i = 0; i < 1000; ++i) {
|
||||
int count = randomIntBetween(1, 10);
|
||||
x.add(randomDouble(), count);
|
||||
total += count;
|
||||
}
|
||||
assertEquals(IntAVLTree.NIL, x.floorSum(-1));
|
||||
for (long i = 0; i < total + 10; ++i) {
|
||||
final int floorNode = x.floorSum(i);
|
||||
assertTrue(x.headSum(floorNode) <= i);
|
||||
final int next = x.next(floorNode);
|
||||
assertTrue(next == IntAVLTree.NIL || x.headSum(next) > i);
|
||||
}
|
||||
}
|
||||
|
||||
}
|
|
@@ -0,0 +1,33 @@
|
|||
/*
|
||||
* Licensed to Elasticsearch B.V. under one or more contributor
|
||||
* license agreements. See the NOTICE file distributed with
|
||||
* this work for additional information regarding copyright
|
||||
* ownership. Elasticsearch B.V. licenses this file to you under
|
||||
* the Apache License, Version 2.0 (the "License"); you may
|
||||
* not use this file except in compliance with the License.
|
||||
* You may obtain a copy of the License at
|
||||
*
|
||||
* http://www.apache.org/licenses/LICENSE-2.0
|
||||
*
|
||||
* Unless required by applicable law or agreed to in writing,
|
||||
* software distributed under the License is distributed on an
|
||||
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||
* KIND, either express or implied. See the License for the
|
||||
* specific language governing permissions and limitations
|
||||
* under the License.
|
||||
*
|
||||
* This project is based on a modification of https://github.com/tdunning/t-digest which is licensed under the Apache 2.0 License.
|
||||
*/
|
||||
|
||||
package org.elasticsearch.tdigest;
|
||||
|
||||
public class AVLTreeDigestTests extends TDigestTests {
|
||||
|
||||
protected DigestFactory factory(final double compression) {
|
||||
return () -> {
|
||||
AVLTreeDigest digest = new AVLTreeDigest(compression);
|
||||
digest.setRandomSeed(randomLong());
|
||||
return digest;
|
||||
};
|
||||
}
|
||||
}
|
|
@@ -0,0 +1,84 @@
|
|||
/*
|
||||
* Licensed to Elasticsearch B.V. under one or more contributor
|
||||
* license agreements. See the NOTICE file distributed with
|
||||
* this work for additional information regarding copyright
|
||||
* ownership. Elasticsearch B.V. licenses this file to you under
|
||||
* the Apache License, Version 2.0 (the "License"); you may
|
||||
* not use this file except in compliance with the License.
|
||||
* You may obtain a copy of the License at
|
||||
*
|
||||
* http://www.apache.org/licenses/LICENSE-2.0
|
||||
*
|
||||
* Unless required by applicable law or agreed to in writing,
|
||||
* software distributed under the License is distributed on an
|
||||
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||
* KIND, either express or implied. See the License for the
|
||||
* specific language governing permissions and limitations
|
||||
* under the License.
|
||||
*
|
||||
* This project is based on a modification of https://github.com/tdunning/t-digest which is licensed under the Apache 2.0 License.
|
||||
*/
|
||||
|
||||
package org.elasticsearch.tdigest;
|
||||
|
||||
import org.elasticsearch.test.ESTestCase;
|
||||
|
||||
import java.util.ArrayList;
|
||||
import java.util.Collections;
|
||||
import java.util.List;
|
||||
import java.util.Random;
|
||||
|
||||
public class AlternativeMergeTests extends ESTestCase {
|
||||
/**
|
||||
* Computes size using the alternative scaling limit for both an idealized merge and
|
||||
* a MergingDigest.
|
||||
*/
|
||||
public void testMerges() {
|
||||
for (int n : new int[] { 100, 1000, 10000, 100000 }) {
|
||||
for (double compression : new double[] { 50, 100, 200, 400 }) {
|
||||
MergingDigest mergingDigest = new MergingDigest(compression);
|
||||
AVLTreeDigest treeDigest = new AVLTreeDigest(compression);
|
||||
List<Double> data = new ArrayList<>();
|
||||
Random gen = random();
|
||||
for (int i = 0; i < n; i++) {
|
||||
double x = gen.nextDouble();
|
||||
data.add(x);
|
||||
mergingDigest.add(x);
|
||||
treeDigest.add(x);
|
||||
}
|
||||
Collections.sort(data);
|
||||
List<Double> counts = new ArrayList<>();
|
||||
double soFar = 0;
|
||||
double current = 0;
|
||||
for (Double x : data) {
|
||||
double q = (soFar + (current + 1.0) / 2) / n;
|
||||
if (current == 0 || current + 1 < n * Math.PI / compression * Math.sqrt(q * (1 - q))) {
|
||||
current += 1;
|
||||
} else {
|
||||
counts.add(current);
|
||||
soFar += current;
|
||||
current = 1;
|
||||
}
|
||||
}
|
||||
if (current > 0) {
|
||||
counts.add(current);
|
||||
}
|
||||
soFar = 0;
|
||||
for (Double count : counts) {
|
||||
soFar += count;
|
||||
}
|
||||
assertEquals(n, soFar, 0);
|
||||
soFar = 0;
|
||||
for (Centroid c : mergingDigest.centroids()) {
|
||||
soFar += c.count();
|
||||
}
|
||||
assertEquals(n, soFar, 0);
|
||||
soFar = 0;
|
||||
for (Centroid c : treeDigest.centroids()) {
|
||||
soFar += c.count();
|
||||
}
|
||||
assertEquals(n, soFar, 0);
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
|
@@ -0,0 +1,74 @@
|
|||
/*
|
||||
* Licensed to Elasticsearch B.V. under one or more contributor
|
||||
* license agreements. See the NOTICE file distributed with
|
||||
* this work for additional information regarding copyright
|
||||
* ownership. Elasticsearch B.V. licenses this file to you under
|
||||
* the Apache License, Version 2.0 (the "License"); you may
|
||||
* not use this file except in compliance with the License.
|
||||
* You may obtain a copy of the License at
|
||||
*
|
||||
* http://www.apache.org/licenses/LICENSE-2.0
|
||||
*
|
||||
* Unless required by applicable law or agreed to in writing,
|
||||
* software distributed under the License is distributed on an
|
||||
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||
* KIND, either express or implied. See the License for the
|
||||
* specific language governing permissions and limitations
|
||||
* under the License.
|
||||
*
|
||||
* This project is based on a modification of https://github.com/tdunning/t-digest which is licensed under the Apache 2.0 License.
|
||||
*/
|
||||
|
||||
package org.elasticsearch.tdigest;
|
||||
|
||||
import org.elasticsearch.test.ESTestCase;
|
||||
|
||||
public abstract class BigCountTests extends ESTestCase {
|
||||
|
||||
public void testBigMerge() {
|
||||
TDigest digest = createDigest();
|
||||
for (int i = 0; i < 5; i++) {
|
||||
digest.add(getDigest());
|
||||
double actual = digest.quantile(0.5);
|
||||
assertEquals("Count = " + digest.size(), 3000, actual, 0.001);
|
||||
}
|
||||
}
|
||||
|
||||
private TDigest getDigest() {
|
||||
TDigest digest = createDigest();
|
||||
addData(digest);
|
||||
return digest;
|
||||
}
|
||||
|
||||
public TDigest createDigest() {
|
||||
throw new IllegalStateException("Should have over-ridden createDigest");
|
||||
}
|
||||
|
||||
private static void addData(TDigest digest) {
|
||||
double n = 300_000_000 * 5 + 200;
|
||||
|
||||
addFakeCentroids(digest, n, 300_000_000, 10);
|
||||
addFakeCentroids(digest, n, 300_000_000, 200);
|
||||
addFakeCentroids(digest, n, 300_000_000, 3000);
|
||||
addFakeCentroids(digest, n, 300_000_000, 4000);
|
||||
addFakeCentroids(digest, n, 300_000_000, 5000);
|
||||
addFakeCentroids(digest, n, 200, 47883554);
|
||||
|
||||
assertEquals(n, digest.size(), 0);
|
||||
}
|
||||
|
||||
private static void addFakeCentroids(TDigest digest, double n, int points, int x) {
|
||||
long base = digest.size();
|
||||
double q0 = base / n;
|
||||
long added = 0;
|
||||
while (added < points) {
|
||||
double k0 = digest.scale.k(q0, digest.compression(), n);
|
||||
double q1 = digest.scale.q(k0 + 1, digest.compression(), n);
|
||||
q1 = Math.min(q1, (base + points) / n);
|
||||
int m = (int) Math.min(points - added, Math.max(1, Math.rint((q1 - q0) * n)));
|
||||
added += m;
|
||||
digest.add(x, m);
|
||||
q0 = q1;
|
||||
}
|
||||
}
|
||||
}
|
|
@@ -0,0 +1,29 @@
|
|||
/*
|
||||
* Licensed to Elasticsearch B.V. under one or more contributor
|
||||
* license agreements. See the NOTICE file distributed with
|
||||
* this work for additional information regarding copyright
|
||||
* ownership. Elasticsearch B.V. licenses this file to you under
|
||||
* the Apache License, Version 2.0 (the "License"); you may
|
||||
* not use this file except in compliance with the License.
|
||||
* You may obtain a copy of the License at
|
||||
*
|
||||
* http://www.apache.org/licenses/LICENSE-2.0
|
||||
*
|
||||
* Unless required by applicable law or agreed to in writing,
|
||||
* software distributed under the License is distributed on an
|
||||
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||
* KIND, either express or implied. See the License for the
|
||||
* specific language governing permissions and limitations
|
||||
* under the License.
|
||||
*
|
||||
* This project is based on a modification of https://github.com/tdunning/t-digest which is licensed under the Apache 2.0 License.
|
||||
*/
|
||||
|
||||
package org.elasticsearch.tdigest;
|
||||
|
||||
public class BigCountTestsMergingDigestTests extends BigCountTests {
|
||||
@Override
|
||||
public TDigest createDigest() {
|
||||
return new MergingDigest(100);
|
||||
}
|
||||
}
|
|
@@ -0,0 +1,29 @@
|
|||
/*
|
||||
* Licensed to Elasticsearch B.V. under one or more contributor
|
||||
* license agreements. See the NOTICE file distributed with
|
||||
* this work for additional information regarding copyright
|
||||
* ownership. Elasticsearch B.V. licenses this file to you under
|
||||
* the Apache License, Version 2.0 (the "License"); you may
|
||||
* not use this file except in compliance with the License.
|
||||
* You may obtain a copy of the License at
|
||||
*
|
||||
* http://www.apache.org/licenses/LICENSE-2.0
|
||||
*
|
||||
* Unless required by applicable law or agreed to in writing,
|
||||
* software distributed under the License is distributed on an
|
||||
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||
* KIND, either express or implied. See the License for the
|
||||
* specific language governing permissions and limitations
|
||||
* under the License.
|
||||
*
|
||||
* This project is based on a modification of https://github.com/tdunning/t-digest which is licensed under the Apache 2.0 License.
|
||||
*/
|
||||
|
||||
package org.elasticsearch.tdigest;
|
||||
|
||||
public class BigCountTestsTreeDigestTests extends BigCountTests {
|
||||
@Override
|
||||
public TDigest createDigest() {
|
||||
return new AVLTreeDigest(100);
|
||||
}
|
||||
}
|
|
@@ -0,0 +1,138 @@
|
|||
/*
|
||||
* Licensed to Elasticsearch B.V. under one or more contributor
|
||||
* license agreements. See the NOTICE file distributed with
|
||||
* this work for additional information regarding copyright
|
||||
* ownership. Elasticsearch B.V. licenses this file to you under
|
||||
* the Apache License, Version 2.0 (the "License"); you may
|
||||
* not use this file except in compliance with the License.
|
||||
* You may obtain a copy of the License at
|
||||
*
|
||||
* http://www.apache.org/licenses/LICENSE-2.0
|
||||
*
|
||||
* Unless required by applicable law or agreed to in writing,
|
||||
* software distributed under the License is distributed on an
|
||||
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||
* KIND, either express or implied. See the License for the
|
||||
* specific language governing permissions and limitations
|
||||
* under the License.
|
||||
*
|
||||
* This project is based on a modification of https://github.com/tdunning/t-digest which is licensed under the Apache 2.0 License.
|
||||
*/
|
||||
|
||||
package org.elasticsearch.tdigest;
|
||||
|
||||
import org.elasticsearch.test.ESTestCase;
|
||||
|
||||
import java.util.Arrays;
|
||||
|
||||
public class ComparisonTests extends ESTestCase {
|
||||
|
||||
public void testRandomDenseDistribution() {
|
||||
final int SAMPLE_COUNT = 1_000_000;
|
||||
final int COMPRESSION = 100;
|
||||
|
||||
TDigest avlTreeDigest = TDigest.createAvlTreeDigest(COMPRESSION);
|
||||
TDigest mergingDigest = TDigest.createMergingDigest(COMPRESSION);
|
||||
double[] samples = new double[SAMPLE_COUNT];
|
||||
|
||||
var rand = random();
|
||||
for (int i = 0; i < SAMPLE_COUNT; i++) {
|
||||
samples[i] = rand.nextDouble();
|
||||
avlTreeDigest.add(samples[i]);
|
||||
mergingDigest.add(samples[i]);
|
||||
}
|
||||
Arrays.sort(samples);
|
||||
|
||||
for (double percentile : new double[] { 0, 0.01, 0.1, 1, 5, 10, 25, 50, 75, 90, 99, 99.9, 99.99, 100.0 }) {
|
||||
double q = percentile / 100.0;
|
||||
double expected = Dist.quantile(q, samples);
|
||||
double accuracy = percentile > 1 ? Math.abs(expected / 10) : Math.abs(expected);
|
||||
assertEquals(String.valueOf(percentile), expected, avlTreeDigest.quantile(q), accuracy);
|
||||
assertEquals(String.valueOf(percentile), expected, mergingDigest.quantile(q), accuracy);
|
||||
}
|
||||
}
|
||||
|
||||
public void testRandomSparseDistribution() {
|
||||
final int SAMPLE_COUNT = 1_000_000;
|
||||
final int COMPRESSION = 100;
|
||||
|
||||
TDigest avlTreeDigest = TDigest.createAvlTreeDigest(COMPRESSION);
|
||||
TDigest mergingDigest = TDigest.createMergingDigest(COMPRESSION);
|
||||
double[] samples = new double[SAMPLE_COUNT];
|
||||
|
||||
var rand = random();
|
||||
for (int i = 0; i < SAMPLE_COUNT; i++) {
|
||||
samples[i] = rand.nextDouble() * SAMPLE_COUNT * SAMPLE_COUNT + SAMPLE_COUNT;
|
||||
avlTreeDigest.add(samples[i]);
|
||||
mergingDigest.add(samples[i]);
|
||||
}
|
||||
Arrays.sort(samples);
|
||||
|
||||
for (double percentile : new double[] { 0, 0.01, 0.1, 1, 5, 10, 25, 50, 75, 90, 99, 99.9, 99.99, 100.0 }) {
|
||||
double q = percentile / 100.0;
|
||||
double expected = Dist.quantile(q, samples);
|
||||
double accuracy = percentile > 1 ? Math.abs(expected / 10) : Math.abs(expected);
|
||||
assertEquals(String.valueOf(percentile), expected, avlTreeDigest.quantile(q), accuracy);
|
||||
assertEquals(String.valueOf(percentile), expected, mergingDigest.quantile(q), accuracy);
|
||||
}
|
||||
}
|
||||
|
||||
public void testDenseGaussianDistribution() {
|
||||
final int SAMPLE_COUNT = 1_000_000;
|
||||
final int COMPRESSION = 100;
|
||||
|
||||
TDigest avlTreeDigest = TDigest.createAvlTreeDigest(COMPRESSION);
|
||||
TDigest mergingDigest = TDigest.createMergingDigest(COMPRESSION);
|
||||
double[] samples = new double[SAMPLE_COUNT];
|
||||
|
||||
var rand = random();
|
||||
for (int i = 0; i < SAMPLE_COUNT; i++) {
|
||||
samples[i] = rand.nextGaussian();
|
||||
avlTreeDigest.add(samples[i]);
|
||||
mergingDigest.add(samples[i]);
|
||||
}
|
||||
Arrays.sort(samples);
|
||||
|
||||
for (double percentile : new double[] { 0, 0.01, 0.1, 1, 5, 10, 25, 75, 90, 99, 99.9, 99.99, 100.0 }) {
|
||||
double q = percentile / 100.0;
|
||||
double expected = Dist.quantile(q, samples);
|
||||
double accuracy = percentile > 1 ? Math.abs(expected / 10) : Math.abs(expected);
|
||||
assertEquals(String.valueOf(percentile), expected, avlTreeDigest.quantile(q), accuracy);
|
||||
assertEquals(String.valueOf(percentile), expected, mergingDigest.quantile(q), accuracy);
|
||||
}
|
||||
|
||||
double expectedMedian = Dist.quantile(0.5, samples);
|
||||
assertEquals(expectedMedian, avlTreeDigest.quantile(0.5), 0.01);
|
||||
assertEquals(expectedMedian, mergingDigest.quantile(0.5), 0.01);
|
||||
}
|
||||
|
||||
public void testSparseGaussianDistribution() {
|
||||
final int SAMPLE_COUNT = 1_000_000;
|
||||
final int COMPRESSION = 100;
|
||||
|
||||
TDigest avlTreeDigest = TDigest.createAvlTreeDigest(COMPRESSION);
|
||||
TDigest mergingDigest = TDigest.createMergingDigest(COMPRESSION);
|
||||
double[] samples = new double[SAMPLE_COUNT];
|
||||
var rand = random();
|
||||
|
||||
for (int i = 0; i < SAMPLE_COUNT; i++) {
|
||||
samples[i] = rand.nextGaussian() * SAMPLE_COUNT;
|
||||
avlTreeDigest.add(samples[i]);
|
||||
mergingDigest.add(samples[i]);
|
||||
}
|
||||
Arrays.sort(samples);
|
||||
|
||||
for (double percentile : new double[] { 0, 0.01, 0.1, 1, 5, 10, 25, 75, 90, 99, 99.9, 99.99, 100.0 }) {
|
||||
double q = percentile / 100.0;
|
||||
double expected = Dist.quantile(q, samples);
|
||||
double accuracy = percentile > 1 ? Math.abs(expected / 10) : Math.abs(expected);
|
||||
assertEquals(String.valueOf(percentile), expected, avlTreeDigest.quantile(q), accuracy);
|
||||
assertEquals(String.valueOf(percentile), expected, mergingDigest.quantile(q), accuracy);
|
||||
}
|
||||
|
||||
// The absolute value of the median is within [0, 5000], which is deemed close enough to 0 compared to the max value.
|
||||
double expectedMedian = Dist.quantile(0.5, samples);
|
||||
assertEquals(expectedMedian, avlTreeDigest.quantile(0.5), 5000);
|
||||
assertEquals(expectedMedian, mergingDigest.quantile(0.5), 5000);
|
||||
}
|
||||
}
|
|
@@ -0,0 +1,139 @@
|
|||
/*
|
||||
* Licensed to Elasticsearch B.V. under one or more contributor
|
||||
* license agreements. See the NOTICE file distributed with
|
||||
* this work for additional information regarding copyright
|
||||
* ownership. Elasticsearch B.V. licenses this file to you under
|
||||
* the Apache License, Version 2.0 (the "License"); you may
|
||||
* not use this file except in compliance with the License.
|
||||
* You may obtain a copy of the License at
|
||||
*
|
||||
* http://www.apache.org/licenses/LICENSE-2.0
|
||||
*
|
||||
* Unless required by applicable law or agreed to in writing,
|
||||
* software distributed under the License is distributed on an
|
||||
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||
* KIND, either express or implied. See the License for the
|
||||
* specific language governing permissions and limitations
|
||||
* under the License.
|
||||
*
|
||||
* This project is based on a modification of https://github.com/tdunning/t-digest which is licensed under the Apache 2.0 License.
|
||||
*/
|
||||
|
||||
package org.elasticsearch.tdigest;
|
||||
|
||||
import org.elasticsearch.test.ESTestCase;
|
||||
|
||||
import java.util.Arrays;
|
||||
import java.util.Iterator;
|
||||
import java.util.Map;
|
||||
import java.util.Random;
|
||||
import java.util.TreeMap;
|
||||
|
||||
public class IntAVLTreeTests extends ESTestCase {
|
||||
|
||||
static class IntegerBag extends IntAVLTree {
|
||||
|
||||
int value;
|
||||
int[] values;
|
||||
int[] counts;
|
||||
|
||||
IntegerBag() {
|
||||
values = new int[capacity()];
|
||||
counts = new int[capacity()];
|
||||
}
|
||||
|
||||
public boolean addValue(int value) {
|
||||
this.value = value;
|
||||
return super.add();
|
||||
}
|
||||
|
||||
public boolean removeValue(int value) {
|
||||
this.value = value;
|
||||
final int node = find();
|
||||
if (node == NIL) {
|
||||
return false;
|
||||
} else {
|
||||
super.remove(node);
|
||||
return true;
|
||||
}
|
||||
}
|
||||
|
||||
@Override
|
||||
protected void resize(int newCapacity) {
|
||||
super.resize(newCapacity);
|
||||
values = Arrays.copyOf(values, newCapacity);
|
||||
counts = Arrays.copyOf(counts, newCapacity);
|
||||
}
|
||||
|
||||
@Override
|
||||
protected int compare(int node) {
|
||||
return value - values[node];
|
||||
}
|
||||
|
||||
@Override
|
||||
protected void copy(int node) {
|
||||
values[node] = value;
|
||||
counts[node] = 1;
|
||||
}
|
||||
|
||||
@Override
|
||||
protected void merge(int node) {
|
||||
values[node] = value;
|
||||
counts[node]++;
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
public void testDualAdd() {
|
||||
Random r = random();
|
||||
TreeMap<Integer, Integer> map = new TreeMap<>();
|
||||
IntegerBag bag = new IntegerBag();
|
||||
for (int i = 0; i < 100000; ++i) {
|
||||
final int v = r.nextInt(100000);
|
||||
if (map.containsKey(v)) {
|
||||
map.put(v, map.get(v) + 1);
|
||||
assertFalse(bag.addValue(v));
|
||||
} else {
|
||||
map.put(v, 1);
|
||||
assertTrue(bag.addValue(v));
|
||||
}
|
||||
}
|
||||
Iterator<Map.Entry<Integer, Integer>> it = map.entrySet().iterator();
|
||||
for (int node = bag.first(bag.root()); node != IntAVLTree.NIL; node = bag.next(node)) {
|
||||
final Map.Entry<Integer, Integer> next = it.next();
|
||||
assertEquals(next.getKey().intValue(), bag.values[node]);
|
||||
assertEquals(next.getValue().intValue(), bag.counts[node]);
|
||||
}
|
||||
assertFalse(it.hasNext());
|
||||
}
|
||||
|
||||
public void testDualAddRemove() {
|
||||
Random r = random();
|
||||
TreeMap<Integer, Integer> map = new TreeMap<>();
|
||||
IntegerBag bag = new IntegerBag();
|
||||
for (int i = 0; i < 100000; ++i) {
|
||||
final int v = r.nextInt(1000);
|
||||
if (r.nextBoolean()) {
|
||||
// add
|
||||
if (map.containsKey(v)) {
|
||||
map.put(v, map.get(v) + 1);
|
||||
assertFalse(bag.addValue(v));
|
||||
} else {
|
||||
map.put(v, 1);
|
||||
assertTrue(bag.addValue(v));
|
||||
}
|
||||
} else {
|
||||
// remove
|
||||
assertEquals(map.remove(v) != null, bag.removeValue(v));
|
||||
}
|
||||
}
|
||||
Iterator<Map.Entry<Integer, Integer>> it = map.entrySet().iterator();
|
||||
for (int node = bag.first(bag.root()); node != IntAVLTree.NIL; node = bag.next(node)) {
|
||||
final Map.Entry<Integer, Integer> next = it.next();
|
||||
assertEquals(next.getKey().intValue(), bag.values[node]);
|
||||
assertEquals(next.getValue().intValue(), bag.counts[node]);
|
||||
}
|
||||
assertFalse(it.hasNext());
|
||||
}
|
||||
|
||||
}
|
|
@@ -0,0 +1,55 @@
|
|||
/*
|
||||
* Licensed to Elasticsearch B.V. under one or more contributor
|
||||
* license agreements. See the NOTICE file distributed with
|
||||
* this work for additional information regarding copyright
|
||||
* ownership. Elasticsearch B.V. licenses this file to you under
|
||||
* the Apache License, Version 2.0 (the "License"); you may
|
||||
* not use this file except in compliance with the License.
|
||||
* You may obtain a copy of the License at
|
||||
*
|
||||
* http://www.apache.org/licenses/LICENSE-2.0
|
||||
*
|
||||
* Unless required by applicable law or agreed to in writing,
|
||||
* software distributed under the License is distributed on an
|
||||
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||
* KIND, either express or implied. See the License for the
|
||||
* specific language governing permissions and limitations
|
||||
* under the License.
|
||||
*
|
||||
* This project is based on a modification of https://github.com/tdunning/t-digest which is licensed under the Apache 2.0 License.
|
||||
*/
|
||||
|
||||
package org.elasticsearch.tdigest;
|
||||
|
||||
import org.elasticsearch.test.ESTestCase;
|
||||
|
||||
public class MedianTests extends ESTestCase {
|
||||
|
||||
public void testAVL() {
|
||||
double[] data = new double[] { 7, 15, 36, 39, 40, 41 };
|
||||
TDigest digest = new AVLTreeDigest(100);
|
||||
for (double value : data) {
|
||||
digest.add(value);
|
||||
}
|
||||
|
||||
assertEquals(37.5, digest.quantile(0.5), 0);
|
||||
assertEquals(0.5, digest.cdf(37.5), 0);
|
||||
}
|
||||
|
||||
public void testMergingDigest() {
|
||||
double[] data = new double[] { 7, 15, 36, 39, 40, 41 };
|
||||
TDigest digest = new MergingDigest(100);
|
||||
for (double value : data) {
|
||||
digest.add(value);
|
||||
}
|
||||
|
||||
assertEquals(37.5, digest.quantile(0.5), 0);
|
||||
assertEquals(0.5, digest.cdf(37.5), 0);
|
||||
}
|
||||
|
||||
public void testReferenceWikipedia() {
|
||||
double[] data = new double[] { 7, 15, 36, 39, 40, 41 };
|
||||
assertEquals(37.5, Dist.quantile(0.5, data), 0);
|
||||
assertEquals(0.5, Dist.cdf(37.5, data), 0);
|
||||
}
|
||||
}
|
|
@@ -0,0 +1,156 @@
|
|||
/*
|
||||
* Licensed to Elasticsearch B.V. under one or more contributor
|
||||
* license agreements. See the NOTICE file distributed with
|
||||
* this work for additional information regarding copyright
|
||||
* ownership. Elasticsearch B.V. licenses this file to you under
|
||||
* the Apache License, Version 2.0 (the "License"); you may
|
||||
* not use this file except in compliance with the License.
|
||||
* You may obtain a copy of the License at
|
||||
*
|
||||
* http://www.apache.org/licenses/LICENSE-2.0
|
||||
*
|
||||
* Unless required by applicable law or agreed to in writing,
|
||||
* software distributed under the License is distributed on an
|
||||
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||
* KIND, either express or implied. See the License for the
|
||||
* specific language governing permissions and limitations
|
||||
* under the License.
|
||||
*
|
||||
* This project is based on a modification of https://github.com/tdunning/t-digest which is licensed under the Apache 2.0 License.
|
||||
*/
|
||||
|
||||
package org.elasticsearch.tdigest;
|
||||
|
||||
import org.junit.Assert;
|
||||
|
||||
import java.util.ArrayList;
|
||||
import java.util.Collections;
|
||||
import java.util.Iterator;
|
||||
import java.util.List;
|
||||
import java.util.Locale;
|
||||
import java.util.Random;
|
||||
|
||||
public class MergingDigestTests extends TDigestTests {
|
||||
|
||||
protected DigestFactory factory(final double compression) {
|
||||
return () -> new MergingDigest(compression);
|
||||
}
|
||||
|
||||
public void testNanDueToBadInitialization() {
|
||||
int compression = 100;
|
||||
int factor = 5;
|
||||
MergingDigest md = new MergingDigest(compression, (factor + 1) * compression, compression);
|
||||
|
||||
final int M = 10;
|
||||
List<MergingDigest> mds = new ArrayList<>();
|
||||
for (int i = 0; i < M; ++i) {
|
||||
mds.add(new MergingDigest(compression, (factor + 1) * compression, compression));
|
||||
}
|
||||
|
||||
// Fill all digests with values (0,10,20,...,80).
|
||||
List<Double> raw = new ArrayList<>();
|
||||
for (int i = 0; i < 9; ++i) {
|
||||
double x = 10 * i;
|
||||
md.add(x);
|
||||
raw.add(x);
|
||||
for (int j = 0; j < M; ++j) {
|
||||
mds.get(j).add(x);
|
||||
raw.add(x);
|
||||
}
|
||||
}
|
||||
Collections.sort(raw);
|
||||
|
||||
// Merge all mds one at a time into md.
|
||||
for (int i = 0; i < M; ++i) {
|
||||
List<MergingDigest> singleton = new ArrayList<>();
|
||||
singleton.add(mds.get(i));
|
||||
md.add(singleton);
|
||||
}
|
||||
Assert.assertFalse(Double.isNaN(md.quantile(0.01)));
|
||||
|
||||
for (double q : new double[] { 0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.90, 0.95, 0.99 }) {
|
||||
double est = md.quantile(q);
|
||||
double actual = Dist.quantile(q, raw);
|
||||
double qx = md.cdf(actual);
|
||||
Assert.assertEquals(q, qx, 0.5);
|
||||
Assert.assertEquals(est, actual, 3.8);
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Verifies interpolation between a singleton and a larger centroid.
|
||||
*/
|
||||
public void testSingleMultiRange() {
|
||||
TDigest digest = factory(100).create();
|
||||
digest.setScaleFunction(ScaleFunction.K_0);
|
||||
for (int i = 0; i < 100; i++) {
|
||||
digest.add(1);
|
||||
digest.add(2);
|
||||
digest.add(3);
|
||||
}
|
||||
// this check is, of course true, but it also forces merging before we change scale
|
||||
assertTrue(digest.centroidCount() < 300);
|
||||
digest.add(0);
|
||||
// we now have a digest with a singleton first, then a heavier centroid next
|
||||
Iterator<Centroid> ix = digest.centroids().iterator();
|
||||
Centroid first = ix.next();
|
||||
Centroid second = ix.next();
|
||||
assertEquals(1, first.count());
|
||||
assertEquals(0, first.mean(), 0);
|
||||
// assertTrue(second.count() > 1);
|
||||
assertEquals(1.0, second.mean(), 0);
|
||||
|
||||
assertEquals(0.00166, digest.cdf(0), 1e-5);
|
||||
assertEquals(0.00166, digest.cdf(1e-10), 1e-5);
|
||||
assertEquals(0.0025, digest.cdf(0.25), 1e-5);
|
||||
}
|
||||
|
||||
/**
|
||||
* Make sure that the first and last centroids have unit weight
|
||||
*/
|
||||
public void testSingletonsAtEnds() {
|
||||
TDigest d = new MergingDigest(50);
|
||||
Random gen = random();
|
||||
double[] data = new double[100];
|
||||
for (int i = 0; i < data.length; i++) {
|
||||
data[i] = Math.floor(gen.nextGaussian() * 3);
|
||||
}
|
||||
for (int i = 0; i < 100; i++) {
|
||||
for (double x : data) {
|
||||
d.add(x);
|
||||
}
|
||||
}
|
||||
int last = 0;
|
||||
for (Centroid centroid : d.centroids()) {
|
||||
if (last == 0) {
|
||||
assertEquals(1, centroid.count());
|
||||
}
|
||||
last = centroid.count();
|
||||
}
|
||||
assertEquals(1, last);
|
||||
}
|
||||
|
||||
/**
|
||||
* Verify centroid sizes.
|
||||
*/
|
||||
public void testFill() {
|
||||
MergingDigest x = new MergingDigest(300);
|
||||
Random gen = random();
|
||||
ScaleFunction scale = x.getScaleFunction();
|
||||
double compression = x.compression();
|
||||
for (int i = 0; i < 1000000; i++) {
|
||||
x.add(gen.nextGaussian());
|
||||
}
|
||||
double q0 = 0;
|
||||
int i = 0;
|
||||
for (Centroid centroid : x.centroids()) {
|
||||
double q1 = q0 + (double) centroid.count() / x.size();
|
||||
double dk = scale.k(q1, compression, x.size()) - scale.k(q0, compression, x.size());
|
||||
if (centroid.count() > 1) {
|
||||
assertTrue(String.format(Locale.ROOT, "K-size for centroid %d at %.3f is %.3f", i, centroid.mean(), dk), dk <= 1);
|
||||
}
|
||||
q0 = q1;
|
||||
i++;
|
||||
}
|
||||
}
|
||||
}
|
|
@ -0,0 +1,280 @@
|
|||
/*
|
||||
* Licensed to Elasticsearch B.V. under one or more contributor
|
||||
* license agreements. See the NOTICE file distributed with
|
||||
* this work for additional information regarding copyright
|
||||
* ownership. Elasticsearch B.V. licenses this file to you under
|
||||
* the Apache License, Version 2.0 (the "License"); you may
|
||||
* not use this file except in compliance with the License.
|
||||
* You may obtain a copy of the License at
|
||||
*
|
||||
* http://www.apache.org/licenses/LICENSE-2.0
|
||||
*
|
||||
* Unless required by applicable law or agreed to in writing,
|
||||
* software distributed under the License is distributed on an
|
||||
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||
* KIND, either express or implied. See the License for the
|
||||
* specific language governing permissions and limitations
|
||||
* under the License.
|
||||
*
|
||||
* This project is based on a modification of https://github.com/tdunning/t-digest which is licensed under the Apache 2.0 License.
|
||||
*/
|
||||
|
||||
package org.elasticsearch.tdigest;
|
||||
|
||||
import org.elasticsearch.test.ESTestCase;
|
||||
|
||||
import java.util.HashMap;
|
||||
import java.util.Locale;
|
||||
import java.util.Map;
|
||||
|
||||
import static java.lang.Math.abs;
|
||||
import static java.lang.Math.max;
|
||||
|
||||
/**
|
||||
* Validate internal consistency of scale functions.
|
||||
*/
|
||||
public class ScaleFunctionTests extends ESTestCase {
|
||||
|
||||
public void asinApproximation() {
|
||||
for (double x = 0; x < 1; x += 1e-4) {
|
||||
assertEquals(Math.asin(x), ScaleFunction.fastAsin(x), 1e-6);
|
||||
}
|
||||
assertEquals(Math.asin(1), ScaleFunction.fastAsin(1), 0);
|
||||
assertTrue(Double.isNaN(ScaleFunction.fastAsin(1.0001)));
|
||||
}
|
||||
|
||||
/**
|
||||
* Test that the basic single pass greedy t-digest construction has expected behavior with all scale functions.
|
||||
* <p>
|
||||
* This also throws off a diagnostic file that can be visualized if desired under the name of
|
||||
* scale-function-sizes.csv
|
||||
*/
|
||||
public void testSize() {
|
||||
for (double compression : new double[] { 20, 50, 100, 200, 500, 1000, 2000 }) {
|
||||
for (double n : new double[] { 10, 20, 50, 100, 200, 500, 1_000, 10_000, 100_000 }) {
|
||||
Map<String, Integer> clusterCount = new HashMap<>();
|
||||
for (ScaleFunction k : ScaleFunction.values()) {
|
||||
if (k.toString().equals("K_0")) {
|
||||
continue;
|
||||
}
|
||||
double k0 = k.k(0, compression, n);
|
||||
int m = 0;
|
||||
for (int i = 0; i < n;) {
|
||||
double cnt = 1;
|
||||
while (i + cnt < n && k.k((i + cnt + 1) / (n - 1), compression, n) - k0 < 1) {
|
||||
cnt++;
|
||||
}
|
||||
double size = n * max(k.max(i / (n - 1), compression, n), k.max((i + cnt) / (n - 1), compression, n));
|
||||
|
||||
// check that we didn't cross the midline (which makes the size limit very conservative)
|
||||
double left = i - (n - 1) / 2;
|
||||
double right = i + cnt - (n - 1) / 2;
|
||||
boolean sameSide = left * right > 0;
|
||||
if (k.toString().endsWith("NO_NORM") == false && sameSide) {
|
||||
assertTrue(
|
||||
String.format(Locale.ROOT, "%s %.0f %.0f %.3f vs %.3f @ %.3f", k, compression, n, cnt, size, i / (n - 1)),
|
||||
cnt == 1 || cnt <= max(1.1 * size, size + 1)
|
||||
);
|
||||
}
|
||||
i += cnt;
|
||||
k0 = k.k(i / (n - 1), compression, n);
|
||||
m++;
|
||||
}
|
||||
clusterCount.put(k.toString(), m);
|
||||
|
||||
if (k.toString().endsWith("NO_NORM") == false) {
|
||||
assertTrue(
|
||||
String.format(Locale.ROOT, "%s %d, %.0f", k, m, compression),
|
||||
n < 3 * compression || (m >= compression / 3 && m <= compression)
|
||||
);
|
||||
}
|
||||
}
|
||||
// make sure that the approximate version gets same results
|
||||
assertEquals(clusterCount.get("K_1"), clusterCount.get("K_1_FAST"));
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Validates the bounds on the shape of the different scale functions. The basic idea is
|
||||
* that diff difference between minimum and maximum values of k in the region where we
|
||||
* can have centroids with >1 sample should be small enough to meet the size limit of
|
||||
* the digest, but not small enough to degrade accuracy.
|
||||
*/
|
||||
public void testK() {
|
||||
for (ScaleFunction k : ScaleFunction.values()) {
|
||||
if (k.name().contains("NO_NORM")) {
|
||||
continue;
|
||||
}
|
||||
if (k.name().contains("K_0")) {
|
||||
continue;
|
||||
}
|
||||
for (double compression : new double[] { 50, 100, 200, 500, 1000 }) {
|
||||
for (int n : new int[] { 10, 100, 1000, 10000, 100000, 1_000_000, 10_000_000 }) {
|
||||
// first confirm that the shortcut (with norm) and the full version agree
|
||||
double norm = k.normalizer(compression, n);
|
||||
for (double q : new double[] { 0.0001, 0.001, 0.01, 0.1, 0.2, 0.5 }) {
|
||||
if (q * n > 1) {
|
||||
assertEquals(
|
||||
String.format(Locale.ROOT, "%s q: %.4f, compression: %.0f, n: %d", k, q, compression, n),
|
||||
k.k(q, compression, n),
|
||||
k.k(q, norm),
|
||||
1e-10
|
||||
);
|
||||
assertEquals(
|
||||
String.format(Locale.ROOT, "%s q: %.4f, compression: %.0f, n: %d", k, q, compression, n),
|
||||
k.k(1 - q, compression, n),
|
||||
k.k(1 - q, norm),
|
||||
1e-10
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
// now estimate the number of centroids
|
||||
double mink = Double.POSITIVE_INFINITY;
|
||||
double maxk = Double.NEGATIVE_INFINITY;
|
||||
double singletons = 0;
|
||||
while (singletons < n / 2.0) {
|
||||
// could we group more than one sample?
|
||||
double diff2 = k.k((singletons + 2.0) / n, norm) - k.k(singletons / n, norm);
|
||||
if (diff2 < 1) {
|
||||
// yes!
|
||||
double q = singletons / n;
|
||||
mink = Math.min(mink, k.k(q, norm));
|
||||
maxk = Math.max(maxk, k.k(1 - q, norm));
|
||||
break;
|
||||
}
|
||||
singletons++;
|
||||
}
|
||||
// did we consume all the data with singletons?
|
||||
if (Double.isInfinite(mink) || Double.isInfinite(maxk)) {
|
||||
// just make sure of this
|
||||
assertEquals(n, 2 * singletons, 0);
|
||||
mink = 0;
|
||||
maxk = 0;
|
||||
}
|
||||
// estimate number of clusters. The real number would be a bit more than this
|
||||
double diff = maxk - mink + 2 * singletons;
|
||||
|
||||
// mustn't have too many
|
||||
String label = String.format(
|
||||
Locale.ROOT,
|
||||
"max diff: %.3f, scale: %s, compression: %.0f, n: %d",
|
||||
diff,
|
||||
k,
|
||||
compression,
|
||||
n
|
||||
);
|
||||
assertTrue(label, diff <= Math.min(n, compression / 2 + 10));
|
||||
|
||||
// nor too few. This is where issue #151 shows up
|
||||
label = String.format(Locale.ROOT, "min diff: %.3f, scale: %s, compression: %.0f, n: %d", diff, k, compression, n);
|
||||
assertTrue(label, diff >= Math.min(n, compression / 4));
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
public void testNonDecreasing() {
|
||||
for (ScaleFunction scale : ScaleFunction.values()) {
|
||||
for (double compression : new double[] { 20, 50, 100, 200, 500, 1000 }) {
|
||||
for (int n : new int[] { 10, 100, 1000, 10000, 100_000, 1_000_000, 10_000_000 }) {
|
||||
double norm = scale.normalizer(compression, n);
|
||||
double last = Double.NEGATIVE_INFINITY;
|
||||
for (double q = -1; q < 2; q += 0.01) {
|
||||
double k1 = scale.k(q, norm);
|
||||
double k2 = scale.k(q, compression, n);
|
||||
String remark = String.format(
|
||||
Locale.ROOT,
|
||||
"Different ways to compute scale function %s should agree, " + "compression=%.0f, n=%d, q=%.2f",
|
||||
scale,
|
||||
compression,
|
||||
n,
|
||||
q
|
||||
);
|
||||
assertEquals(remark, k1, k2, 1e-10);
|
||||
assertTrue(String.format(Locale.ROOT, "Scale %s function should not decrease", scale), k1 >= last);
|
||||
last = k1;
|
||||
}
|
||||
last = Double.NEGATIVE_INFINITY;
|
||||
for (double k = scale.q(0, norm) - 2; k < scale.q(1, norm) + 2; k += 0.01) {
|
||||
double q1 = scale.q(k, norm);
|
||||
double q2 = scale.q(k, compression, n);
|
||||
String remark = String.format(
|
||||
Locale.ROOT,
|
||||
"Different ways to compute inverse scale function %s should agree, " + "compression=%.0f, n=%d, q=%.2f",
|
||||
scale,
|
||||
compression,
|
||||
n,
|
||||
k
|
||||
);
|
||||
assertEquals(remark, q1, q2, 1e-10);
|
||||
assertTrue(String.format(Locale.ROOT, "Inverse scale %s function should not decrease", scale), q1 >= last);
|
||||
last = q1;
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Validates the fast asin approximation
|
||||
*/
|
||||
public void testApproximation() {
|
||||
double worst = 0;
|
||||
double old = Double.NEGATIVE_INFINITY;
|
||||
for (double x = -1; x < 1; x += 0.00001) {
|
||||
double ex = Math.asin(x);
|
||||
double actual = ScaleFunction.fastAsin(x);
|
||||
double error = ex - actual;
|
||||
// System.out.printf("%.8f, %.8f, %.8f, %.12f\n", x, ex, actual, error * 1e6);
|
||||
assertEquals("Bad approximation", 0, error, 1e-6);
|
||||
assertTrue("Not monotonic", actual >= old);
|
||||
worst = Math.max(worst, Math.abs(error));
|
||||
old = actual;
|
||||
}
|
||||
assertEquals(Math.asin(1), ScaleFunction.fastAsin(1), 0);
|
||||
}
|
||||
|
||||
/**
|
||||
* Validates that the forward and reverse scale functions are as accurate as intended.
|
||||
*/
|
||||
public void testInverseScale() {
|
||||
for (ScaleFunction f : ScaleFunction.values()) {
|
||||
double tolerance = f.toString().contains("FAST") ? 2e-4 : 1e-10;
|
||||
|
||||
for (double n : new double[] { 1000, 3_000, 10_000 }) {
|
||||
double epsilon = 1.0 / n;
|
||||
for (double compression : new double[] { 20, 100, 1000 }) {
|
||||
double oldK = f.k(0, compression, n);
|
||||
for (int i = 1; i < n; i++) {
|
||||
double q = i / n;
|
||||
double k = f.k(q, compression, n);
|
||||
assertTrue(String.format(Locale.ROOT, "monoticity %s(%.0f, %.0f) @ %.5f", f, compression, n, q), k > oldK);
|
||||
oldK = k;
|
||||
|
||||
double qx = f.q(k, compression, n);
|
||||
double kx = f.k(qx, compression, n);
|
||||
assertEquals(String.format(Locale.ROOT, "Q: %s(%.0f, %.0f) @ %.5f", f, compression, n, q), q, qx, 1e-6);
|
||||
double absError = abs(k - kx);
|
||||
double relError = absError / max(0.01, max(abs(k), abs(kx)));
|
||||
String info = String.format(
|
||||
Locale.ROOT,
|
||||
"K: %s(%.0f, %.0f) @ %.5f [%.5g, %.5g]",
|
||||
f,
|
||||
compression,
|
||||
n,
|
||||
q,
|
||||
absError,
|
||||
relError
|
||||
);
|
||||
assertEquals(info, 0, absError, tolerance);
|
||||
assertEquals(info, 0, relError, tolerance);
|
||||
}
|
||||
assertTrue(f.k(0, compression, n) < f.k(epsilon, compression, n));
|
||||
assertTrue(f.k(1, compression, n) > f.k(1 - epsilon, compression, n));
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
|
@ -0,0 +1,421 @@
|
|||
/*
|
||||
* Licensed to Elasticsearch B.V. under one or more contributor
|
||||
* license agreements. See the NOTICE file distributed with
|
||||
* this work for additional information regarding copyright
|
||||
* ownership. Elasticsearch B.V. licenses this file to you under
|
||||
* the Apache License, Version 2.0 (the "License"); you may
|
||||
* not use this file except in compliance with the License.
|
||||
* You may obtain a copy of the License at
|
||||
*
|
||||
* http://www.apache.org/licenses/LICENSE-2.0
|
||||
*
|
||||
* Unless required by applicable law or agreed to in writing,
|
||||
* software distributed under the License is distributed on an
|
||||
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||
* KIND, either express or implied. See the License for the
|
||||
* specific language governing permissions and limitations
|
||||
* under the License.
|
||||
*
|
||||
* This project is based on a modification of https://github.com/tdunning/t-digest which is licensed under the Apache 2.0 License.
|
||||
*/
|
||||
|
||||
package org.elasticsearch.tdigest;
|
||||
|
||||
import org.elasticsearch.test.ESTestCase;
|
||||
|
||||
import java.util.HashMap;
|
||||
import java.util.Locale;
|
||||
import java.util.Map;
|
||||
import java.util.Random;
|
||||
|
||||
public class SortTests extends ESTestCase {
|
||||
|
||||
public void testReverse() {
|
||||
int[] x = new int[0];
|
||||
|
||||
// don't crash with no input
|
||||
Sort.reverse(x);
|
||||
|
||||
// reverse stuff!
|
||||
x = new int[] { 1, 2, 3, 4, 5 };
|
||||
Sort.reverse(x);
|
||||
for (int i = 0; i < 5; i++) {
|
||||
assertEquals(5 - i, x[i]);
|
||||
}
|
||||
|
||||
// reverse some stuff back
|
||||
Sort.reverse(x, 1, 3);
|
||||
assertEquals(5, x[0]);
|
||||
assertEquals(2, x[1]);
|
||||
assertEquals(3, x[2]);
|
||||
assertEquals(4, x[3]);
|
||||
assertEquals(1, x[4]);
|
||||
|
||||
// another no-op
|
||||
Sort.reverse(x, 3, 0);
|
||||
assertEquals(5, x[0]);
|
||||
assertEquals(2, x[1]);
|
||||
assertEquals(3, x[2]);
|
||||
assertEquals(4, x[3]);
|
||||
assertEquals(1, x[4]);
|
||||
|
||||
x = new int[] { 1, 2, 3, 4, 5, 6 };
|
||||
Sort.reverse(x);
|
||||
for (int i = 0; i < 6; i++) {
|
||||
assertEquals(6 - i, x[i]);
|
||||
}
|
||||
}
|
||||
|
||||
public void testEmpty() {
|
||||
Sort.sort(new int[] {}, new double[] {}, null, 0);
|
||||
}
|
||||
|
||||
public void testOne() {
|
||||
int[] order = new int[1];
|
||||
Sort.sort(order, new double[] { 1 }, new double[] { 1 }, 1);
|
||||
assertEquals(0, order[0]);
|
||||
}
|
||||
|
||||
public void testIdentical() {
|
||||
int[] order = new int[6];
|
||||
double[] values = new double[6];
|
||||
|
||||
Sort.sort(order, values, null, values.length);
|
||||
checkOrder(order, values);
|
||||
}
|
||||
|
||||
public void testRepeated() {
|
||||
int n = 50;
|
||||
int[] order = new int[n];
|
||||
double[] values = new double[n];
|
||||
for (int i = 0; i < n; i++) {
|
||||
values[i] = Math.rint(10 * ((double) i / n)) / 10.0;
|
||||
}
|
||||
|
||||
Sort.sort(order, values, null, values.length);
|
||||
checkOrder(order, values);
|
||||
}
|
||||
|
||||
public void testRepeatedSortByWeight() {
|
||||
// this needs to be long enough to force coverage of both quicksort and insertion sort
|
||||
// (i.e. >64)
|
||||
int n = 125;
|
||||
int[] order = new int[n];
|
||||
double[] values = new double[n];
|
||||
double[] weights = new double[n];
|
||||
double totalWeight = 0;
|
||||
|
||||
// generate evenly distributed values and weights
|
||||
for (int i = 0; i < n; i++) {
|
||||
int k = ((i + 5) * 37) % n;
|
||||
values[i] = Math.floor(k / 25.0);
|
||||
weights[i] = (k % 25) + 1;
|
||||
totalWeight += weights[i];
|
||||
}
|
||||
|
||||
// verify: test weights should be evenly distributed
|
||||
double[] tmp = new double[5];
|
||||
for (int i = 0; i < n; i++) {
|
||||
tmp[(int) values[i]] += weights[i];
|
||||
}
|
||||
for (double v : tmp) {
|
||||
assertEquals(totalWeight / tmp.length, v, 0);
|
||||
}
|
||||
|
||||
// now sort ...
|
||||
Sort.sort(order, values, weights, n);
|
||||
|
||||
// and verify our somewhat unusual ordering of the result
|
||||
// within the first two quintiles, value is constant, weights increase within each quintile
|
||||
int delta = order.length / 5;
|
||||
double sum = checkSubOrder(0.0, order, values, weights, 0, delta, 1);
|
||||
assertEquals(totalWeight * 0.2, sum, 0);
|
||||
sum = checkSubOrder(sum, order, values, weights, delta, 2 * delta, 1);
|
||||
assertEquals(totalWeight * 0.4, sum, 0);
|
||||
|
||||
// in the middle quintile, weights go up and then down after the median
|
||||
sum = checkMidOrder(totalWeight / 2, sum, order, values, weights, 2 * delta, 3 * delta);
|
||||
assertEquals(totalWeight * 0.6, sum, 0);
|
||||
|
||||
// in the last two quintiles, weights decrease
|
||||
sum = checkSubOrder(sum, order, values, weights, 3 * delta, 4 * delta, -1);
|
||||
assertEquals(totalWeight * 0.8, sum, 0);
|
||||
sum = checkSubOrder(sum, order, values, weights, 4 * delta, 5 * delta, -1);
|
||||
assertEquals(totalWeight, sum, 0);
|
||||
}
|
||||
|
||||
public void testStableSort() {
|
||||
// this needs to be long enough to force coverage of both quicksort and insertion sort
|
||||
// (i.e. >64)
|
||||
int n = 70;
|
||||
int z = 10;
|
||||
int[] order = new int[n];
|
||||
double[] values = new double[n];
|
||||
double[] weights = new double[n];
|
||||
double totalWeight = 0;
|
||||
|
||||
// generate evenly distributed values and weights
|
||||
for (int i = 0; i < n; i++) {
|
||||
int k = ((i + 5) * 37) % n;
|
||||
values[i] = Math.floor(k / (double) z);
|
||||
weights[i] = (k % z) + 1;
|
||||
totalWeight += weights[i];
|
||||
}
|
||||
|
||||
// verify: test weights should be evenly distributed
|
||||
double[] tmp = new double[n / z];
|
||||
for (int i = 0; i < n; i++) {
|
||||
tmp[(int) values[i]] += weights[i];
|
||||
}
|
||||
for (double v : tmp) {
|
||||
assertEquals(totalWeight / tmp.length, v, 0);
|
||||
}
|
||||
|
||||
// now sort ...
|
||||
Sort.stableSort(order, values, n);
|
||||
|
||||
// and verify stability of the ordering
|
||||
// values must be in order and they must appear in their original ordering
|
||||
double last = -1;
|
||||
for (int j : order) {
|
||||
double m = values[j] * n + j;
|
||||
assertTrue(m > last);
|
||||
last = m;
|
||||
}
|
||||
}
|
||||
|
||||
private double checkMidOrder(double medianWeight, double sofar, int[] order, double[] values, double[] weights, int start, int end) {
|
||||
double value = values[order[start]];
|
||||
double last = 0;
|
||||
assertTrue(sofar < medianWeight);
|
||||
for (int i = start; i < end; i++) {
|
||||
assertEquals(value, values[order[i]], 0);
|
||||
double w = weights[order[i]];
|
||||
assertTrue(w > 0);
|
||||
if (sofar > medianWeight) {
|
||||
w = 2 * medianWeight - w;
|
||||
}
|
||||
assertTrue(w >= last);
|
||||
sofar += weights[order[i]];
|
||||
}
|
||||
assertTrue(sofar > medianWeight);
|
||||
return sofar;
|
||||
}
|
||||
|
||||
private double checkSubOrder(double sofar, int[] order, double[] values, double[] weights, int start, int end, int ordering) {
|
||||
double lastWeight = weights[order[start]] * ordering;
|
||||
double value = values[order[start]];
|
||||
for (int i = start; i < end; i++) {
|
||||
assertEquals(value, values[order[i]], 0);
|
||||
double newOrderedWeight = weights[order[i]] * ordering;
|
||||
assertTrue(newOrderedWeight >= lastWeight);
|
||||
lastWeight = newOrderedWeight;
|
||||
sofar += weights[order[i]];
|
||||
}
|
||||
return sofar;
|
||||
}
|
||||
|
||||
public void testShort() {
|
||||
int[] order = new int[6];
|
||||
double[] values = new double[6];
|
||||
|
||||
// all duplicates
|
||||
for (int i = 0; i < 6; i++) {
|
||||
values[i] = 1;
|
||||
}
|
||||
|
||||
Sort.sort(order, values, null, values.length);
|
||||
checkOrder(order, values);
|
||||
|
||||
values[0] = 0.8;
|
||||
values[1] = 0.3;
|
||||
|
||||
Sort.sort(order, values, null, values.length);
|
||||
checkOrder(order, values);
|
||||
|
||||
values[5] = 1.5;
|
||||
values[4] = 1.2;
|
||||
|
||||
Sort.sort(order, values, null, values.length);
|
||||
checkOrder(order, values);
|
||||
}
|
||||
|
||||
public void testLonger() {
|
||||
int[] order = new int[20];
|
||||
double[] values = new double[20];
|
||||
for (int i = 0; i < 20; i++) {
|
||||
values[i] = (i * 13) % 20;
|
||||
}
|
||||
Sort.sort(order, values, null, values.length);
|
||||
checkOrder(order, values);
|
||||
}
|
||||
|
||||
public void testMultiPivots() {
|
||||
// more pivots than low split on first pass
|
||||
// multiple pivots, but more low data on second part of recursion
|
||||
int[] order = new int[30];
|
||||
double[] values = new double[30];
|
||||
for (int i = 0; i < 9; i++) {
|
||||
values[i] = i + 20 * (i % 2);
|
||||
}
|
||||
|
||||
for (int i = 9; i < 20; i++) {
|
||||
values[i] = 10;
|
||||
}
|
||||
|
||||
for (int i = 20; i < 30; i++) {
|
||||
values[i] = i - 20 * (i % 2);
|
||||
}
|
||||
values[29] = 29;
|
||||
values[24] = 25;
|
||||
values[26] = 25;
|
||||
|
||||
Sort.sort(order, values, null, values.length);
|
||||
checkOrder(order, values);
|
||||
}
|
||||
|
||||
public void testMultiPivotsInPlace() {
|
||||
// more pivots than low split on first pass
|
||||
// multiple pivots, but more low data on second part of recursion
|
||||
double[] keys = new double[30];
|
||||
for (int i = 0; i < 9; i++) {
|
||||
keys[i] = i + 20 * (i % 2);
|
||||
}
|
||||
|
||||
for (int i = 9; i < 20; i++) {
|
||||
keys[i] = 10;
|
||||
}
|
||||
|
||||
for (int i = 20; i < 30; i++) {
|
||||
keys[i] = i - 20 * (i % 2);
|
||||
}
|
||||
keys[29] = 29;
|
||||
keys[24] = 25;
|
||||
keys[26] = 25;
|
||||
|
||||
double[] v = valuesFromKeys(keys, 0);
|
||||
|
||||
Sort.sort(keys, v);
|
||||
checkOrder(keys, 0, keys.length, v);
|
||||
}
|
||||
|
||||
public void testRandomized() {
|
||||
Random rand = random();
|
||||
|
||||
for (int k = 0; k < 100; k++) {
|
||||
int[] order = new int[30];
|
||||
double[] values = new double[30];
|
||||
for (int i = 0; i < 30; i++) {
|
||||
values[i] = rand.nextDouble();
|
||||
}
|
||||
|
||||
Sort.sort(order, values, null, values.length);
|
||||
checkOrder(order, values);
|
||||
}
|
||||
}
|
||||
|
||||
public void testRandomizedShortSort() {
|
||||
Random rand = random();
|
||||
|
||||
for (int k = 0; k < 100; k++) {
|
||||
double[] keys = new double[30];
|
||||
for (int i = 0; i < 10; i++) {
|
||||
keys[i] = i;
|
||||
}
|
||||
for (int i = 10; i < 20; i++) {
|
||||
keys[i] = rand.nextDouble();
|
||||
}
|
||||
for (int i = 20; i < 30; i++) {
|
||||
keys[i] = i;
|
||||
}
|
||||
double[] v0 = valuesFromKeys(keys, 0);
|
||||
double[] v1 = valuesFromKeys(keys, 1);
|
||||
|
||||
Sort.sort(keys, 10, 10, v0, v1);
|
||||
checkOrder(keys, 10, 10, v0, v1);
|
||||
checkValues(keys, 0, keys.length, v0, v1);
|
||||
for (int i = 0; i < 10; i++) {
|
||||
assertEquals(i, keys[i], 0);
|
||||
}
|
||||
for (int i = 20; i < 30; i++) {
|
||||
assertEquals(i, keys[i], 0);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Generates a vector of values corresponding to a vector of keys.
|
||||
*
|
||||
* @param keys A vector of keys
|
||||
* @param k Which value vector to generate
|
||||
* @return The new vector containing frac(key_i * 3 * 5^k)
|
||||
*/
|
||||
private double[] valuesFromKeys(double[] keys, int k) {
|
||||
double[] r = new double[keys.length];
|
||||
double scale = 3;
|
||||
for (int i = 0; i < k; i++) {
|
||||
scale = scale * 5;
|
||||
}
|
||||
for (int i = 0; i < keys.length; i++) {
|
||||
r[i] = fractionalPart(keys[i] * scale);
|
||||
}
|
||||
return r;
|
||||
}
|
||||
|
||||
/**
|
||||
* Verifies that keys are in order and that each value corresponds to the keys
|
||||
*
|
||||
* @param key Array of keys
|
||||
* @param start The starting offset of keys and values to check
|
||||
* @param length The number of keys and values to check
|
||||
* @param values Arrays of associated values. Value_{ki} = frac(key_i * 3 * 5^k)
|
||||
*/
|
||||
private void checkOrder(double[] key, int start, int length, double[]... values) {
|
||||
assert start + length <= key.length;
|
||||
|
||||
for (int i = start; i < start + length - 1; i++) {
|
||||
assertTrue(String.format(Locale.ROOT, "bad ordering at %d, %f > %f", i, key[i], key[i + 1]), key[i] <= key[i + 1]);
|
||||
}
|
||||
|
||||
checkValues(key, start, length, values);
|
||||
}
|
||||
|
||||
private void checkValues(double[] key, int start, int length, double[]... values) {
|
||||
double scale = 3;
|
||||
for (int k = 0; k < values.length; k++) {
|
||||
double[] v = values[k];
|
||||
assertEquals(key.length, v.length);
|
||||
for (int i = start; i < length; i++) {
|
||||
assertEquals(
|
||||
String.format(Locale.ROOT, "value %d not correlated, key=%.5f, k=%d, v=%.5f", i, key[i], k, values[k][i]),
|
||||
fractionalPart(key[i] * scale),
|
||||
values[k][i],
|
||||
0
|
||||
);
|
||||
}
|
||||
scale = scale * 5;
|
||||
}
|
||||
}
|
||||
|
||||
private double fractionalPart(double v) {
|
||||
return v - Math.floor(v);
|
||||
}
|
||||
|
||||
private void checkOrder(int[] order, double[] values) {
|
||||
double previous = -Double.MAX_VALUE;
|
||||
Map<Integer, Integer> counts = new HashMap<Integer, Integer>();
|
||||
for (int i = 0; i < values.length; i++) {
|
||||
counts.put(i, counts.getOrDefault(i, 0) + 1);
|
||||
double v = values[order[i]];
|
||||
if (v < previous) {
|
||||
throw new IllegalArgumentException("Values out of order");
|
||||
}
|
||||
previous = v;
|
||||
}
|
||||
|
||||
assertEquals(order.length, counts.size());
|
||||
for (var entry : counts.entrySet()) {
|
||||
assertEquals(1, entry.getValue().intValue());
|
||||
}
|
||||
}
|
||||
}
|
|
@ -0,0 +1,546 @@
|
|||
/*
|
||||
* Licensed to Elasticsearch B.V. under one or more contributor
|
||||
* license agreements. See the NOTICE file distributed with
|
||||
* this work for additional information regarding copyright
|
||||
* ownership. Elasticsearch B.V. licenses this file to you under
|
||||
* the Apache License, Version 2.0 (the "License"); you may
|
||||
* not use this file except in compliance with the License.
|
||||
* You may obtain a copy of the License at
|
||||
*
|
||||
* http://www.apache.org/licenses/LICENSE-2.0
|
||||
*
|
||||
* Unless required by applicable law or agreed to in writing,
|
||||
* software distributed under the License is distributed on an
|
||||
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||
* KIND, either express or implied. See the License for the
|
||||
* specific language governing permissions and limitations
|
||||
* under the License.
|
||||
*
|
||||
* This project is based on a modification of https://github.com/tdunning/t-digest which is licensed under the Apache 2.0 License.
|
||||
*/
|
||||
|
||||
package org.elasticsearch.tdigest;
|
||||
|
||||
import org.elasticsearch.test.ESTestCase;
|
||||
|
||||
import java.util.ArrayList;
|
||||
import java.util.Collections;
|
||||
import java.util.List;
|
||||
import java.util.Locale;
|
||||
import java.util.Random;
|
||||
|
||||
/**
|
||||
* Base test case for TDigests, just extend this class and implement the abstract methods.
|
||||
*/
|
||||
public abstract class TDigestTests extends ESTestCase {
|
||||
|
||||
public interface DigestFactory {
|
||||
TDigest create();
|
||||
}
|
||||
|
||||
protected abstract DigestFactory factory(double compression);
|
||||
|
||||
private DigestFactory factory() {
|
||||
return factory(100);
|
||||
}
|
||||
|
||||
public void testBigJump() {
|
||||
TDigest digest = factory().create();
|
||||
for (int i = 1; i < 20; i++) {
|
||||
digest.add(i);
|
||||
}
|
||||
digest.add(1_000_000);
|
||||
|
||||
assertEquals(10.5, digest.quantile(0.50), 1e-5);
|
||||
assertEquals(16.5, digest.quantile(0.80), 1e-5);
|
||||
assertEquals(18.5, digest.quantile(0.90), 1e-5);
|
||||
assertEquals(500_000, digest.quantile(0.95), 10);
|
||||
assertEquals(1_000_000, digest.quantile(0.98), 100);
|
||||
assertEquals(1_000_000, digest.quantile(1.00), 0);
|
||||
|
||||
assertEquals(0.9, digest.cdf(19), 0.05);
|
||||
assertEquals(0.95, digest.cdf(500_000), 1e-5);
|
||||
assertEquals(0.975, digest.cdf(1_000_000), 1e-5);
|
||||
|
||||
digest = factory(80).create();
|
||||
digest.setScaleFunction(ScaleFunction.K_0);
|
||||
|
||||
for (int j = 0; j < 100; j++) {
|
||||
for (int i = 1; i < 20; i++) {
|
||||
digest.add(i);
|
||||
}
|
||||
digest.add(1_000_000);
|
||||
}
|
||||
assertEquals(18.0, digest.quantile(0.885), 0.15);
|
||||
assertEquals(19.0, digest.quantile(0.915), 0.1);
|
||||
assertEquals(19.0, digest.quantile(0.935), 0.1);
|
||||
assertEquals(1_000_000.0, digest.quantile(0.965), 0.1);
|
||||
}
|
||||
|
||||
public void testSmallCountQuantile() {
|
||||
List<Double> data = List.of(15.0, 20.0, 32.0, 60.0);
|
||||
TDigest td = factory(200).create();
|
||||
for (Double datum : data) {
|
||||
td.add(datum);
|
||||
}
|
||||
assertEquals(15.0, td.quantile(0.00), 1e-5);
|
||||
assertEquals(15.0, td.quantile(0.10), 1e-5);
|
||||
assertEquals(17.5, td.quantile(0.25), 1e-5);
|
||||
assertEquals(26.0, td.quantile(0.50), 1e-5);
|
||||
assertEquals(46.0, td.quantile(0.75), 1e-5);
|
||||
assertEquals(60.0, td.quantile(0.90), 1e-5);
|
||||
assertEquals(60.0, td.quantile(1.00), 1e-5);
|
||||
}
|
||||
|
||||
public void testExplicitSkewedData() {
|
||||
double[] data = new double[] {
|
||||
245,
|
||||
246,
|
||||
247.249,
|
||||
240,
|
||||
243,
|
||||
248,
|
||||
250,
|
||||
241,
|
||||
244,
|
||||
245,
|
||||
245,
|
||||
247,
|
||||
243,
|
||||
242,
|
||||
241,
|
||||
50100,
|
||||
51246,
|
||||
52247,
|
||||
52249,
|
||||
51240,
|
||||
53243,
|
||||
59248,
|
||||
59250,
|
||||
57241,
|
||||
56244,
|
||||
55245,
|
||||
56245,
|
||||
575247,
|
||||
58243,
|
||||
51242,
|
||||
54241 };
|
||||
|
||||
TDigest digest = factory().create();
|
||||
for (double x : data) {
|
||||
digest.add(x);
|
||||
}
|
||||
|
||||
assertEquals(Dist.quantile(0.5, data), digest.quantile(0.5), 0);
|
||||
}
|
||||
|
||||
public void testQuantile() {
|
||||
double[] samples = new double[] { 1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0, 3.0, 4.0, 5.0, 6.0, 7.0 };
|
||||
|
||||
TDigest hist1 = factory().create();
|
||||
List<Double> data = new ArrayList<>();
|
||||
|
||||
for (int j = 0; j < 100; j++) {
|
||||
for (double x : samples) {
|
||||
data.add(x);
|
||||
hist1.add(x);
|
||||
}
|
||||
}
|
||||
TDigest hist2 = factory().create();
|
||||
hist1.compress();
|
||||
hist2.add(hist1);
|
||||
Collections.sort(data);
|
||||
hist2.compress();
|
||||
double x1 = hist1.quantile(0.5);
|
||||
double x2 = hist2.quantile(0.5);
|
||||
assertEquals(Dist.quantile(0.5, data), x1, 0.2);
|
||||
assertEquals(x1, x2, 0.01);
|
||||
}
|
||||
|
||||
/**
|
||||
* Brute force test that cdf and quantile give reference behavior in digest made up of all singletons.
|
||||
*/
|
||||
public void testSingletonQuantiles() {
|
||||
double[] data = new double[11];
|
||||
TDigest digest = factory().create();
|
||||
for (int i = 0; i < data.length; i++) {
|
||||
digest.add(i);
|
||||
data[i] = i;
|
||||
}
|
||||
|
||||
for (double x = digest.getMin() - 0.1; x <= digest.getMax() + 0.1; x += 1e-3) {
|
||||
assertEquals(String.valueOf(x), Dist.cdf(x, data), digest.cdf(x), 0.1);
|
||||
}
|
||||
|
||||
for (int i = 0; i <= 1000; i++) {
|
||||
double q = 0.001 * i;
|
||||
double dist = Dist.quantile(q, data);
|
||||
double td = digest.quantile(q);
|
||||
assertEquals(String.valueOf(q), dist, td, 0.5);
|
||||
}
|
||||
}
|
||||
|
||||
public void testCentroidsWithIncreasingWeights() {
|
||||
ArrayList<Double> data = new ArrayList<>();
|
||||
TDigest digest = factory().create();
|
||||
for (int i = 1; i <= 10; i++) {
|
||||
digest.add(i, i);
|
||||
for (int j = 0; j < i; j++) {
|
||||
data.add((double) i);
|
||||
}
|
||||
}
|
||||
|
||||
for (double x = digest.getMin() - 0.1; x <= digest.getMax() + 0.1; x += 1e-3) {
|
||||
assertEquals(String.valueOf(x), Dist.cdf(x, data), digest.cdf(x), 0.5);
|
||||
}
|
||||
|
||||
for (int i = 0; i <= 1000; i++) {
|
||||
double q = 0.001 * i;
|
||||
double dist = Dist.quantile(q, data);
|
||||
double td = digest.quantile(q);
|
||||
assertEquals(String.valueOf(q), dist, td, 0.75);
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Verifies behavior involving interpolation between singleton centroids.
|
||||
*/
|
||||
public void testSingleSingleRange() {
|
||||
TDigest digest = factory().create();
|
||||
digest.add(1);
|
||||
digest.add(2);
|
||||
digest.add(3);
|
||||
|
||||
// verify the cdf is a step between singletons
|
||||
assertEquals(0.5 / 3.0, digest.cdf(1), 0);
|
||||
assertEquals(1.5 / 3.0, digest.cdf(2), 0);
|
||||
assertEquals(2.5 / 3.0, digest.cdf(3), 0);
|
||||
}
|
||||
|
||||
/**
|
||||
* Tests cases where min or max is not the same as the extreme centroid which has weight>1. In these cases min and
|
||||
* max give us a little information we wouldn't otherwise have.
|
||||
*/
|
||||
public void testSingletonAtEnd() {
|
||||
TDigest digest = factory().create();
|
||||
digest.add(1);
|
||||
digest.add(2);
|
||||
digest.add(3);
|
||||
|
||||
assertEquals(1, digest.getMin(), 0);
|
||||
assertEquals(3, digest.getMax(), 0);
|
||||
assertEquals(3, digest.centroidCount());
|
||||
assertEquals(0, digest.cdf(0), 0);
|
||||
assertEquals(0, digest.cdf(1 - 1e-9), 0);
|
||||
assertEquals(0.5 / 3, digest.cdf(1), 1e-10);
|
||||
assertEquals(1.0 / 6, digest.cdf(1 + 1e-10), 1e-10);
|
||||
assertEquals(0.9, digest.cdf(3 - 1e-9), 0.1);
|
||||
assertEquals(2.5 / 3, digest.cdf(3), 0);
|
||||
assertEquals(1.0, digest.cdf(3 + 1e-9), 0);
|
||||
|
||||
digest.add(1);
|
||||
assertEquals(1.0 / 4, digest.cdf(1), 0);
|
||||
|
||||
// normally min == mean[0] because weight[0] == 1
|
||||
// we can force this not to be true for testing
|
||||
digest = factory().create();
|
||||
digest.setScaleFunction(ScaleFunction.K_0);
|
||||
for (int i = 0; i < 100; i++) {
|
||||
digest.add(1);
|
||||
digest.add(2);
|
||||
digest.add(3);
|
||||
}
|
||||
// This sample would normally be added to the first cluster that already exists
|
||||
// but there is special code in place to prevent that centroid from ever
|
||||
// having weight of more than one
|
||||
// As such, near q=0, cdf and quantiles
|
||||
// should reflect this single sample as a singleton
|
||||
digest.add(0);
|
||||
assertTrue(digest.centroidCount() > 0);
|
||||
Centroid first = digest.centroids().iterator().next();
|
||||
assertEquals(1, first.count());
|
||||
assertEquals(first.mean(), digest.getMin(), 0.0);
|
||||
assertEquals(0.0, digest.getMin(), 0);
|
||||
assertEquals(0, digest.cdf(0 - 1e-9), 0);
|
||||
assertEquals(0.5 / digest.size(), digest.cdf(0), 1e-10);
|
||||
assertEquals(0.5 / digest.size(), digest.cdf(1e-9), 1e-10);
|
||||
|
||||
assertEquals(0, digest.quantile(0), 0);
|
||||
assertEquals(0.0, digest.quantile(0.5 / digest.size()), 0.1);
|
||||
assertEquals(0.4, digest.quantile(1.0 / digest.size()), 0.2);
|
||||
assertEquals(first.mean(), 0.0, 1e-5);
|
||||
|
||||
digest.add(4);
|
||||
Centroid last = digest.centroids().stream().reduce((prev, next) -> next).orElse(null);
|
||||
assertNotNull(last);
|
||||
assertEquals(1.0, last.count(), 0.0);
|
||||
assertEquals(4, last.mean(), 0);
|
||||
assertEquals(1.0, digest.cdf(digest.getMax() + 1e-9), 0);
|
||||
assertEquals(1 - 0.5 / digest.size(), digest.cdf(digest.getMax()), 0);
|
||||
assertEquals(1.0, digest.cdf((digest.getMax() - 1e-9)), 0.01);
|
||||
|
||||
assertEquals(4, digest.quantile(1), 0);
|
||||
assertEquals(last.mean(), 4, 0);
|
||||
}
|
||||
|
||||
public void testFewRepeatedValues() {
|
||||
TDigest d = factory().create();
|
||||
for (int i = 0; i < 2; ++i) {
|
||||
d.add(9000);
|
||||
}
|
||||
for (int i = 0; i < 11; ++i) {
|
||||
d.add(3000);
|
||||
}
|
||||
for (int i = 0; i < 26; ++i) {
|
||||
d.add(1000);
|
||||
}
|
||||
|
||||
assertEquals(3000.0, d.quantile(0.90), 1e-5);
|
||||
assertEquals(6300.0, d.quantile(0.95), 1e-5);
|
||||
assertEquals(8640.0, d.quantile(0.96), 1e-5);
|
||||
assertEquals(9000.0, d.quantile(0.97), 1e-5);
|
||||
assertEquals(9000.0, d.quantile(1.00), 1e-5);
|
||||
}
|
||||
|
||||
public void testSingleValue() {
|
||||
Random rand = random();
|
||||
final TDigest digest = factory().create();
|
||||
final double value = rand.nextDouble() * 1000;
|
||||
digest.add(value);
|
||||
final double q = rand.nextDouble();
|
||||
for (double qValue : new double[] { 0, q, 1 }) {
|
||||
assertEquals(value, digest.quantile(qValue), 0.001f);
|
||||
}
|
||||
}
|
||||
|
||||
public void testFewValues() {
|
||||
// When there are few values in the tree, quantiles should be exact
|
||||
final TDigest digest = factory().create();
|
||||
final Random r = random();
|
||||
final int length = r.nextInt(10);
|
||||
final List<Double> values = new ArrayList<>();
|
||||
for (int i = 0; i < length; ++i) {
|
||||
final double value;
|
||||
if (i == 0 || r.nextBoolean()) {
|
||||
value = r.nextDouble() * 100;
|
||||
} else {
|
||||
// introduce duplicates
|
||||
value = values.get(i - 1);
|
||||
}
|
||||
digest.add(value);
|
||||
values.add(value);
|
||||
}
|
||||
Collections.sort(values);
|
||||
|
||||
// for this value of the compression, the tree shouldn't have merged any node
|
||||
assertEquals(digest.centroids().size(), values.size());
|
||||
for (double q : new double[] { 0, 1e-10, r.nextDouble(), 0.5, 1 - 1e-10, 1 }) {
|
||||
double q1 = Dist.quantile(q, values);
|
||||
double q2 = digest.quantile(q);
|
||||
assertEquals(String.valueOf(q), q1, q2, q1);
|
||||
}
|
||||
}
|
||||
|
||||
public void testEmptyDigest() {
|
||||
TDigest digest = factory().create();
|
||||
assertEquals(0, digest.centroids().size());
|
||||
assertEquals(0, digest.centroids().size());
|
||||
}
|
||||
|
||||
public void testEmpty() {
|
||||
final TDigest digest = factory().create();
|
||||
final double q = random().nextDouble();
|
||||
assertTrue(Double.isNaN(digest.quantile(q)));
|
||||
}
|
||||
|
||||
public void testMoreThan2BValues() {
|
||||
final TDigest digest = factory().create();
|
||||
// carefully build a t-digest that is as if we added 3 uniform values from [0,1]
|
||||
double n = 3e9;
|
||||
double q0 = 0;
|
||||
for (int i = 0; i < 200 && q0 < 1 - 1e-10; ++i) {
|
||||
double k0 = digest.scale.k(q0, digest.compression(), n);
|
||||
double q = digest.scale.q(k0 + 1, digest.compression(), n);
|
||||
int m = (int) Math.max(1, n * (q - q0));
|
||||
digest.add((q + q0) / 2, m);
|
||||
q0 = q0 + m / n;
|
||||
}
|
||||
digest.compress();
|
||||
assertEquals(3_000_000_000L, digest.size());
|
||||
assertTrue(digest.size() > Integer.MAX_VALUE);
|
||||
final double[] quantiles = new double[] { 0, 0.1, 0.5, 0.9, 1 };
|
||||
double prev = Double.NEGATIVE_INFINITY;
|
||||
for (double q : quantiles) {
|
||||
final double v = digest.quantile(q);
|
||||
assertTrue(String.format(Locale.ROOT, "q=%.1f, v=%.4f, pref=%.4f", q, v, prev), v >= prev);
|
||||
prev = v;
|
||||
}
|
||||
}
|
||||
|
||||
public void testSorted() {
|
||||
final TDigest digest = factory().create();
|
||||
Random gen = random();
|
||||
for (int i = 0; i < 10000; ++i) {
|
||||
int w = 1 + gen.nextInt(10);
|
||||
double x = gen.nextDouble();
|
||||
for (int j = 0; j < w; j++) {
|
||||
digest.add(x);
|
||||
}
|
||||
}
|
||||
Centroid previous = null;
|
||||
for (Centroid centroid : digest.centroids()) {
|
||||
if (previous != null) {
|
||||
if (previous.mean() <= centroid.mean()) {
|
||||
assertTrue(Double.compare(previous.mean(), centroid.mean()) <= 0);
|
||||
}
|
||||
}
|
||||
previous = centroid;
|
||||
}
|
||||
}
|
||||
|
||||
public void testNaN() {
|
||||
final TDigest digest = factory().create();
|
||||
Random gen = random();
|
||||
final int iters = gen.nextInt(100);
|
||||
for (int i = 0; i < iters; ++i) {
|
||||
digest.add(gen.nextDouble(), 1 + gen.nextInt(10));
|
||||
}
|
||||
try {
|
||||
// both versions should fail
|
||||
if (gen.nextBoolean()) {
|
||||
digest.add(Double.NaN);
|
||||
} else {
|
||||
digest.add(Double.NaN, 1);
|
||||
}
|
||||
fail("NaN should be an illegal argument");
|
||||
} catch (IllegalArgumentException e) {
|
||||
// expected
|
||||
}
|
||||
}
|
||||
|
||||
public void testMidPointRule() {
|
||||
TDigest dist = factory(200).create();
|
||||
dist.add(1);
|
||||
dist.add(2);
|
||||
|
||||
for (int i = 0; i < 1000; i++) {
|
||||
dist.add(1);
|
||||
dist.add(2);
|
||||
if (i % 8 == 0) {
|
||||
String message = String.format(Locale.ROOT, "i = %d", i);
|
||||
assertEquals(message, 0, dist.cdf(1 - 1e-9), 0);
|
||||
assertEquals(message, 0.25, dist.cdf(1), 0.1);
|
||||
assertEquals(message, 0.75, dist.cdf(2), 0.1);
|
||||
assertEquals(message, 1, dist.cdf(2 + 1e-9), 0);
|
||||
|
||||
assertEquals(1.0, dist.quantile(0.0), 1e-5);
|
||||
assertEquals(1.0, dist.quantile(0.1), 1e-5);
|
||||
assertEquals(1.0, dist.quantile(0.2), 1e-5);
|
||||
|
||||
assertTrue(dist.quantile(0.5) > 1.0);
|
||||
assertTrue(dist.quantile(0.5) < 2.0);
|
||||
|
||||
assertEquals(2.0, dist.quantile(0.7), 1e-5);
|
||||
assertEquals(2.0, dist.quantile(0.8), 1e-5);
|
||||
assertEquals(2.0, dist.quantile(0.9), 1e-5);
|
||||
assertEquals(2.0, dist.quantile(1.0), 1e-5);
|
||||
}
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
public void testThreePointExample() {
|
||||
TDigest tdigest = factory().create();
|
||||
double x0 = 0.18615591526031494;
|
||||
double x1 = 0.4241943657398224;
|
||||
double x2 = 0.8813006281852722;
|
||||
|
||||
tdigest.add(x0);
|
||||
tdigest.add(x1);
|
||||
tdigest.add(x2);
|
||||
|
||||
double p10 = tdigest.quantile(0.1);
|
||||
double p50 = tdigest.quantile(0.5);
|
||||
double p90 = tdigest.quantile(0.9);
|
||||
double p95 = tdigest.quantile(0.95);
|
||||
double p99 = tdigest.quantile(0.99);
|
||||
|
||||
assertTrue(Double.compare(p10, p50) <= 0);
|
||||
assertTrue(Double.compare(p50, p90) <= 0);
|
||||
assertTrue(Double.compare(p90, p95) <= 0);
|
||||
assertTrue(Double.compare(p95, p99) <= 0);
|
||||
|
||||
assertEquals(x0, tdigest.quantile(0.0), 0);
|
||||
assertEquals(x2, tdigest.quantile(1.0), 0);
|
||||
|
||||
assertTrue(String.valueOf(p10), Double.compare(x0, p10) <= 0);
|
||||
assertTrue(String.valueOf(p10), Double.compare(x1, p10) >= 0);
|
||||
assertTrue(String.valueOf(p99), Double.compare(x1, p99) <= 0);
|
||||
assertTrue(String.valueOf(p99), Double.compare(x2, p99) >= 0);
|
||||
}
|
||||
|
||||
public void testSingletonInACrowd() {
|
||||
TDigest dist = factory().create();
|
||||
for (int i = 0; i < 10000; i++) {
|
||||
dist.add(10);
|
||||
}
|
||||
dist.add(20);
|
||||
dist.compress();
|
||||
|
||||
// The actual numbers depend on how the digest get constructed.
|
||||
// A singleton on the right boundary yields much better accuracy, e.g. q(0.9999) == 10.
|
||||
// Otherwise, quantiles above 0.9 use interpolation between 10 and 20, thus returning higher values.
|
||||
assertEquals(10.0, dist.quantile(0), 0);
|
||||
assertEquals(10.0, dist.quantile(0.9), 0);
|
||||
assertEquals(19.0, dist.quantile(0.99999), 1);
|
||||
assertEquals(20.0, dist.quantile(1), 0);
|
||||
}
|
||||
|
||||
public void testScaling() {
|
||||
final Random gen = random();
|
||||
|
||||
List<Double> data = new ArrayList<>();
|
||||
for (int i = 0; i < 100000; i++) {
|
||||
data.add(gen.nextDouble());
|
||||
}
|
||||
Collections.sort(data);
|
||||
|
||||
for (double compression : new double[] { 10, 20, 50, 100, 200, 500, 1000 }) {
|
||||
TDigest dist = factory(compression).create();
|
||||
for (Double x : data) {
|
||||
dist.add(x);
|
||||
}
|
||||
dist.compress();
|
||||
|
||||
for (double q : new double[] { 0.001, 0.01, 0.1, 0.5 }) {
|
||||
double estimate = dist.quantile(q);
|
||||
double actual = data.get((int) (q * data.size()));
|
||||
if (Double.compare(estimate, 0) != 0) {
|
||||
assertTrue(Double.compare(Math.abs(actual - estimate) / estimate, 1) < 0);
|
||||
} else {
|
||||
assertEquals(Double.compare(estimate, 0), 0);
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
public void testMonotonicity() {
|
||||
TDigest digest = factory().create();
|
||||
final Random gen = random();
|
||||
for (int i = 0; i < 100000; i++) {
|
||||
digest.add(gen.nextDouble());
|
||||
}
|
||||
|
||||
double lastQuantile = -1;
|
||||
double lastX = -1;
|
||||
for (double z = 0; z <= 1; z += 1e-4) {
|
||||
double x = digest.quantile(z);
|
||||
assertTrue("q: " + z + " x: " + x + " last: " + lastX, Double.compare(x, lastX) >= 0);
|
||||
lastX = x;
|
||||
|
||||
double q = digest.cdf(z);
|
||||
assertTrue("Q: " + z, Double.compare(q, lastQuantile) >= 0);
|
||||
lastQuantile = q;
|
||||
}
|
||||
}
|
||||
}
|
|
@ -43,7 +43,7 @@ tasks.named("yamlRestTestV7CompatTransform").configure { task ->
|
|||
task.skipTest("search.aggregation/20_terms/numeric profiler", "The profiler results aren't backwards compatible.")
|
||||
task.skipTest("search.aggregation/210_top_hits_nested_metric/top_hits aggregation with sequence numbers", "#42809 the use nested path and filter sort throws an exception")
|
||||
task.skipTest("search.aggregation/370_doc_count_field/Test filters agg with doc_count", "Uses profiler for assertions which is not backwards compatible")
|
||||
|
||||
task.skipTest("search.aggregation/420_percentile_ranks_tdigest_metric/filtered", "Uses t-digest library which is not backwards compatible")
|
||||
task.addAllowedWarningRegex("\\[types removal\\].*")
|
||||
}
|
||||
|
||||
|
|
|
@ -70,15 +70,15 @@ filtered:
|
|||
percentile_ranks_int:
|
||||
percentile_ranks:
|
||||
field: int
|
||||
values: [50]
|
||||
values: [51]
|
||||
percentile_ranks_double:
|
||||
percentile_ranks:
|
||||
field: double
|
||||
values: [50]
|
||||
values: [51]
|
||||
|
||||
- match: { hits.total.value: 3 }
|
||||
- close_to: { aggregations.percentile_ranks_int.values.50\\.0: { value: 16.0, error: 1} }
|
||||
- close_to: { aggregations.percentile_ranks_double.values.50\\.0: { value: 16.0, error: 1} }
|
||||
- close_to: { aggregations.percentile_ranks_int.values.51\\.0: { value: 16.0, error: 1} }
|
||||
- close_to: { aggregations.percentile_ranks_double.values.51\\.0: { value: 16.0, error: 1} }
|
||||
|
||||
---
|
||||
missing field with missing param:
|
||||
|
@ -99,7 +99,7 @@ missing field with missing param:
|
|||
- match: { hits.total.value: 4 }
|
||||
- close_to: { aggregations.percentile_ranks_missing.values.50\\.0: { value: 100.0, error: 1} }
|
||||
- close_to: { aggregations.percentile_ranks_missing.values.99\\.0: { value: 100.0, error: 1} }
|
||||
|
||||
|
||||
---
|
||||
missing field without missing param:
|
||||
- do:
|
||||
|
@ -160,7 +160,6 @@ invalid params:
|
|||
non-keyed test:
|
||||
- skip:
|
||||
features: close_to
|
||||
|
||||
- do:
|
||||
search:
|
||||
body:
|
||||
|
@ -174,7 +173,7 @@ non-keyed test:
|
|||
|
||||
- match: { hits.total.value: 4 }
|
||||
- match: { aggregations.percentile_ranks_int.values.0.key: 50}
|
||||
- close_to: { aggregations.percentile_ranks_int.values.0.value: { value: 37.0, error: 1} }
|
||||
- close_to: { aggregations.percentile_ranks_int.values.0.value: { value: 30.0, error: 10.0 } }
|
||||
- match: { aggregations.percentile_ranks_int.values.1.key: 99}
|
||||
- close_to: { aggregations.percentile_ranks_int.values.1.value: { value: 61.5, error: 1} }
|
||||
- close_to: { aggregations.percentile_ranks_int.values.1.value: { value: 55.0, error: 10.0 } }
|
||||
|
||||
|
|
|
@ -43,9 +43,9 @@ setup:
|
|||
double_field: 151.0
|
||||
string_field: foo
|
||||
|
||||
|
||||
---
|
||||
"Basic test":
|
||||
|
||||
- do:
|
||||
search:
|
||||
rest_total_hits_as_int: true
|
||||
|
@ -115,8 +115,79 @@ setup:
|
|||
|
||||
|
||||
---
|
||||
"Only aggs test":
|
||||
"Basic test - approximate":
|
||||
- skip:
|
||||
features: close_to
|
||||
- do:
|
||||
search:
|
||||
rest_total_hits_as_int: true
|
||||
body:
|
||||
aggs:
|
||||
percentiles_int:
|
||||
percentiles:
|
||||
field: int_field
|
||||
percentiles_double:
|
||||
percentiles:
|
||||
field: double_field
|
||||
|
||||
- match: { hits.total: 4 }
|
||||
- length: { hits.hits: 4 }
|
||||
|
||||
- close_to: { aggregations.percentiles_int.values.1\.0: { value: 5.0, error: 5.0 } }
|
||||
- close_to: { aggregations.percentiles_int.values.5\.0: { value: 5.0, error: 5.0 } }
|
||||
- close_to: { aggregations.percentiles_int.values.25\.0: { value: 30.0, error: 10.0 } }
|
||||
- close_to: { aggregations.percentiles_int.values.50\.0: { value: 76.0, error: 1.0 } }
|
||||
- close_to: { aggregations.percentiles_int.values.75\.0: { value: 120.0, error: 10.0 } }
|
||||
- close_to: { aggregations.percentiles_int.values.95\.0: { value: 146.0, error: 5.0 } }
|
||||
- close_to: { aggregations.percentiles_int.values.99\.0: { value: 150.0, error: 1.0 } }
|
||||
|
||||
- close_to: { aggregations.percentiles_double.values.1\.0: { value: 5.0, error: 5.0 } }
|
||||
- close_to: { aggregations.percentiles_double.values.5\.0: { value: 5.0, error: 5.0 } }
|
||||
- close_to: { aggregations.percentiles_double.values.25\.0: { value: 30.0, error: 10.0 } }
|
||||
- close_to: { aggregations.percentiles_double.values.50\.0: { value: 76.0, error: 1.0 } }
|
||||
- close_to: { aggregations.percentiles_double.values.75\.0: { value: 120.0, error: 10.0 } }
|
||||
- close_to: { aggregations.percentiles_double.values.95\.0: { value: 146.0, error: 5.0 } }
|
||||
- close_to: { aggregations.percentiles_double.values.99\.0: { value: 150.0, error: 1.0 } }
|
||||
|
||||
- do:
|
||||
search:
|
||||
rest_total_hits_as_int: true
|
||||
body:
|
||||
aggs:
|
||||
percentiles_int:
|
||||
percentiles:
|
||||
field: int_field
|
||||
tdigest:
|
||||
compression: 200
|
||||
percentiles_double:
|
||||
percentiles:
|
||||
field: double_field
|
||||
tdigest:
|
||||
compression: 200
|
||||
|
||||
- match: { hits.total: 4 }
|
||||
- length: { hits.hits: 4 }
|
||||
|
||||
- close_to: { aggregations.percentiles_int.values.1\.0: { value: 5.0, error: 5.0 } }
|
||||
- close_to: { aggregations.percentiles_int.values.5\.0: { value: 5.0, error: 5.0 } }
|
||||
- close_to: { aggregations.percentiles_int.values.25\.0: { value: 30.0, error: 10.0 } }
|
||||
- close_to: { aggregations.percentiles_int.values.50\.0: { value: 76.0, error: 1.0 } }
|
||||
- close_to: { aggregations.percentiles_int.values.75\.0: { value: 120.0, error: 10.0 } }
|
||||
- close_to: { aggregations.percentiles_int.values.95\.0: { value: 146.0, error: 5.0 } }
|
||||
- close_to: { aggregations.percentiles_int.values.99\.0: { value: 150.0, error: 1.0 } }
|
||||
|
||||
- close_to: { aggregations.percentiles_double.values.1\.0: { value: 5.0, error: 5.0 } }
|
||||
- close_to: { aggregations.percentiles_double.values.5\.0: { value: 5.0, error: 5.0 } }
|
||||
- close_to: { aggregations.percentiles_double.values.25\.0: { value: 30.0, error: 10.0 } }
|
||||
- close_to: { aggregations.percentiles_double.values.50\.0: { value: 76.0, error: 1.0 } }
|
||||
- close_to: { aggregations.percentiles_double.values.75\.0: { value: 120.0, error: 10.0 } }
|
||||
- close_to: { aggregations.percentiles_double.values.95\.0: { value: 146.0, error: 5.0 } }
|
||||
- close_to: { aggregations.percentiles_double.values.99\.0: { value: 150.0, error: 1.0 } }
|
||||
|
||||
---
|
||||
"Only aggs test":
|
||||
- skip:
|
||||
features: close_to
|
||||
- do:
|
||||
search:
|
||||
rest_total_hits_as_int: true
|
||||
|
@ -133,27 +204,26 @@ setup:
|
|||
- match: { hits.total: 4 }
|
||||
- length: { hits.hits: 0 }
|
||||
|
||||
- match: { aggregations.percentiles_int.values.1\.0: 1.0 }
|
||||
- match: { aggregations.percentiles_int.values.5\.0: 1.0 }
|
||||
- match: { aggregations.percentiles_int.values.25\.0: 26.0 }
|
||||
- match: { aggregations.percentiles_int.values.50\.0: 76.0 }
|
||||
- match: { aggregations.percentiles_int.values.75\.0: 126.0 }
|
||||
- match: { aggregations.percentiles_int.values.95\.0: 151.0 }
|
||||
- match: { aggregations.percentiles_int.values.99\.0: 151.0 }
|
||||
|
||||
- match: { aggregations.percentiles_double.values.1\.0: 1.0 }
|
||||
- match: { aggregations.percentiles_double.values.5\.0: 1.0 }
|
||||
- match: { aggregations.percentiles_double.values.25\.0: 26.0 }
|
||||
- match: { aggregations.percentiles_double.values.50\.0: 76.0 }
|
||||
- match: { aggregations.percentiles_double.values.75\.0: 126.0 }
|
||||
- match: { aggregations.percentiles_double.values.95\.0: 151.0 }
|
||||
- match: { aggregations.percentiles_double.values.99\.0: 151.0 }
|
||||
|
||||
- close_to: { aggregations.percentiles_int.values.1\.0: { value: 5.0, error: 5.0 } }
|
||||
  - close_to: { aggregations.percentiles_int.values.5\.0: { value: 5.0, error: 5.0 } }
  - close_to: { aggregations.percentiles_int.values.25\.0: { value: 30.0, error: 10.0 } }
  - close_to: { aggregations.percentiles_int.values.50\.0: { value: 76.0, error: 1.0 } }
  - close_to: { aggregations.percentiles_int.values.75\.0: { value: 120.0, error: 10.0 } }
  - close_to: { aggregations.percentiles_int.values.95\.0: { value: 146.0, error: 5.0 } }
  - close_to: { aggregations.percentiles_int.values.99\.0: { value: 150.0, error: 1.0 } }

  - close_to: { aggregations.percentiles_double.values.1\.0: { value: 5.0, error: 5.0 } }
  - close_to: { aggregations.percentiles_double.values.5\.0: { value: 5.0, error: 5.0 } }
  - close_to: { aggregations.percentiles_double.values.25\.0: { value: 30.0, error: 10.0 } }
  - close_to: { aggregations.percentiles_double.values.50\.0: { value: 76.0, error: 1.0 } }
  - close_to: { aggregations.percentiles_double.values.75\.0: { value: 120.0, error: 10.0 } }
  - close_to: { aggregations.percentiles_double.values.95\.0: { value: 146.0, error: 5.0 } }
  - close_to: { aggregations.percentiles_double.values.99\.0: { value: 150.0, error: 1.0 } }

---
"Filtered test":

  - skip:
      features: close_to
  - do:
      search:
        rest_total_hits_as_int: true

@@ -175,25 +245,25 @@ setup:
  - match: { hits.total: 3 }
  - length: { hits.hits: 3 }

  - match: { aggregations.percentiles_int.values.1\.0: 51.0 }
  - match: { aggregations.percentiles_int.values.5\.0: 51.0 }
  - match: { aggregations.percentiles_int.values.25\.0: 63.5 }
  - match: { aggregations.percentiles_int.values.50\.0: 101.0 }
  - match: { aggregations.percentiles_int.values.75\.0: 138.5 }
  - match: { aggregations.percentiles_int.values.95\.0: 151.0 }
  - match: { aggregations.percentiles_int.values.99\.0: 151.0 }
  - close_to: { aggregations.percentiles_int.values.1\.0: { value: 52.0, error: 1.0 } }
  - close_to: { aggregations.percentiles_int.values.5\.0: { value: 54.0, error: 3.0 } }
  - close_to: { aggregations.percentiles_int.values.25\.0: { value: 70.0, error: 10.0 } }
  - close_to: { aggregations.percentiles_int.values.50\.0: { value: 101.0, error: 1.0 } }
  - close_to: { aggregations.percentiles_int.values.75\.0: { value: 130.0, error: 10.0 } }
  - close_to: { aggregations.percentiles_int.values.95\.0: { value: 148.0, error: 3.0 } }
  - close_to: { aggregations.percentiles_int.values.99\.0: { value: 150.0, error: 1.0 } }

  - close_to: { aggregations.percentiles_double.values.1\.0: { value: 52.0, error: 1.0 } }
  - close_to: { aggregations.percentiles_double.values.5\.0: { value: 54.0, error: 3.0 } }
  - close_to: { aggregations.percentiles_double.values.25\.0: { value: 70.0, error: 10.0 } }
  - close_to: { aggregations.percentiles_double.values.50\.0: { value: 101.0, error: 1.0 } }
  - close_to: { aggregations.percentiles_double.values.75\.0: { value: 130.0, error: 10.0 } }
  - close_to: { aggregations.percentiles_double.values.95\.0: { value: 148.0, error: 3.0 } }
  - close_to: { aggregations.percentiles_double.values.99\.0: { value: 150.0, error: 1.0 } }

  - match: { aggregations.percentiles_double.values.1\.0: 51.0 }
  - match: { aggregations.percentiles_double.values.5\.0: 51.0 }
  - match: { aggregations.percentiles_double.values.25\.0: 63.5 }
  - match: { aggregations.percentiles_double.values.50\.0: 101.0 }
  - match: { aggregations.percentiles_double.values.75\.0: 138.5 }
  - match: { aggregations.percentiles_double.values.95\.0: 151.0 }
  - match: { aggregations.percentiles_double.values.99\.0: 151.0 }

---
"Missing field with missing param":

  - do:
      search:
        rest_total_hits_as_int: true

@@ -233,7 +303,8 @@ setup:

---
"Metadata test":

  - skip:
      features: close_to
  - do:
      search:
        rest_total_hits_as_int: true

@@ -250,17 +321,17 @@ setup:
  - match: { aggregations.percentiles_int.meta.foo: "bar" }

  - match: { aggregations.percentiles_int.values.1\.0: 1.0 }
  - match: { aggregations.percentiles_int.values.5\.0: 1.0 }
  - match: { aggregations.percentiles_int.values.25\.0: 26.0 }
  - match: { aggregations.percentiles_int.values.50\.0: 76.0 }
  - match: { aggregations.percentiles_int.values.75\.0: 126.0 }
  - match: { aggregations.percentiles_int.values.95\.0: 151.0 }
  - match: { aggregations.percentiles_int.values.99\.0: 151.0 }
  - close_to: { aggregations.percentiles_int.values.1\.0: { value: 5.0, error: 5.0 } }
  - close_to: { aggregations.percentiles_int.values.5\.0: { value: 5.0, error: 5.0 } }
  - close_to: { aggregations.percentiles_int.values.25\.0: { value: 30.0, error: 10.0 } }
  - close_to: { aggregations.percentiles_int.values.50\.0: { value: 76.0, error: 1.0 } }
  - close_to: { aggregations.percentiles_int.values.75\.0: { value: 120.0, error: 10.0 } }
  - close_to: { aggregations.percentiles_int.values.95\.0: { value: 146.0, error: 5.0 } }
  - close_to: { aggregations.percentiles_int.values.99\.0: { value: 150.0, error: 1.0 } }

---
"Invalid params test":

  - do:
      catch: /\[compression\] must be greater than or equal to 0. Found \[-1.0\]/
      search:

@@ -316,9 +387,11 @@ setup:
            percentiles:
              field: string_field

---
"Explicit Percents test":

  - skip:
      features: close_to
  - do:
      search:
        rest_total_hits_as_int: true

@@ -337,17 +410,19 @@ setup:
  - match: { hits.total: 4 }
  - length: { hits.hits: 4 }

  - match: { aggregations.percentiles_int.values.5\.0: 1.0 }
  - match: { aggregations.percentiles_int.values.25\.0: 26.0 }
  - match: { aggregations.percentiles_int.values.50\.0: 76.0 }
  - close_to: { aggregations.percentiles_int.values.5\.0: { value: 5.0, error: 4.0 } }
  - close_to: { aggregations.percentiles_int.values.25\.0: { value: 30.0, error: 10.0 } }
  - close_to: { aggregations.percentiles_int.values.50\.0: { value: 76.0, error: 1.0 } }

  - close_to: { aggregations.percentiles_double.values.5\.0: { value: 5.0, error: 4.0 } }
  - close_to: { aggregations.percentiles_double.values.25\.0: { value: 30.0, error: 10.0 } }
  - close_to: { aggregations.percentiles_double.values.50\.0: { value: 76.0, error: 1.0 } }

  - match: { aggregations.percentiles_double.values.5\.0: 1.0 }
  - match: { aggregations.percentiles_double.values.25\.0: 26.0 }
  - match: { aggregations.percentiles_double.values.50\.0: 76.0 }

---
"Non-keyed test":

  - skip:
      features: close_to
  - do:
      search:
        rest_total_hits_as_int: true

@@ -364,8 +439,8 @@ setup:
  - length: { hits.hits: 4 }

  - match: { aggregations.percentiles_int.values.0.key: 5.0 }
  - match: { aggregations.percentiles_int.values.0.value: 1.0 }
  - close_to: { aggregations.percentiles_int.values.0.value: { value: 5.0, error: 4.0 } }
  - match: { aggregations.percentiles_int.values.1.key: 25.0 }
  - match: { aggregations.percentiles_int.values.1.value: 26.0 }
  - close_to: { aggregations.percentiles_int.values.1.value: { value: 30.0, error: 10.0 } }
  - match: { aggregations.percentiles_int.values.2.key: 50.0 }
  - match: { aggregations.percentiles_int.values.2.value: 76.0 }
  - close_to: { aggregations.percentiles_int.values.2.value: { value: 76.0, error: 1.0 } }

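The assertions above rely on the test runner's `close_to` feature, which accepts a result as long as it lands within an absolute error band around the expected value, so the forked digest only needs to come close to the old percentiles rather than reproduce them bit for bit. A minimal Java sketch of that check, with an illustrative class name and sample values not taken from the test file:

```java
public final class CloseToSketch {
    // Assumed semantics of a close_to assertion: pass when |actual - expected| <= error.
    static boolean closeTo(double actual, double expected, double error) {
        return Math.abs(actual - expected) <= error;
    }

    public static void main(String[] args) {
        // Mirrors the 50th-percentile assertion above: expected 76.0 with error 1.0.
        System.out.println(closeTo(76.4, 76.0, 1.0)); // true
        System.out.println(closeTo(78.0, 76.0, 1.0)); // false
    }
}
```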
@@ -34,6 +34,7 @@ dependencies {
  api project(":libs:elasticsearch-plugin-api")
  api project(":libs:elasticsearch-plugin-analysis-api")
  api project(':libs:elasticsearch-grok')
  api project(":libs:elasticsearch-tdigest")

  implementation project(':libs:elasticsearch-plugin-classloader')
  // no compile dependency by server, but server defines security policy for this codebase so it i>

@@ -57,8 +58,6 @@ dependencies {
  api project(":libs:elasticsearch-cli")
  implementation 'com.carrotsearch:hppc:0.8.1'

  // percentiles aggregation
  api 'com.tdunning:t-digest:3.2'
  // precentil ranks aggregation
  api 'org.hdrhistogram:HdrHistogram:2.1.9'

@@ -1,4 +0,0 @@
The code for the t-digest was originally authored by Ted Dunning

A number of small but very helpful changes have been contributed by Adrien Grand (https://github.com/jpountz)

@@ -28,6 +28,7 @@ module org.elasticsearch.server {
    requires org.elasticsearch.plugin;
    requires org.elasticsearch.plugin.analysis;
    requires org.elasticsearch.grok;
    requires org.elasticsearch.tdigest;

    requires com.sun.jna;
    requires hppc;

@@ -35,7 +36,6 @@ module org.elasticsearch.server {
    requires jopt.simple;
    requires log4j2.ecs.layout;
    requires org.lz4.java;
    requires t.digest;

    requires org.apache.logging.log4j;
    requires org.apache.logging.log4j.core;

@@ -49,6 +49,10 @@ abstract class AbstractInternalTDigestPercentiles extends InternalNumericMetrics
        this.keys = keys;
        this.state = state;
        this.keyed = keyed;

        if (state != null) {
            state.compress();
        }
    }

    /**

@@ -106,10 +110,6 @@ abstract class AbstractInternalTDigestPercentiles extends InternalNumericMetrics
        return format;
    }

    public long getEstimatedMemoryFootprint() {
        return state.byteSize();
    }

    /**
     * Return the internal {@link TDigestState} sketch for this metric.
     */

@@ -8,8 +8,7 @@

package org.elasticsearch.search.aggregations.metrics;

import com.tdunning.math.stats.Centroid;
import com.tdunning.math.stats.TDigest;
import org.elasticsearch.tdigest.TDigest;

import java.util.List;

@@ -18,11 +17,6 @@ public final class EmptyTDigestState extends TDigestState {
        super(1.0D);
    }

    @Override
    public TDigest recordAllData() {
        throw new UnsupportedOperationException("Immutable Empty TDigest");
    }

    @Override
    public void add(double x, int w) {
        throw new UnsupportedOperationException("Immutable Empty TDigest");

@@ -34,32 +28,10 @@ public final class EmptyTDigestState extends TDigestState {
    }

    @Override
    public void add(double x, int w, List<Double> data) {
        throw new UnsupportedOperationException("Immutable Empty TDigest");
    }

    @Override
    public void compress() {
        throw new UnsupportedOperationException("Immutable Empty TDigest");
    }

    @Override
    public void add(double x) {
        throw new UnsupportedOperationException("Immutable Empty TDigest");
    }
    public void compress() {}

    @Override
    public void add(TDigest other) {
        throw new UnsupportedOperationException("Immutable Empty TDigest");
    }

    @Override
    protected Centroid createCentroid(double mean, int id) {
        throw new UnsupportedOperationException("Immutable Empty TDigest");
    }

    @Override
    public boolean isRecording() {
        return false;
    }
}

@@ -45,7 +45,6 @@ public class InternalMedianAbsoluteDeviation extends InternalNumericMetricsAggre
    InternalMedianAbsoluteDeviation(String name, Map<String, Object> metadata, DocValueFormat format, TDigestState valuesSketch) {
        super(name, Objects.requireNonNull(format), metadata);
        this.valuesSketch = Objects.requireNonNull(valuesSketch);

        this.medianAbsoluteDeviation = computeMedianAbsoluteDeviation(this.valuesSketch);
    }

@@ -7,17 +7,16 @@
 */
package org.elasticsearch.search.aggregations.metrics;

import com.tdunning.math.stats.AVLTreeDigest;
import com.tdunning.math.stats.Centroid;

import org.elasticsearch.common.io.stream.StreamInput;
import org.elasticsearch.common.io.stream.StreamOutput;
import org.elasticsearch.tdigest.AVLTreeDigest;
import org.elasticsearch.tdigest.Centroid;

import java.io.IOException;
import java.util.Iterator;

/**
 * Extension of {@link com.tdunning.math.stats.TDigest} with custom serialization.
 * Extension of {@link org.elasticsearch.tdigest.TDigest} with custom serialization.
 */
public class TDigestState extends AVLTreeDigest {

@@ -54,10 +53,13 @@ public class TDigestState extends AVLTreeDigest {

    @Override
    public boolean equals(Object obj) {
        if (obj == null || obj instanceof TDigestState == false) {
        if (obj instanceof TDigestState == false) {
            return false;
        }
        TDigestState that = (TDigestState) obj;
        if (this == that) {
            return true;
        }
        if (compression != that.compression) {
            return false;
        }

@@ -67,9 +69,10 @@ public class TDigestState extends AVLTreeDigest {
        if (this.getMin() != that.getMin()) {
            return false;
        }
        if (this.isRecording() != that.isRecording()) {
        if (this.centroidCount() != that.centroidCount()) {
            return false;
        }

        Iterator<? extends Centroid> thisCentroids = centroids().iterator();
        Iterator<? extends Centroid> thatCentroids = that.centroids().iterator();
        while (thisCentroids.hasNext()) {

@@ -87,14 +90,13 @@ public class TDigestState extends AVLTreeDigest {

    @Override
    public int hashCode() {
        int h = 31 * Double.hashCode(compression);
        int h = 31 * Double.hashCode(compression) + Integer.hashCode(centroidCount());
        for (Centroid centroid : centroids()) {
            h = 31 * h + Double.hashCode(centroid.mean());
            h = 31 * h + centroid.count();
        }
        h = 31 * h + Double.hashCode(getMax());
        h = 31 * h + Double.hashCode(getMin());
        h = 31 * h + Boolean.hashCode(isRecording());
        return h;
    }
}

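With `equals` now keyed on the centroid count instead of the removed recording flag, and `hashCode` folding the centroid count in as well, two digests built from the same samples at the same compression should compare equal and share a hash code. A minimal sketch of that contract, assuming `TDigestState` is on the classpath; the class name and sample values below are illustrative, not part of the diff:

```java
import org.elasticsearch.search.aggregations.metrics.TDigestState;

public final class TDigestStateEqualityCheck {
    public static void main(String[] args) {
        // Build two digests from identical samples with the same compression.
        TDigestState a = new TDigestState(100);
        TDigestState b = new TDigestState(100);
        for (double sample : new double[] { 1, 5, 10, 42, 100 }) {
            a.add(sample);
            b.add(sample);
        }
        // Under the updated contract (compression, centroid count, min/max, centroid list)
        // the two digests should be equal and hash to the same value.
        if (a.equals(b) == false || a.hashCode() != b.hashCode()) {
            throw new AssertionError("digests built from identical samples should compare equal");
        }
    }
}
```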
@@ -16,10 +16,6 @@ public class EmptyTDigestStateTests extends ESTestCase {

    private static final TDigestState singleton = new EmptyTDigestState();

    public void testRecordAllData() {
        expectThrows(UnsupportedOperationException.class, singleton::recordAllData);
    }

    public void testAddValue() {
        expectThrows(UnsupportedOperationException.class, () -> singleton.add(randomDouble()));
    }

@@ -32,22 +28,11 @@ public class EmptyTDigestStateTests extends ESTestCase {
        expectThrows(UnsupportedOperationException.class, () -> singleton.add(randomDouble(), randomInt(10)));
    }

    public void testCompress() {
        expectThrows(UnsupportedOperationException.class, singleton::compress);
    }

    public void testTestAddList() {
        expectThrows(
            UnsupportedOperationException.class,
            () -> singleton.add(randomDouble(), randomInt(10), List.of(randomDouble(), randomDouble()))
        );
        expectThrows(UnsupportedOperationException.class, () -> singleton.add(randomDouble(), randomInt(10)));
    }

    public void testTestAddListTDigest() {
        expectThrows(UnsupportedOperationException.class, () -> singleton.add(List.of(new EmptyTDigestState(), new EmptyTDigestState())));
    }

    public void testIsRecording() {
        assertFalse(singleton.isRecording());
    }
}

@@ -38,7 +38,6 @@ public class InternalTDigestPercentilesRanksTests extends InternalPercentilesRan
        final TDigestState state = new TDigestState(100);
        Arrays.stream(values).forEach(state::add);

        assertEquals(state.centroidCount(), values.length);
        return new InternalTDigestPercentileRanks(name, percents, state, keyed, format, metadata);
    }

@@ -39,7 +39,6 @@ public class InternalTDigestPercentilesTests extends InternalPercentilesTestCase
        final TDigestState state = new TDigestState(100);
        Arrays.stream(values).forEach(state::add);

        assertEquals(state.centroidCount(), values.length);
        return new InternalTDigestPercentiles(name, percents, state, keyed, format, metadata);
    }

@@ -7,8 +7,6 @@

package org.elasticsearch.xpack.analytics.boxplot;

import com.tdunning.math.stats.Centroid;

import org.elasticsearch.common.io.stream.StreamInput;
import org.elasticsearch.common.io.stream.StreamOutput;
import org.elasticsearch.search.DocValueFormat;

@@ -16,6 +14,7 @@ import org.elasticsearch.search.aggregations.AggregationReduceContext;
import org.elasticsearch.search.aggregations.InternalAggregation;
import org.elasticsearch.search.aggregations.metrics.InternalNumericMetricsAggregation;
import org.elasticsearch.search.aggregations.metrics.TDigestState;
import org.elasticsearch.tdigest.Centroid;
import org.elasticsearch.xcontent.XContentBuilder;

import java.io.IOException;

@@ -190,6 +189,7 @@ public class InternalBoxplot extends InternalNumericMetricsAggregation.MultiValu
    InternalBoxplot(String name, TDigestState state, DocValueFormat formatter, Map<String, Object> metadata) {
        super(name, formatter, metadata);
        this.state = state;
        this.state.compress();
    }

    /**

@@ -198,6 +198,7 @@ public class InternalBoxplot extends InternalNumericMetricsAggregation.MultiValu
    public InternalBoxplot(StreamInput in) throws IOException {
        super(in);
        state = TDigestState.read(in);
        state.compress();
    }

    @Override

@@ -7,14 +7,13 @@

package org.elasticsearch.xpack.analytics;

import com.tdunning.math.stats.Centroid;
import com.tdunning.math.stats.TDigest;

import org.HdrHistogram.DoubleHistogram;
import org.HdrHistogram.DoubleHistogramIterationValue;
import org.apache.lucene.document.BinaryDocValuesField;
import org.elasticsearch.common.io.stream.BytesStreamOutput;
import org.elasticsearch.search.aggregations.metrics.TDigestState;
import org.elasticsearch.tdigest.Centroid;
import org.elasticsearch.tdigest.TDigest;

import java.io.IOException;

@@ -6,8 +6,6 @@
 */
package org.elasticsearch.xpack.analytics.aggregations.metrics;

import com.tdunning.math.stats.Centroid;

import org.HdrHistogram.DoubleHistogram;
import org.HdrHistogram.DoubleHistogramIterationValue;
import org.apache.lucene.tests.util.TestUtil;

@@ -23,6 +21,7 @@ import org.elasticsearch.search.aggregations.metrics.InternalTDigestPercentiles;
import org.elasticsearch.search.aggregations.metrics.PercentilesAggregationBuilder;
import org.elasticsearch.search.aggregations.metrics.PercentilesMethod;
import org.elasticsearch.search.aggregations.metrics.TDigestState;
import org.elasticsearch.tdigest.Centroid;
import org.elasticsearch.test.ESSingleNodeTestCase;
import org.elasticsearch.xcontent.XContentBuilder;
import org.elasticsearch.xcontent.XContentFactory;

@@ -89,6 +89,8 @@ tasks.named("yamlRestTestV7CompatTest").configure {
    'aggregate-metrics/90_tsdb_mappings/aggregate_double_metric with wrong time series mappings',
    'analytics/histogram/histogram with wrong time series mappings',
    'analytics/histogram/histogram with time series mappings',
    'ml/evaluate_data_frame/Test classification auc_roc',
    'ml/evaluate_data_frame/Test classification auc_roc with default top_classes_field',
  ].join(',')
}

@@ -20,6 +20,7 @@ module org.elasticsearch.xcore {
    requires org.apache.lucene.core;
    requires org.apache.lucene.join;
    requires unboundid.ldapsdk;
    requires org.elasticsearch.tdigest;

    exports org.elasticsearch.index.engine.frozen;
    exports org.elasticsearch.license;

@@ -198,7 +198,7 @@ SELECT MAX(languages) max, MIN(languages) min, SUM(languages) sum, AVG(languages
null |null |null |null |null |null |null |null
1 |1 |15 |1 |1.0 |100.0 |NaN |NaN
2 |2 |38 |2 |2.0 |100.0 |NaN |NaN
3 |3 |51 |3 |3.0 |100.0 |NaN |NaN
3 |3 |51 |3 |3.0 |50.0 |NaN |NaN
4 |4 |72 |4 |4.0 |0.0 |NaN |NaN
;

@@ -1497,7 +1497,7 @@ SELECT PERCENTILE_RANK(bytes_in, 0) as "PERCENTILE_RANK_AllZeros" FROM logs WHER

PERCENTILE_RANK_AllZeros
------------------------
100.0
50.0
;

@@ -1741,4 +1741,4 @@ null |null |10.0.2.129 |1
30 |null |10.0.0.147 |1
32 |null |10.0.1.177 |1
48 |null |10.0.0.109 |1
;
;

@@ -567,7 +567,7 @@ ORDER BY status;

percentile:d | percentile_rank:d | status:s
---------------------+-------------------+---------------
1.8836190713044468E19|1.970336796004502 |Error
1.8836190713044468E19|0.0 |Error
1.7957483822449326E19|26.644793296251386 |OK
;

@@ -722,7 +722,7 @@ setup:
}
}
}
  - match: { classification.auc_roc.value: 0.7754152761810909 }
  - match: { classification.auc_roc.value: 0.77541527618109091 }
  - is_false: classification.auc_roc.curve
---
"Test classification auc_roc with default top_classes_field":

@@ -742,7 +742,7 @@ setup:
}
}
}
  - match: { classification.auc_roc.value: 0.7754152761810909 }
  - match: { classification.auc_roc.value: 0.77541527618109091 }
  - is_false: classification.auc_roc.curve
---
"Test classification accuracy with missing predicted_field":

@@ -897,7 +897,7 @@ public class VectorTileRestIT extends ESRestTestCase {
"percentiles": {
"field": "value1",
"percents": [95, 99, 99.9]
}
}
}
}
}""");

@@ -185,3 +185,10 @@ tasks.named("yamlRestTestV7CompatTransform").configure{ task ->
  task.replaceKeyInDo("ssl.certificates", "xpack-ssl.certificates", "Test get SSL certificates")
  task.addAllowedWarningRegexForTest(".*_xpack/ssl.* is deprecated.*", "Test get SSL certificates")
}

tasks.named("yamlRestTestV7CompatTest").configure {
  systemProperty 'tests.rest.blacklist', [
    'ml/evaluate_data_frame/Test classification auc_roc',
    'ml/evaluate_data_frame/Test classification auc_roc with default top_classes_field',
  ].join(',')
}