Probable Bug in HyperLogLogPlusPlus

Hello everyone,
I am facing an issue while running a certain aggregation query when cardinality is high. The thread seems to be stuck at this trace:

"elasticsearch[es-data-pz-listening4-es7-125-1b][search][T#12]" #167 daemon prio=5 os_prio=0 cpu=406513.48ms elapsed=2727.28s tid=0x0000aaab04deb130 nid=0x2d28bf runnable  [0x0000ffcf471fd000]
   java.lang.Thread.State: RUNNABLE
	at org.elasticsearch.search.aggregations.metrics.HyperLogLogPlusPlus$LinearCounting.addEncoded(org.elasticsearch.server@8.15.5/HyperLogLogPlusPlus.java:346)
	at org.elasticsearch.search.aggregations.metrics.AbstractLinearCounting.collect(org.elasticsearch.server@8.15.5/AbstractLinearCounting.java:48)
	at org.elasticsearch.search.aggregations.metrics.HyperLogLogPlusPlus.collect(org.elasticsearch.server@8.15.5/HyperLogLogPlusPlus.java:136)
	at org.elasticsearch.search.aggregations.metrics.CardinalityAggregator$OrdinalsCollector.postCollect(org.elasticsearch.server@8.15.5/CardinalityAggregator.java:352)
	at org.elasticsearch.search.aggregations.metrics.CardinalityAggregator.postCollectLastCollector(org.elasticsearch.server@8.15.5/CardinalityAggregator.java:157)
	at org.elasticsearch.search.aggregations.metrics.CardinalityAggregator.getLeafCollector(org.elasticsearch.server@8.15.5/CardinalityAggregator.java:148)
	at org.elasticsearch.search.aggregations.AggregatorBase.getLeafCollector(org.elasticsearch.server@8.15.5/AggregatorBase.java:233)
	at org.elasticsearch.search.aggregations.MultiBucketCollector$1.getLeafCollector(org.elasticsearch.server@8.15.5/MultiBucketCollector.java:92)
	at org.elasticsearch.search.aggregations.AggregatorBase.getLeafCollector(org.elasticsearch.server@8.15.5/AggregatorBase.java:232)
	at org.elasticsearch.search.aggregations.AdaptingAggregator.getLeafCollector(org.elasticsearch.server@8.15.5/AdaptingAggregator.java:86)
	at org.elasticsearch.search.aggregations.MultiBucketCollector.getLeafCollector(org.elasticsearch.server@8.15.5/MultiBucketCollector.java:174)
	at org.elasticsearch.search.aggregations.bucket.terms.GroupByAggregator$WithPrimaryField$SubAggsLeafBucketCollector.<init>(GroupByAggregator.java:861)
	at org.elasticsearch.search.aggregations.bucket.terms.GroupByAggregator$WithPrimaryField.computeDifferedSubAggregations(GroupByAggregator.java:531)
	at org.elasticsearch.search.aggregations.bucket.terms.GroupByAggregator$WithPrimaryField.buildAggregation(GroupByAggregator.java:500)
	at org.elasticsearch.search.aggregations.bucket.terms.GroupByAggregator$WithPrimaryField.buildAggregations(GroupByAggregator.java:479)
	at org.elasticsearch.search.aggregations.bucket.BucketsAggregator.buildSubAggsForBuckets(org.elasticsearch.server@8.15.5/BucketsAggregator.java:190)
	at org.elasticsearch.search.aggregations.bucket.BucketsAggregator.buildAggregationsForFixedBucketCount(org.elasticsearch.server@8.15.5/BucketsAggregator.java:323)
	at org.elasticsearch.search.aggregations.bucket.filter.FiltersAggregator.buildAggregations(org.elasticsearch.server@8.15.5/FiltersAggregator.java:215)
	at org.elasticsearch.search.aggregations.AdaptingAggregator.buildAggregations(org.elasticsearch.server@8.15.5/AdaptingAggregator.java:101)
	at org.elasticsearch.search.aggregations.Aggregator.buildTopLevel(org.elasticsearch.server@8.15.5/Aggregator.java:160)
	at org.elasticsearch.search.aggregations.AggregatorCollector.doPostCollection(org.elasticsearch.server@8.15.5/AggregatorCollector.java:47)
	at org.elasticsearch.search.query.QueryPhaseCollector.doPostCollection(org.elasticsearch.server@8.15.5/QueryPhaseCollector.java:379)
	at org.elasticsearch.search.internal.ContextIndexSearcher.doAggregationPostCollection(org.elasticsearch.server@8.15.5/ContextIndexSearcher.java:467)
	at org.elasticsearch.search.internal.ContextIndexSearcher.search(org.elasticsearch.server@8.15.5/ContextIndexSearcher.java:456)
	at org.elasticsearch.search.internal.ContextIndexSearcher.search(org.elasticsearch.server@8.15.5/ContextIndexSearcher.java:367)
	at org.elasticsearch.search.internal.ContextIndexSearcher.search(org.elasticsearch.server@8.15.5/ContextIndexSearcher.java:350)
	at org.elasticsearch.search.query.QueryPhase.addCollectorsAndSearch(org.elasticsearch.server@8.15.5/QueryPhase.java:225)
	at org.elasticsearch.search.query.QueryPhase.executeQuery(org.elasticsearch.server@8.15.5/QueryPhase.java:148)
	at org.elasticsearch.search.query.QueryPhase.execute(org.elasticsearch.server@8.15.5/QueryPhase.java:62)
	at org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(org.elasticsearch.server@8.15.5/SearchService.java:557)
	at org.elasticsearch.search.SearchService.executeQueryPhase(org.elasticsearch.server@8.15.5/SearchService.java:777)
	at org.elasticsearch.search.SearchService.lambda$executeQueryPhase$7(org.elasticsearch.server@8.15.5/SearchService.java:620)
	at org.elasticsearch.search.SearchService$$Lambda$8597/0x00000078026aed70.get(org.elasticsearch.server@8.15.5/Unknown Source)
	at org.elasticsearch.search.SearchService.runSync(org.elasticsearch.server@8.15.5/SearchService.java:725)
	at org.elasticsearch.search.SearchService.ensureAfterSeqNoRefreshed(org.elasticsearch.server@8.15.5/SearchService.java:635)
	at org.elasticsearch.search.SearchService.lambda$executeQueryPhase$8(org.elasticsearch.server@8.15.5/SearchService.java:620)
	at org.elasticsearch.search.SearchService$$Lambda$8565/0x00000078026a9bf8.accept(org.elasticsearch.server@8.15.5/Unknown Source)
	at org.elasticsearch.action.ActionListenerImplementations$DelegatingFailureActionListener.onResponse(org.elasticsearch.server@8.15.5/ActionListenerImplementations.java:217)
	at org.elasticsearch.search.SearchService.lambda$rewriteAndFetchShardRequest$25(org.elasticsearch.server@8.15.5/SearchService.java:1972)
	at org.elasticsearch.search.SearchService$$Lambda$8595/0x00000078026ae900.accept(org.elasticsearch.server@8.15.5/Unknown Source)
	at org.elasticsearch.index.shard.IndexShard.ensureShardSearchActive(org.elasticsearch.server@8.15.5/IndexShard.java:4285)
	at org.elasticsearch.search.SearchService.lambda$rewriteAndFetchShardRequest$26(org.elasticsearch.server@8.15.5/SearchService.java:1972)
	at org.elasticsearch.search.SearchService$$Lambda$8568/0x00000078026aa4d0.accept(org.elasticsearch.server@8.15.5/Unknown Source)
	at org.elasticsearch.action.ActionListenerImplementations$ResponseWrappingActionListener.onResponse(org.elasticsearch.server@8.15.5/ActionListenerImplementations.java:245)
	at org.elasticsearch.index.query.Rewriteable.rewriteAndFetch(org.elasticsearch.server@8.15.5/Rewriteable.java:109)
	at org.elasticsearch.index.query.Rewriteable.rewriteAndFetch(org.elasticsearch.server@8.15.5/Rewriteable.java:77)
	at org.elasticsearch.search.SearchService.rewriteAndFetchShardRequest(org.elasticsearch.server@8.15.5/SearchService.java:1978)
	at org.elasticsearch.search.SearchService.lambda$executeQueryPhase$9(org.elasticsearch.server@8.15.5/SearchService.java:623)
	at org.elasticsearch.search.SearchService$$Lambda$8566/0x00000078026a9e30.run(org.elasticsearch.server@8.15.5/Unknown Source)
	at org.elasticsearch.common.util.concurrent.ParentAware$ParentAwareRunnable.run(org.elasticsearch.server@8.15.5/ParentAware.java:37)
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(org.elasticsearch.server@8.15.5/TimedRunnable.java:33)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(org.elasticsearch.server@8.15.5/ThreadContext.java:984)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(org.elasticsearch.server@8.15.5/AbstractRunnable.java:26)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.9/ThreadPoolExecutor.java:1136)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.9/ThreadPoolExecutor.java:635)
	at java.lang.Thread.run(java.base@17.0.9/Thread.java:842)

and the task keeps running. After reading the code, it looks to me like a concurrency bug, though I am not entirely sure.

As I understand it, when two threads enter org.elasticsearch.search.aggregations.metrics.HyperLogLogPlusPlus#collect simultaneously and the hll.runLens ByteArray has only one empty slot, a race condition occurs: both threads can read the same slot as empty, one thread inserts and takes it, and the other thread continues probing. If the second thread's encoded value isn't already in the table, it will loop indefinitely because all slots are now occupied and it never finds its value or an empty slot.
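
To make that failure mode concrete, here is a minimal sketch of open addressing with linear probing. This is a toy under my own assumptions (SimpleLinearProbingSet and all of its fields are hypothetical names), not the real HyperLogLogPlusPlus code:

// Toy sketch of open addressing with linear probing; hypothetical names,
// not the Elasticsearch implementation.
final class SimpleLinearProbingSet {
    private final int[] slots;   // 0 marks an empty slot
    private final int mask;

    SimpleLinearProbingSet(int capacity) {   // capacity must be a power of two
        this.slots = new int[capacity];
        this.mask = capacity - 1;
    }

    // Single-writer contract: returns true if inserted, false if already present.
    boolean add(int value) {                 // value must be non-zero
        int index = value & mask;
        while (true) {
            int current = slots[index];
            if (current == 0) {              // empty slot found: claim it
                slots[index] = value;        // two racing threads can both reach this line
                return true;
            }
            if (current == value) {          // value is already in the table
                return false;
            }
            index = (index + 1) & mask;      // wrap around and keep probing
        }
    }
}

Once every slot is occupied and the value being added is not one of them, neither exit condition can ever become true, which matches a runnable thread pinned inside addEncoded.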

I am not able to reproduce this locally, but it happens deterministically in our QA environment.

The comment at org/elasticsearch/search/aggregations/metrics/HyperLogLogPlusPlus.java:147 says:

"It's safe to reuse lc's readSpare because we're single threaded"

However, I added debug logs that record the thread name in addEncoded, and it is clearly called from multiple threads.


Apparently you are using 8.15.5, which is old. Could you try with at least 8.19.7, or better 9.2.1?
The problem might have been solved in the meantime. :thinking:


Yes, I compared the code before posting here; no change is visible that directly addresses this issue. I will still try it, though.
Thank you.

I added logs to debug this issue further, inside org.elasticsearch.search.aggregations.metrics.HyperLogLogPlusPlus.LinearCounting#addEncoded:

// debug logging added locally at the start of addEncoded
final int currentSize = size(bucketOrd);
final int actualOccupiedSlots = recomputedSize(bucketOrd);
logger.error("[HLL_DEBUG] addEncoded_START: thread={}, index={}, bucketOrd={}, encoded={}, startIndex={}, currentSize={}, actualOccupiedSlots={}, threshold={}, capacity={}, mask={}, loadFactor={}",
    threadName, indexName, bucketOrd, encoded, startIndex, currentSize, actualOccupiedSlots,
    threshold, capacity, mask,
    capacity > 0 ? (double) actualOccupiedSlots / capacity : 0.0);

I got the following log:
[2025-11-25T09:58:40,289][ERROR][o.e.s.a.m.HyperLogLogPlusPlus] [es-data-pz-listening4-es7-125-1b] [HLL_DEBUG] addEncoded_START: thread=elasticsearch[es-data-pz-listening4-es7-125-1b][search][T#7], index=lst_p1738_v_2_20250527_0330, bucketOrd=0, encoded=29173840, startIndex=80, currentSize=1473, actualOccupiedSlots=2048, threshold=1536, capacity=2048, mask=2047, loadFactor=1.0

According to this, all 2048 slots are occupied, yet the structure has not been upgraded to the HLL representation, so this thread remains stuck in the circular infinite loop of linear probing, even though the maximum load factor is 0.75.
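
Spelling out the numbers from that log line (the upgrade rule below is my assumption based on the reported 0.75 load factor, not a quote of the real LinearCounting code), a racy size counter would explain how the table fills up completely without ever crossing the upgrade threshold:

class HllThresholdArithmetic {
    public static void main(String[] args) {
        // Values copied from the debug log above; the upgrade rule is an
        // assumption for illustration, not the actual LinearCounting code.
        int capacity = 2048;
        int threshold = (int) (capacity * 0.75);    // 1536
        int reportedSize = 1473;                    // the counter the threshold check sees
        int occupiedSlots = 2048;                   // slots that are actually in use

        // If concurrent increments are lost, the counter never crosses the
        // threshold, so the upgrade to the HLL representation never happens...
        System.out.println("upgrade triggered: " + (reportedSize >= threshold));   // false

        // ...yet the table is completely full, so probing for a value that is
        // not present can never find an empty slot and never terminates.
        System.out.println("free slot available: " + (occupiedSlots < capacity));  // false
    }
}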


The HyperLogLogPlusPlus class is not thread safe, so sharing it between two threads is a bug. The stack trace you shared contains the following frames:

	at org.elasticsearch.search.aggregations.bucket.terms.GroupByAggregator$WithPrimaryField$SubAggsLeafBucketCollector.<init>(GroupByAggregator.java:861)
	at org.elasticsearch.search.aggregations.bucket.terms.GroupByAggregator$WithPrimaryField.computeDifferedSubAggregations(GroupByAggregator.java:531)
	at org.elasticsearch.search.aggregations.bucket.terms.GroupByAggregator$WithPrimaryField.buildAggregation(GroupByAggregator.java:500)
	at org.elasticsearch.search.aggregations.bucket.terms.GroupByAggregator$WithPrimaryField.buildAggregations(GroupByAggregator.java:479)

This GroupByAggregator does not come from the Elasticsearch codebase, so it seems to me that you are using a modified version. I wonder if that aggregator is doing something wrong with concurrency.
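
For what it's worth, the usual fix when an accumulator is not thread safe is to give each collecting thread its own instance and merge the results on a single thread afterwards. This is only a generic Java sketch with hypothetical names (a HashSet stands in for the cardinality structure), not the Elasticsearch or GroupByAggregator API:

import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Generic sketch: each worker thread fills its own non-thread-safe
// accumulator, and the merge step runs on a single thread.
final class ThreadConfinedCardinality {
    public static void main(String[] args) throws Exception {
        List<long[]> partitions = List.of(
            new long[] {1, 2, 3, 4},
            new long[] {3, 4, 5, 6});

        ExecutorService pool = Executors.newFixedThreadPool(partitions.size());
        List<Future<Set<Long>>> futures = partitions.stream()
            .map(part -> pool.submit(() -> {
                Set<Long> local = new HashSet<>();    // thread-confined, never shared
                for (long v : part) {
                    local.add(v);
                }
                return local;
            }))
            .toList();

        Set<Long> merged = new HashSet<>();           // merge on a single thread
        for (Future<Set<Long>> future : futures) {
            merged.addAll(future.get());
        }
        pool.shutdown();
        System.out.println("distinct values: " + merged.size());   // prints 6
    }
}

If the custom aggregator instead hands the same HyperLogLogPlusPlus instance (or the same bucket ordinals) to several search threads, the kind of endless probe loop described above becomes possible.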
