Probable Bug in HyperLogLogPlusPlus

Hello everyone,
I am running into an issue with a certain aggregation query when the cardinality is high.
The search thread appears to be stuck at this trace:

"elasticsearch[es-data-pz-listening4-es7-125-1b][search][T#12]" #167 daemon prio=5 os_prio=0 cpu=406513.48ms elapsed=2727.28s tid=0x0000aaab04deb130 nid=0x2d28bf runnable  [0x0000ffcf471fd000]
   java.lang.Thread.State: RUNNABLE
	at org.elasticsearch.search.aggregations.metrics.HyperLogLogPlusPlus$LinearCounting.addEncoded(org.elasticsearch.server@8.15.5/HyperLogLogPlusPlus.java:346)
	at org.elasticsearch.search.aggregations.metrics.AbstractLinearCounting.collect(org.elasticsearch.server@8.15.5/AbstractLinearCounting.java:48)
	at org.elasticsearch.search.aggregations.metrics.HyperLogLogPlusPlus.collect(org.elasticsearch.server@8.15.5/HyperLogLogPlusPlus.java:136)
	at org.elasticsearch.search.aggregations.metrics.CardinalityAggregator$OrdinalsCollector.postCollect(org.elasticsearch.server@8.15.5/CardinalityAggregator.java:352)
	at org.elasticsearch.search.aggregations.metrics.CardinalityAggregator.postCollectLastCollector(org.elasticsearch.server@8.15.5/CardinalityAggregator.java:157)
	at org.elasticsearch.search.aggregations.metrics.CardinalityAggregator.getLeafCollector(org.elasticsearch.server@8.15.5/CardinalityAggregator.java:148)
	at org.elasticsearch.search.aggregations.AggregatorBase.getLeafCollector(org.elasticsearch.server@8.15.5/AggregatorBase.java:233)
	at org.elasticsearch.search.aggregations.MultiBucketCollector$1.getLeafCollector(org.elasticsearch.server@8.15.5/MultiBucketCollector.java:92)
	at org.elasticsearch.search.aggregations.AggregatorBase.getLeafCollector(org.elasticsearch.server@8.15.5/AggregatorBase.java:232)
	at org.elasticsearch.search.aggregations.AdaptingAggregator.getLeafCollector(org.elasticsearch.server@8.15.5/AdaptingAggregator.java:86)
	at org.elasticsearch.search.aggregations.MultiBucketCollector.getLeafCollector(org.elasticsearch.server@8.15.5/MultiBucketCollector.java:174)
	at org.elasticsearch.search.aggregations.bucket.terms.GroupByAggregator$WithPrimaryField$SubAggsLeafBucketCollector.<init>(GroupByAggregator.java:861)
	at org.elasticsearch.search.aggregations.bucket.terms.GroupByAggregator$WithPrimaryField.computeDifferedSubAggregations(GroupByAggregator.java:531)
	at org.elasticsearch.search.aggregations.bucket.terms.GroupByAggregator$WithPrimaryField.buildAggregation(GroupByAggregator.java:500)
	at org.elasticsearch.search.aggregations.bucket.terms.GroupByAggregator$WithPrimaryField.buildAggregations(GroupByAggregator.java:479)
	at org.elasticsearch.search.aggregations.bucket.BucketsAggregator.buildSubAggsForBuckets(org.elasticsearch.server@8.15.5/BucketsAggregator.java:190)
	at org.elasticsearch.search.aggregations.bucket.BucketsAggregator.buildAggregationsForFixedBucketCount(org.elasticsearch.server@8.15.5/BucketsAggregator.java:323)
	at org.elasticsearch.search.aggregations.bucket.filter.FiltersAggregator.buildAggregations(org.elasticsearch.server@8.15.5/FiltersAggregator.java:215)
	at org.elasticsearch.search.aggregations.AdaptingAggregator.buildAggregations(org.elasticsearch.server@8.15.5/AdaptingAggregator.java:101)
	at org.elasticsearch.search.aggregations.Aggregator.buildTopLevel(org.elasticsearch.server@8.15.5/Aggregator.java:160)
	at org.elasticsearch.search.aggregations.AggregatorCollector.doPostCollection(org.elasticsearch.server@8.15.5/AggregatorCollector.java:47)
	at org.elasticsearch.search.query.QueryPhaseCollector.doPostCollection(org.elasticsearch.server@8.15.5/QueryPhaseCollector.java:379)
	at org.elasticsearch.search.internal.ContextIndexSearcher.doAggregationPostCollection(org.elasticsearch.server@8.15.5/ContextIndexSearcher.java:467)
	at org.elasticsearch.search.internal.ContextIndexSearcher.search(org.elasticsearch.server@8.15.5/ContextIndexSearcher.java:456)
	at org.elasticsearch.search.internal.ContextIndexSearcher.search(org.elasticsearch.server@8.15.5/ContextIndexSearcher.java:367)
	at org.elasticsearch.search.internal.ContextIndexSearcher.search(org.elasticsearch.server@8.15.5/ContextIndexSearcher.java:350)
	at org.elasticsearch.search.query.QueryPhase.addCollectorsAndSearch(org.elasticsearch.server@8.15.5/QueryPhase.java:225)
	at org.elasticsearch.search.query.QueryPhase.executeQuery(org.elasticsearch.server@8.15.5/QueryPhase.java:148)
	at org.elasticsearch.search.query.QueryPhase.execute(org.elasticsearch.server@8.15.5/QueryPhase.java:62)
	at org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(org.elasticsearch.server@8.15.5/SearchService.java:557)
	at org.elasticsearch.search.SearchService.executeQueryPhase(org.elasticsearch.server@8.15.5/SearchService.java:777)
	at org.elasticsearch.search.SearchService.lambda$executeQueryPhase$7(org.elasticsearch.server@8.15.5/SearchService.java:620)
	at org.elasticsearch.search.SearchService$$Lambda$8597/0x00000078026aed70.get(org.elasticsearch.server@8.15.5/Unknown Source)
	at org.elasticsearch.search.SearchService.runSync(org.elasticsearch.server@8.15.5/SearchService.java:725)
	at org.elasticsearch.search.SearchService.ensureAfterSeqNoRefreshed(org.elasticsearch.server@8.15.5/SearchService.java:635)
	at org.elasticsearch.search.SearchService.lambda$executeQueryPhase$8(org.elasticsearch.server@8.15.5/SearchService.java:620)
	at org.elasticsearch.search.SearchService$$Lambda$8565/0x00000078026a9bf8.accept(org.elasticsearch.server@8.15.5/Unknown Source)
	at org.elasticsearch.action.ActionListenerImplementations$DelegatingFailureActionListener.onResponse(org.elasticsearch.server@8.15.5/ActionListenerImplementations.java:217)
	at org.elasticsearch.search.SearchService.lambda$rewriteAndFetchShardRequest$25(org.elasticsearch.server@8.15.5/SearchService.java:1972)
	at org.elasticsearch.search.SearchService$$Lambda$8595/0x00000078026ae900.accept(org.elasticsearch.server@8.15.5/Unknown Source)
	at org.elasticsearch.index.shard.IndexShard.ensureShardSearchActive(org.elasticsearch.server@8.15.5/IndexShard.java:4285)
	at org.elasticsearch.search.SearchService.lambda$rewriteAndFetchShardRequest$26(org.elasticsearch.server@8.15.5/SearchService.java:1972)
	at org.elasticsearch.search.SearchService$$Lambda$8568/0x00000078026aa4d0.accept(org.elasticsearch.server@8.15.5/Unknown Source)
	at org.elasticsearch.action.ActionListenerImplementations$ResponseWrappingActionListener.onResponse(org.elasticsearch.server@8.15.5/ActionListenerImplementations.java:245)
	at org.elasticsearch.index.query.Rewriteable.rewriteAndFetch(org.elasticsearch.server@8.15.5/Rewriteable.java:109)
	at org.elasticsearch.index.query.Rewriteable.rewriteAndFetch(org.elasticsearch.server@8.15.5/Rewriteable.java:77)
	at org.elasticsearch.search.SearchService.rewriteAndFetchShardRequest(org.elasticsearch.server@8.15.5/SearchService.java:1978)
	at org.elasticsearch.search.SearchService.lambda$executeQueryPhase$9(org.elasticsearch.server@8.15.5/SearchService.java:623)
	at org.elasticsearch.search.SearchService$$Lambda$8566/0x00000078026a9e30.run(org.elasticsearch.server@8.15.5/Unknown Source)
	at org.elasticsearch.common.util.concurrent.ParentAware$ParentAwareRunnable.run(org.elasticsearch.server@8.15.5/ParentAware.java:37)
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(org.elasticsearch.server@8.15.5/TimedRunnable.java:33)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(org.elasticsearch.server@8.15.5/ThreadContext.java:984)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(org.elasticsearch.server@8.15.5/AbstractRunnable.java:26)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.9/ThreadPoolExecutor.java:1136)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.9/ThreadPoolExecutor.java:635)
	at java.lang.Thread.run(java.base@17.0.9/Thread.java:842)

The task keeps running indefinitely. From reading the code it looks to me like a concurrency bug, though I am not entirely sure.

As I understand it, when two threads enter org.elasticsearch.search.aggregations.metrics.HyperLogLogPlusPlus#collect simultaneously and the hll.runLens ByteArray has only one empty slot, a race condition occurs: both threads can read the same slot as empty, one thread inserts and takes it, and the other thread continues probing. If the second thread's encoded value isn't already in the table, it will loop indefinitely because all slots are now occupied and it never finds its value or an empty slot.
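
To make the failure mode I suspect concrete, here is a minimal sketch of an open-addressed, linear-probing insert of the kind addEncoded performs. The class and names below are mine for illustration, not the actual Elasticsearch code (which works on a ByteArray rather than an int[]):

```java
// Minimal sketch of an open-addressed, linear-probing insert of the kind
// LinearCounting.addEncoded performs. Names are illustrative only; this is
// not the Elasticsearch implementation, which works on a ByteArray.
final class ProbingSetSketch {

    private final int[] slots; // 0 marks an empty slot; encoded values are assumed non-zero
    private final int mask;

    ProbingSetSketch(int capacity) { // capacity must be a power of two
        this.slots = new int[capacity];
        this.mask = capacity - 1;
    }

    /**
     * Inserts {@code encoded} if absent. Safe only when single threaded:
     * if another thread fills the last empty slot while this thread is
     * probing, and {@code encoded} is not already in the table, the loop
     * never sees an empty slot or its own value again and spins forever.
     * Two writers racing on the same slot can also silently lose an insert.
     */
    int addEncoded(int encoded) {
        int index = encoded & mask;
        while (true) {
            int value = slots[index];      // plain, unsynchronized read
            if (value == 0) {
                slots[index] = encoded;    // plain, unsynchronized write
                return index;
            }
            if (value == encoded) {
                return -1;                 // already present
            }
            index = (index + 1) & mask;    // probe the next slot
        }
    }
}
```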

I am not able to reproduce this locally, but it happens deterministically in our QA environment.

The comment at org/elasticsearch/search/aggregations/metrics/HyperLogLogPlusPlus.java:147 says:

"It's safe to reuse lc's readSpare because we're single threaded"

However, I added debug logging of the thread name in addEncoded, and it is clearly being called from multiple threads.
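
For illustration, the instrumentation I mean is nothing more than a line of this shape (the logger below is a placeholder, not Elasticsearch's logging setup):

```java
// Illustrative only: the kind of one-line instrumentation I mean. The logger
// here is a java.util.logging placeholder, not Elasticsearch's own logging.
import java.util.logging.Logger;

final class AddEncodedTraceSketch {

    private static final Logger LOGGER = Logger.getLogger("addEncoded-trace");

    // A call like this inside addEncoded prints the calling thread name; in my
    // debug build the names show more than one search thread reaching the method.
    static void logCallingThread(int encoded) {
        LOGGER.info("addEncoded: thread=" + Thread.currentThread().getName()
                + " encoded=" + encoded);
    }
}
```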

Apparently you are using 8.15.5, which is old. Could you try with at least 8.19.7, or better 9.2.1?
The problem might have been solved in the meantime. :thinking:

Yes, I compared the code before posting here and could not see any change that directly addresses this issue, but I will give it a try.
Thank you.