Elasticsearch avg aggregation is taking ages to return in 7.2

We recently moved from 1.7 to 7.2 and I was running some aggregations to verify a few statistics

{
  "size": 0,
  "aggs": {
    "avgcsat": {
      "avg": {
        "field": "csat"
      }
    }
  }
}

The csat field is an integer. Doc values are enabled for it by default (i.e. we never set it explicitly)

The same query when executed on 1.7 returned the result in almost 600ms. In 7.2 the query is not even returning and I see an increase in load & CPU on data nodes

We have around 2TB of data (~220495625 documents)

This field only contains 5 type of values 0,1,2,3,4,5

We have one other field called state which is of long data type, and when I did an avg aggregation on it, the results were returned instantaneously. Some issue with integer data type it seems then?

The csat field is a sparse field as compared to the state field

Mapping of the two fields

{
  "state": {
    "type": "long",
    "store": true,
    "null_value": 0
  }
}

{
  "csat": {
    "type": "integer",
    "store": true
  }
}

Update

Not only avg, but even term aggregation, histogram is taking a lot of time for csat as compared to state

Here is what hot thread is showing for one node (its same for every other node)

96.5% (482.4ms out of 500ms) cpu usage by thread 'elasticsearch[es7advcl02-14][search][T#8]'
     10/10 snapshots sharing following 39 elements
       app//org.apache.lucene.codecs.lucene80.IndexedDISI.advanceExact(IndexedDISI.java:399)
       app//org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$SparseNumericDocValues.advanceExact(Lucene80DocValuesProducer.java:424)
       app//org.elasticsearch.index.fielddata.FieldData$DoubleCastedValues.advanceExact(FieldData.java:446)
       app//org.elasticsearch.index.fielddata.SingletonSortedNumericDoubleValues.advanceExact(SingletonSortedNumericDoubleValues.java:44)
       app//org.elasticsearch.search.aggregations.metrics.AvgAggregator$1.collect(AvgAggregator.java:83)
       app//org.elasticsearch.search.aggregations.LeafBucketCollector.collect(LeafBucketCollector.java:82)
       app//org.apache.lucene.search.MatchAllDocsQuery$1$1.score(MatchAllDocsQuery.java:64)
       app//org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
       app//org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:652)
       app//org.apache.lucene.search.XIndexSearcher.search(XIndexSearcher.java:44)
       app//org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:177)
       app//org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:443)
       app//org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:271)
       app//org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:114)
       app//org.elasticsearch.indices.IndicesService.lambda$loadIntoContext$18(IndicesService.java:1305)
       app//org.elasticsearch.indices.IndicesService$$Lambda$4388/0x0000000802064840.accept(Unknown Source)
       app//org.elasticsearch.indices.IndicesService.lambda$cacheShardLevelResult$19(IndicesService.java:1362)
       app//org.elasticsearch.indices.IndicesService$$Lambda$4389/0x0000000802064c40.get(Unknown Source)
       app//org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:174)
       app//org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:157)
       app//org.elasticsearch.common.cache.Cache.computeIfAbsent(Cache.java:433)
       app//org.elasticsearch.indices.IndicesRequestCache.getOrCompute(IndicesRequestCache.java:123)
       app//org.elasticsearch.indices.IndicesService.cacheShardLevelResult(IndicesService.java:1368)
       app//org.elasticsearch.indices.IndicesService.loadIntoContext(IndicesService.java:1302)
       app//org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:333)
       app//org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:360)
       app//org.elasticsearch.search.SearchService.lambda$executeQueryPhase$1(SearchService.java:340)
       app//org.elasticsearch.search.SearchService$$Lambda$4236/0x0000000802024040.apply(Unknown Source)
       app//org.elasticsearch.action.ActionListener.lambda$map$2(ActionListener.java:145)
       app//org.elasticsearch.action.ActionListener$$Lambda$3643/0x0000000801dab040.accept(Unknown Source)
       app//org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62)
       app//org.elasticsearch.search.SearchService$2.doRun(SearchService.java:1052)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       app//org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:758)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.base@12.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
       java.base@12.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
       java.base@12.0.1/java.lang.Thread.run(Thread.java:835)

One other observation
Since I mentioned csat is a sparse field, i added exists filter in my query to only consider docs that have csat field set. The query execution time reduce drastically, but is the right way to go? Why aggregation is slow if i do not provide exists filter? I'm also not sure if document count is playing a role here

First I aggregated just using

  1. query A which had X docs (slow)
  2. query A + exists filter which had Y docs. X > Y (fast)
  3. query B which has M docs M > X (much much greater) (slowww)
  4. query B + exists filter which has N docs M > N > X > Y (fast)

As you see, if more docs indeed was the culprit then query 4 should've been slow too right? But it was fast

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.