We recently moved from 1.7 to 7.2 and I was running some aggregations to verify a few statistics
{
"size": 0,
"aggs": {
"avgcsat": {
"avg": {
"field": "csat"
}
}
}
}
The csat
field is an integer. Doc values are enabled for it by default (i.e. we never set it explicitly)
The same query when executed on 1.7 returned the result in almost 600ms. In 7.2 the query is not even returning and I see an increase in load & CPU on data nodes
We have around 2TB of data (~220495625 documents)
This field only contains 5 type of values 0,1,2,3,4,5
We have one other field called state
which is of long data type, and when I did an avg aggregation on it, the results were returned instantaneously. Some issue with integer data type it seems then?
The csat
field is a sparse field as compared to the state
field
Mapping of the two fields
{
"state": {
"type": "long",
"store": true,
"null_value": 0
}
}
{
"csat": {
"type": "integer",
"store": true
}
}
Update
Not only avg, but even term aggregation, histogram is taking a lot of time for csat
as compared to state
Here is what hot thread is showing for one node (its same for every other node)
96.5% (482.4ms out of 500ms) cpu usage by thread 'elasticsearch[es7advcl02-14][search][T#8]'
10/10 snapshots sharing following 39 elements
app//org.apache.lucene.codecs.lucene80.IndexedDISI.advanceExact(IndexedDISI.java:399)
app//org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$SparseNumericDocValues.advanceExact(Lucene80DocValuesProducer.java:424)
app//org.elasticsearch.index.fielddata.FieldData$DoubleCastedValues.advanceExact(FieldData.java:446)
app//org.elasticsearch.index.fielddata.SingletonSortedNumericDoubleValues.advanceExact(SingletonSortedNumericDoubleValues.java:44)
app//org.elasticsearch.search.aggregations.metrics.AvgAggregator$1.collect(AvgAggregator.java:83)
app//org.elasticsearch.search.aggregations.LeafBucketCollector.collect(LeafBucketCollector.java:82)
app//org.apache.lucene.search.MatchAllDocsQuery$1$1.score(MatchAllDocsQuery.java:64)
app//org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
app//org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:652)
app//org.apache.lucene.search.XIndexSearcher.search(XIndexSearcher.java:44)
app//org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:177)
app//org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:443)
app//org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:271)
app//org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:114)
app//org.elasticsearch.indices.IndicesService.lambda$loadIntoContext$18(IndicesService.java:1305)
app//org.elasticsearch.indices.IndicesService$$Lambda$4388/0x0000000802064840.accept(Unknown Source)
app//org.elasticsearch.indices.IndicesService.lambda$cacheShardLevelResult$19(IndicesService.java:1362)
app//org.elasticsearch.indices.IndicesService$$Lambda$4389/0x0000000802064c40.get(Unknown Source)
app//org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:174)
app//org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:157)
app//org.elasticsearch.common.cache.Cache.computeIfAbsent(Cache.java:433)
app//org.elasticsearch.indices.IndicesRequestCache.getOrCompute(IndicesRequestCache.java:123)
app//org.elasticsearch.indices.IndicesService.cacheShardLevelResult(IndicesService.java:1368)
app//org.elasticsearch.indices.IndicesService.loadIntoContext(IndicesService.java:1302)
app//org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:333)
app//org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:360)
app//org.elasticsearch.search.SearchService.lambda$executeQueryPhase$1(SearchService.java:340)
app//org.elasticsearch.search.SearchService$$Lambda$4236/0x0000000802024040.apply(Unknown Source)
app//org.elasticsearch.action.ActionListener.lambda$map$2(ActionListener.java:145)
app//org.elasticsearch.action.ActionListener$$Lambda$3643/0x0000000801dab040.accept(Unknown Source)
app//org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62)
app//org.elasticsearch.search.SearchService$2.doRun(SearchService.java:1052)
app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
app//org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44)
app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:758)
app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.base@12.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
java.base@12.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
java.base@12.0.1/java.lang.Thread.run(Thread.java:835)
One other observation
Since I mentioned csat is a sparse field, i added exists filter in my query to only consider docs that have csat field set. The query execution time reduce drastically, but is the right way to go? Why aggregation is slow if i do not provide exists filter? I'm also not sure if document count is playing a role here
First I aggregated just using
- query A which had X docs (slow)
- query A + exists filter which had Y docs. X > Y (fast)
- query B which has M docs M > X (much much greater) (slowww)
- query B + exists filter which has N docs M > N > X > Y (fast)
As you see, if more docs indeed was the culprit then query 4 should've been slow too right? But it was fast