Nested aggregation of 7 documents takes 1 min

I have the following query where

  1. the query returns 7 hits and, when executed without aggs, completes in 62 ms.
  2. removing the nested "hour_buckets" agg, it completes in 180 ms.
  3. but executing the nested aggs takes about 1 minute.

Does the query below aggregate over only the 7 matching documents, or is it doing something else?
Is there any way to improve this?

mapping

{
	"user_id":{"type":"keyword"},
	"connected":{"type":"keyword"},
	"start_time":{"type":"date","format":"yyyyMMddHHmmss.SSS"},
	"record_id": {"type":"keyword"}
}

query: my-index/_search?size=0

{
	"query": {
		"bool": {
			"filter": [<filters...>]
		}
	},
	"aggs": {
		"users": {
			"terms": {
				"size": 2147483647,
				"field": "user_id"
			},
			"aggs": {
				"hour_buckets": {
					"aggs": {
						"Missed": {
							"filter": {
								"bool": {
									"must_not": {
										"term": {
											"connected": "Yes"
										}
									}
								}
							},
							"aggs": {
								"Missed": {
									"sum": {
										"field": "inbound"
									}
								}
							}
						},
						"Answered": {
							"filter": {
								"term": {
									"connected": "Yes"
								}
							},
							"aggs": {
								"Answered": {
									"value_count": {
										"field": "record_id"
									}
								}
							}
						}
					},
					"terms": {
						"min_doc_count": 0,
						"script": "doc['start_time'].date.hourOfDay",
						"size": 24
					}
				}
			}
		}
	}
}

Try setting the size field in the terms aggregation to something more reasonable and see if that makes any difference.
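For example, a sketch of that change using the field names from the query above (10,000 is an arbitrary but far more reasonable cap than 2147483647):

```json
"aggs": {
	"users": {
		"terms": {
			"size": 10000,
			"field": "user_id"
		}
	}
}
```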

@Christian_Dahlqvist, thank you for your advice. I changed the size value from 2147483647 to 10,000, 1,000, and 100, but it still took 54-56 seconds. Anything else I can try?

We could speculate, but next I'd try dropping different sub-branches of the hour_buckets aggregation to identify any hotspots.

@Mark_Harwood, I did. I dropped "Missed" only and "Answered" only, but no change (53 seconds plus).
It only changes when I drop the entire "aggs" of "hour_buckets" (75 ms).
Do you know whether ES aggregates over only the 7 hits the query returns? It certainly doesn't seem so (I'm only guessing from how long the agg takes to complete).
I also tried adding the range filter from the query clause to the "Answered" and "Missed" filters, but no change there either (56 seconds with the additional filter inside the nested agg).

What happens if you drop both, leaving only the scripted terms agg?
I'd have expected script execution times to be a multiple of the number of matching docs, not the number of docs in the index.
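That stripped-down test would look something like this, keeping only the scripted terms agg from the original query (outer size reduced per the earlier suggestion):

```json
{
	"aggs": {
		"users": {
			"terms": { "size": 100, "field": "user_id" },
			"aggs": {
				"hour_buckets": {
					"terms": {
						"min_doc_count": 0,
						"script": "doc['start_time'].date.hourOfDay",
						"size": 24
					}
				}
			}
		}
	}
}
```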

@Mark_Harwood, when I dropped both, leaving only the scripted "terms" agg, it took 56 seconds, so this one seems to be the culprit? Based on your expectation (script execution time being a multiple of the number of matching docs, not the number of docs in the index), it should still be fast, as there are only 7 matching docs :thinking:

Agreed. Each of these 7 docs only has one user and one timestamp, right? Otherwise a doc with arrays full of user ids would create many user buckets.

@Mark_Harwood, correct. Each of these 7 docs has only one user and one "start_time" value.
When I changed "script" to "field" and picked some keyword field in "terms", the execution time was 41 ms.
But I can now confirm your expectation. We store data in multiple indices (created daily), and I queried each individually:

/day1_index/
total doc in index: 2,913,609
match: 0 docs
execution time: 45 ms

/day2_index/
total doc in index: 2,878,023
match: 2 docs
execution time: 20 seconds

/day3_index/
total doc in index: 2,906,909
match: 5 docs
execution time: 50 seconds

So the question is why the script execution is so slow... Any idea?

Hmm. Can you get a hot threads dump while the slow query is running?

Another thing to try is to see whether it is scripting in general or the combination of dates and scripts - can you try getting the script to return a keyword field like the record_id and time that?

It appears to be scripting in general. I changed from "doc['start_time'].date.hourOfDay" to "doc['record_id']" and it took 56 seconds (no change). I also changed to "doc['start_time'].date.hour" and this one took 28 seconds. Still not good enough, but a big improvement...

Here is the hot threads dump, taken at about the 15-second mark of the query's execution: https://drive.google.com/file/d/1jZQJLd2vWYi-U3-c9AE-ftN1N9jTDkoV/view?usp=sharing

Thanks for this. The hot threads look like monitoring activity is hammering the CPU so there may be some contention there. Do you have something calling the stats API aggressively? Our own monitoring services can be behind this so you might want to dial back the frequency of checks in the settings.

Putting aside the monitoring overheads, this may have taken time for different reasons - it could be loading global ordinals. Can you run this terms aggregation on the record_id field, set execution_hint to "map", and time that?
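Something along these lines (the agg name here is arbitrary; "map" builds buckets from term values per-request instead of relying on global ordinals):

```json
{
	"size": 0,
	"aggs": {
		"record_ids": {
			"terms": {
				"field": "record_id",
				"execution_hint": "map"
			}
		}
	}
}
```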

Also, what version of elasticsearch are you running? There was an issue with logging deprecation warnings around use of joda time which got fixed around 6.7/7.2

The version of elasticsearch is 6.8.2. So it sounds like the version we run has the "fix" in.

Adding "execution_hint": "map" to "script": "doc['record_id']" did not change the execution time.

I ran /usr/share/elasticsearch/bin/elasticsearch-plugin list but nothing was returned, so monitoring does not seem to be enabled :frowning:

I hope I have not exhausted the options yet...
I saw one suggestion to store "hour of the day" as a separate field to aggregate on, but the question still remains why a scripted terms aggregation over only 7 matched records would take nearly one minute.
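For reference, that workaround would mean indexing an extra field at ingest time (the name hour_of_day is hypothetical) and aggregating on it with "field" instead of a script:

```json
{
	"aggs": {
		"hour_buckets": {
			"terms": {
				"field": "hour_of_day",
				"min_doc_count": 0,
				"size": 24
			}
		}
	}
}
```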

@Mark_Harwood, sorry for taking up your time... one last question - it appears that getHour and getHourOfDay return the same result, but I could not find documentation to confirm it. Do you know if that is the case?

Thank you,

Not sure about that but it should be in the reference docs.

I meant for that test to be done on the terms agg without a script - so with "field": "record_id" instead.

Maybe you have something else gathering stats through the stats api?

Oh sorry, yes I have done it without script and it completed in 117 ms. So definitely script is the culprit.

Ok - so maybe it’s CPU contention with all the stats gathering stuff we saw in the hot threads dump. Are these nodes running hot when you’re not busy doing these scripted queries?

I checked the hot threads dump both while running the query and while not.
When not running the query, every node's CPU usage is near 1%. As soon as I run the query, 2 nodes hit over 80%, and it is clearly my script running:

84.0% (419.7ms out of 500ms) cpu usage by thread 'elasticsearch[analytics-elastic12-s01][search][T#5]'
     10/10 snapshots sharing following 31 elements
       java.security.AccessController.doPrivileged(Native Method)
       org.elasticsearch.index.fielddata.ScriptDocValues.deprecated(ScriptDocValues.java:126)
       org.elasticsearch.index.fielddata.ScriptDocValues$Dates.getDate(ScriptDocValues.java:277)
       java.lang.invoke.LambdaForm$DMH/1144748369.invokeSpecial_L_L(LambdaForm$DMH)
       java.lang.invoke.LambdaForm$BMH/1608103523.reinvoke(LambdaForm$BMH)
       java.lang.invoke.LambdaForm$MH/379110473.delegate(LambdaForm$MH)
       java.lang.invoke.LambdaForm$MH/1293090980.guard(LambdaForm$MH)
       java.lang.invoke.LambdaForm$MH/1242711236.linkToCallSite(LambdaForm$MH)
       org.elasticsearch.painless.PainlessScript$Script.execute(doc['start_time'].date.getHour():18)
       org.elasticsearch.painless.ScriptImpl.run(ScriptImpl.java:104)
       org.elasticsearch.search.aggregations.support.values.ScriptBytesValues.advanceExact(ScriptBytesValues.java:56)
       org.elasticsearch.search.aggregations.bucket.terms.StringTermsAggregator.buildAggregation(StringTermsAggregator.java:119)
       org.elasticsearch.search.aggregations.AggregatorFactory$MultiBucketAggregatorWrapper.buildAggregation(AggregatorFactory.java:151)
       org.elasticsearch.search.aggregations.bucket.BestBucketsDeferringCollector$2.buildAggregation(BestBucketsDeferringCollector.java:213)
       org.elasticsearch.search.aggregations.bucket.BucketsAggregator.bucketAggregations(BucketsAggregator.java:141)
       org.elasticsearch.search.aggregations.bucket.terms.GlobalOrdinalsStringTermsAggregator.buildAggregation(GlobalOrdinalsStringTermsAggregator.java:238)
       org.elasticsearch.search.aggregations.AggregationPhase.execute(AggregationPhase.java:130)
       org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:120)
       org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:350)
       org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:394)
       org.elasticsearch.search.SearchService.access$100(SearchService.java:126)
       org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:359)
       org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:355)
       org.elasticsearch.search.SearchService$4.doRun(SearchService.java:1107)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)

Note: I did find the getHour vs getHourOfDay docs.
The Shared API for package java.time | Painless Scripting Language [7.5] | Elastic page takes you to the function definitions, and they are identical. In fact, Breaking changes in 7.0 | Elasticsearch Guide [7.5] | Elastic says that

getHourOfDay() should be replaced with getHour()

I see references to field data and deprecation - can you share the full mapping for the date field?

Yes, so I switched from getHourOfDay() to getHour(), which for some reason cut the time in half (from 60 sec to 30 sec).

Here is the mapping of the index:

{
  "app-20200201" : {
    "mappings" : {
      "app" : {
        "properties" : {
          "answer_time" : {
            "type" : "date",
            "format" : "yyyyMMddHHmmss.SSS"
          },
          "area_code_location" : {
            "type" : "keyword"
          },
          "call_duration_seconds" : {
            "type" : "integer",
            "null_value" : 0
          },
          "call_recording_result" : {
            "type" : "keyword"
          },
          "called_area_code" : {
            "type" : "keyword"
          },
          "called_number" : {
            "type" : "keyword"
          },
          "called_number_state" : {
            "type" : "keyword"
          },
          "calling_area_code" : {
            "type" : "keyword"
          },
          "calling_number" : {
            "type" : "keyword"
          },
          "calling_number_state" : {
            "type" : "keyword"
          },
          "connected" : {
            "type" : "keyword"
          },
          "customer_area_code" : {
            "type" : "keyword"
          },
          "customer_area_code_state" : {
            "type" : "keyword"
          },
          "customer_number" : {
            "type" : "keyword"
          },
          "direction" : {
            "type" : "keyword"
          },
          "end_time" : {
            "type" : "date",
            "format" : "yyyyMMddHHmmss.SSS"
          },
          "enterprise_id" : {
            "type" : "keyword"
          },
          "external" : {
            "type" : "integer"
          },
          "group_id" : {
            "type" : "keyword"
          },
          "inbound" : {
            "type" : "integer"
          },
          "inbound_duration_seconds" : {
            "type" : "integer",
            "null_value" : 0
          },
          "internal" : {
            "type" : "integer"
          },
          "international_duration_seconds" : {
            "type" : "integer",
            "null_value" : 0
          },
          "international_outbound" : {
            "type" : "integer"
          },
          "local_call_id" : {
            "type" : "keyword"
          },
          "other_party_name" : {
            "type" : "keyword"
          },
          "outbound" : {
            "type" : "integer"
          },
          "outbound_duration_seconds" : {
            "type" : "integer",
            "null_value" : 0
          },
          "record_id" : {
            "type" : "keyword"
          },
          "redirection_reason" : {
            "type" : "keyword"
          },
          "related_call_id" : {
            "type" : "keyword"
          },
          "related_call_id_reason" : {
            "type" : "keyword"
          },
          "remote_call_id" : {
            "type" : "keyword"
          },
          "remote_user_id" : {
            "type" : "keyword"
          },
          "returned_id" : {
            "type" : "keyword"
          },
          "start_time" : {
            "type" : "date",
            "format" : "yyyyMMddHHmmss.SSS"
          },
          "talk_duration_seconds" : {
            "type" : "integer",
            "null_value" : 0
          },
          "tollfree_duration_seconds" : {
            "type" : "integer",
            "null_value" : 0
          },
          "tollfree_inbound" : {
            "type" : "integer"
          },
          "user_area_code" : {
            "type" : "keyword"
          },
          "user_id" : {
            "type" : "keyword"
          },
          "user_number" : {
            "type" : "keyword"
          },
          "user_type" : {
            "type" : "keyword"
          }
        }
      }
    }
  }
}

Using non-deprecated methods will avoid logging warnings about the use of deprecated features. Reading the docs you linked to, it looks like you need to access the date field using the "value" property.
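Something along these lines (a sketch; on 6.7+ doc['start_time'].value should expose the java.time getHour() method without going through the deprecated accessors):

```json
{
	"aggs": {
		"hour_buckets": {
			"terms": {
				"script": "doc['start_time'].value.getHour()",
				"min_doc_count": 0,
				"size": 24
			}
		}
	}
}
```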