Groovy script execution occasionally "hangs"

ES Version: 2.4.6
Java Version: openjdk version "1.8.0_191"
OS: Ubuntu 16.04.3 LTS
Node Stats: 12 data nodes, 48 shards, ~1M parent docs, ~300M child docs
Index Refresh Interval: 1s

Experiencing an odd issue and wanted to see if anyone could offer some insight into what the root cause might be. I've done some preliminary research, and the only Groovy-script-related issues I've found are a possible memory leak in 2.4 (https://github.com/elastic/elasticsearch/issues/18079) and a possible Java CodeCache bug (see https://bugs.openjdk.java.net/browse/JDK-8023191). However, in our case we are using indexed/parameterized scripts to avoid constant script recompilation.
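
(For reference, the date-comparison script used in the aggregation described below is stored once as an indexed script and invoked by id, so only its params change between requests. Roughly like this -- the script id is a placeholder, and the syntax is the 2.x indexed-script API as best I recall it:)

    POST /_scripts/groovy/date_diff_within_threshold
    {
      "script": "doc['a_date'].date.getMillis() - doc['b_date'].date.getMillis() <= parameterized_threshold"
    }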

Our document model is pretty simple and revolves around a parent/child relationship. The query in question targets the child documents of a single parent record (routed to that parent.id), along with a few standard filters (i.e. non-script query filters). The query typically returns in ~10-20 milliseconds and has no problems on subsequent executions.

However, we also ask for a few aggregations in the search request using a FILTERS aggregation. One particular aggregation filter uses a script query that compares the difference between two dates against a parameterized threshold (e.g. return doc['a_date'].date.getMillis() - doc['b_date'].date.getMillis() <= parameterized_threshold).
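
A stripped-down version of the request looks roughly like this (index, type, parent id, aggregation names, and threshold value are placeholders, and the real request has additional standard filters):

    POST /my_index/child_type/_search?routing=PARENT_ID
    {
      "query": {
        "bool": {
          "filter": [
            { "term": { "_parent": "PARENT_ID" } }
          ]
        }
      },
      "aggs": {
        "date_diff": {
          "filters": {
            "filters": {
              "within_threshold": {
                "script": {
                  "script": {
                    "id": "date_diff_within_threshold",
                    "lang": "groovy",
                    "params": { "parameterized_threshold": 86400000 }
                  }
                }
              }
            }
          }
        }
      }
    }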

When I add the aggregations to the original search request, it works fine (and is very performant) up until roughly the fifth consecutive request -- by which I mean re-issuing the same request (possibly with a different parent.id) immediately upon receiving each response, with no delay in between. On roughly the fifth request, the search just hangs and doesn't return for ~40-60 seconds. Looking at hot_threads on the node I'm targeting, it's stuck specifically inside GroovyScript.run:

       java.security.AccessController.doPrivileged(Native Method)
       org.elasticsearch.script.groovy.GroovyScriptEngineService$GroovyScript.run(GroovyScriptEngineService.java:278)
       org.elasticsearch.index.query.ScriptQueryParser$ScriptQuery$1$1.get(ScriptQueryParser.java:171)
       org.apache.lucene.search.RandomAccessWeight$1.matches(RandomAccessWeight.java:63)
       org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:228)
       org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:172)
       org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
       org.apache.lucene.search.LRUQueryCache.cacheImpl(LRUQueryCache.java:422)
       org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.cache(LRUQueryCache.java:608)
       org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.scorer(LRUQueryCache.java:625)
       org.elasticsearch.indices.cache.query.IndicesQueryCache$CachingWeightWrapper.scorer(IndicesQueryCache.java:263)
       org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:389)
       org.apache.lucene.search.Weight.bulkScorer(Weight.java:135)
       org.apache.lucene.search.BooleanWeight.bulkScorer(BooleanWeight.java:370)
       org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.cache(LRUQueryCache.java:604)
       org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.scorer(LRUQueryCache.java:625)
       org.elasticsearch.indices.cache.query.IndicesQueryCache$CachingWeightWrapper.scorer(IndicesQueryCache.java:263)
       org.elasticsearch.search.aggregations.bucket.filters.FiltersAggregator.getLeafCollector(FiltersAggregator.java:90)
       org.elasticsearch.search.aggregations.AggregatorBase.getLeafCollector(AggregatorBase.java:132)
       org.elasticsearch.search.aggregations.AggregatorBase.getLeafCollector(AggregatorBase.java:38)
       org.apache.lucene.search.MultiCollector.getLeafCollector(MultiCollector.java:121)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:812)
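
(The trace above came from the nodes hot threads API, captured against the node handling the request while it was hanging -- node name is a placeholder:)

    GET /_nodes/NODE_NAME/hot_threads?threads=5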

I've isolated the problem to the aggregation filter that applies the script query: when it's removed from the aggregation request, the problem goes away (I can no longer reproduce the hang after repeated executions). What's even more bizarre is that the "hung" Groovy script shows up even when the query has no matches at all -- meaning there is no data for the search request to aggregate!

My only theory right now is that the execution of the Groovy script is getting blocked by GC or some internal event such as a cache eviction, but I can't verify that at the moment. I did check various shard stats, such as the fielddata cache, on the shard I was targeting in my test, and they seemed relatively stable (there is, however, a large number of query_cache evictions ... could that be related to the underlying problem?). We use scripts frequently in function_score queries and in other values-source metrics aggregations with no apparent problems, so my gut says this is related to something specific to script_query.
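
For what it's worth, the eviction numbers below came from polling the per-shard stats while reproducing the issue, roughly like this (index name is a placeholder):

    GET /my_index/_stats/query_cache,fielddata?level=shards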

Shard stats:

  "store": {
    "size": "171.8gb",
    "size_in_bytes": 184534807270,
    "throttle_time": "0s",
    "throttle_time_in_millis": 0
  },
  "merges": {
    "current": 0,
    "current_docs": 0,
    "current_size": "0b",
    "current_size_in_bytes": 0,
    "total": 151463,
    "total_time": "4.3d",
    "total_time_in_millis": 379084042,
    "total_docs": 4128513473,
    "total_size": "1.4tb",
    "total_size_in_bytes": 1616384533860,
    "total_stopped_time": "0s",
    "total_stopped_time_in_millis": 0,
    "total_throttled_time": "2.5d",
    "total_throttled_time_in_millis": 219188827,
    "total_auto_throttle": "5mb",
    "total_auto_throttle_in_bytes": 5242880
  },
  "query_cache": {
    "memory_size": "206.6mb",
    "memory_size_in_bytes": 216649784,
    "total_count": 7766733200,
    "hit_count": 137685883,
    "miss_count": 7629047317,
    "cache_size": 733,
    "cache_count": 1098447,
    "evictions": 1097714
  },
  "fielddata": {
    "memory_size": "75.3mb",
    "memory_size_in_bytes": 79055664,
    "evictions": 0
  },
  "segments": {
    "count": 26,
    "memory": "402mb",
    "memory_in_bytes": 421545822,
    "terms_memory": "354.4mb",
    "terms_memory_in_bytes": 371677370,
    "stored_fields_memory": "35.3mb",
    "stored_fields_memory_in_bytes": 37068400,
    "term_vectors_memory": "0b",
    "term_vectors_memory_in_bytes": 0,
    "norms_memory": "62.4kb",
    "norms_memory_in_bytes": 63936,
    "doc_values_memory": "12.1mb",
    "doc_values_memory_in_bytes": 12736116,
    "index_writer_memory": "0b",
    "index_writer_memory_in_bytes": 0,
    "index_writer_max_memory": "512mb",
    "index_writer_max_memory_in_bytes": 536870912,
    "version_map_memory": "0b",
    "version_map_memory_in_bytes": 0,
    "fixed_bit_set": "33.5mb",
    "fixed_bit_set_memory_in_bytes": 35187432
  },

Any help or tips on where to look further would be greatly appreciated!
