We recently added many frozen indices to our cluster. We have been having trouble querying them using the regular search API: requests frequently fail with an error saying that no search context was found. We believe this is because some shards return results quickly, but by the time all of the frozen shards have returned results, the search contexts for the others have timed out. Our queries now simply take too long to support regular searching.
We saw the async search feature, which is suggested for long-running queries, and upgraded to 7.9.1 hoping it would solve our problem. We are still seeing the same issues with the search contexts. After running a query, this is a typical response we get when retrieving results:
{
"id" : "FjA0NzdfYkIzUVdpTVpaRDg1MnJSaEEeN0Nqc2NUUEVTa2k3Qk5tVzNNZlBpQToxMjE0NjA1",
"is_partial" : false,
"is_running" : false,
"start_time_in_millis" : 1600098471715,
"expiration_time_in_millis" : 1600530471715,
"response" : {
"took" : 848143,
"timed_out" : false,
"num_reduce_phases" : 6025,
"_shards" : {
"total" : 6026,
"successful" : 6007,
"skipped" : 0,
"failed" : 19,
"failures" : [
{
"shard" : 0,
"index" : "indexA",
"node" : "KTZdPUuXR7KCm8kbLckYlQ",
"reason" : {
"type" : "search_context_missing_exception",
"reason" : "No search context found for id [65009]"
}
},
{
"shard" : 0,
"index" : "indexB",
"node" : "KO3oWO4iSzadYkUgWEjgJw",
"reason" : {
"type" : "search_context_missing_exception",
"reason" : "No search context found for id [7971024]"
}
}
…
]
},
"hits" : {
"total" : {
"value" : 146810,
"relation" : "eq"
},
"max_score" : 24.484806,
"hits" : [ ]
}
}
}
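To make responses like the one above easier to compare between runs, here is a minimal Python sketch that tallies shard failures by exception type. The function name `summarize_shard_failures` and the trimmed `sample` dictionary are illustrative assumptions. The sample copies only the two failures shown above, so the per-type count covers just the listed entries, not all 19 failed shards.

```python
from collections import Counter


def summarize_shard_failures(async_response):
    """Return (failed, total, failure types) for an async search response body."""
    shards = async_response.get("response", {}).get("_shards", {})
    failures = shards.get("failures", [])
    # Count the failures that are actually listed, grouped by exception type.
    by_type = Counter(
        failure.get("reason", {}).get("type", "unknown") for failure in failures
    )
    return shards.get("failed", 0), shards.get("total", 0), by_type


# Trimmed copy of the response above; the failure list is truncated there too.
sample = {
    "is_partial": False,
    "is_running": False,
    "response": {
        "_shards": {
            "total": 6026,
            "successful": 6007,
            "skipped": 0,
            "failed": 19,
            "failures": [
                {
                    "shard": 0,
                    "index": "indexA",
                    "reason": {
                        "type": "search_context_missing_exception",
                        "reason": "No search context found for id [65009]",
                    },
                },
                {
                    "shard": 0,
                    "index": "indexB",
                    "reason": {
                        "type": "search_context_missing_exception",
                        "reason": "No search context found for id [7971024]",
                    },
                },
            ],
        },
        "hits": {"total": {"value": 146810, "relation": "eq"}, "hits": []},
    },
}

failed, total, by_type = summarize_shard_failures(sample)
print(f"{failed} of {total} shards failed; listed failure types: {dict(by_type)}")
```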
- Many search context failures are listed, which we did not expect to see this often given that this feature is intended for long-running queries. Are there any suggestions for keeping the contexts alive longer? Scroll is not supported with async search. Will increasing our heap help, or is there simply a limit to how many contexts can be open before they start getting deleted?
- Even though we have a large number of hits, none are actually returned in the response. Sometimes we do see some returned, but not often.
- We are never able to see any partial results while the query is running, even though the docs indicate we should be able to.
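For reference, this is roughly the kind of polling we would expect to surface partial results. It is a minimal Python sketch under stated assumptions, not our exact client code: `fetch` is a hypothetical callable that issues `GET` against the given path and returns the parsed JSON body, and `poll_async_search`, `interval_s`, and `max_polls` are names introduced only for illustration.

```python
import time


def poll_async_search(fetch, search_id, interval_s=5.0, max_polls=100):
    """Poll GET /_async_search/<id> and record what each poll returned.

    `fetch` is a hypothetical helper: it takes a URL path and returns the
    parsed JSON response body as a dict.
    """
    path = f"/_async_search/{search_id}"
    snapshots = []
    for _ in range(max_polls):
        body = fetch(path)
        response = body.get("response", {})
        snapshots.append(
            {
                "is_partial": body.get("is_partial"),
                "is_running": body.get("is_running"),
                "hits_returned": len(response.get("hits", {}).get("hits", [])),
                "shards_successful": response.get("_shards", {}).get("successful"),
            }
        )
        # Stop as soon as the search reports that it is no longer running.
        if not body.get("is_running", False):
            break
        time.sleep(interval_s)
    return snapshots
```

If partial results were being exposed, we would expect `hits_returned` or `shards_successful` to grow between snapshots while `is_running` is still true.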
This is a typical query we have been running:
POST <alias targeting hot, warm, and cold/frozen indices>/_async_search?ignore_throttled=false&size=1000&keep_on_completion=true&allow_partial_search_results=true
{
"sort": {
"_score": "desc"
},
"query": {
"bool": {
"must": [
{
"query_string": {
"default_operator": "OR",
"query": "someSearchTerm",
"fields": [
"fieldA",
"fieldB",
"fieldC", ...
],
"lenient": true
}
}
]
}
},
"track_total_hits": true
}
Using a terms aggregation with a top_hits sub-aggregation of size 1 in each bucket seems to work consistently:
{
"query": {
...
},
"aggs": {
"test": {
"terms": {
"field": "someID",
"order": {"max_score": "desc"}
},
"aggs": {
"hits":{
"top_hits": {
"sort": [{
"_score": {
"order": "desc"
}
}],
"size": 1
}
},
"max_score": {
"max": {
"script": "_score"
}
}
}
}
}
}
There are some problems with this approach for us:
- The terms aggregation is limited to 1000 buckets
- We will not be able to easily do paging
- The term we are basing this on is not unique (there are a small number of duplicate values), so some results might be unintentionally deduplicated.
- Aggregations are going to make our queries even slower
We've experimented with some of the params listed for async search and regular search (keep_on_completion, allow_partial_search_results, batched_reduce_size, wait_for_completion_timeout), but so far nothing has given us consistently good results besides running the aggregation above. We've also tried limiting our search to only frozen indices instead of a mix of hot, warm, and frozen. This gives us somewhat better results, but we still see the same issues, just less often. Is all of this expected behavior for this feature, or are there some bugs here? If this is expected, are there any recommendations for params or settings that can help us out?