Async query not returning results + search context failures

We recently added many frozen indices to our cluster. We have been having trouble querying them using the regular search api, with a frequent error that there is no search context found. We believe this is because some shards return results quickly, but by the time all of the frozen shards return results, the search contexts for the others have timed out. Our queries now just take too long to support regular searching.

We saw the async querying feature, which is suggested for long-running queries, and upgraded to 7.9.1 to help us solve our problem. We are still seeing the same issues with the search contexts. After running a query, this is a typical response we get when retrieving results:

{
  "id" : "FjA0NzdfYkIzUVdpTVpaRDg1MnJSaEEeN0Nqc2NUUEVTa2k3Qk5tVzNNZlBpQToxMjE0NjA1",
  "is_partial" : false,
  "is_running" : false,
  "start_time_in_millis" : 1600098471715,
  "expiration_time_in_millis" : 1600530471715,
  "response" : {
    "took" : 848143,
    "timed_out" : false,
    "num_reduce_phases" : 6025,
    "_shards" : {
      "total" : 6026,
      "successful" : 6007,
      "skipped" : 0,
      "failed" : 19,
      "failures" : [
        {
          "shard" : 0,
          "index" : "indexA",
          "node" : "KTZdPUuXR7KCm8kbLckYlQ",
          "reason" : {
            "type" : "search_context_missing_exception",
            "reason" : "No search context found for id [65009]"
          }
        },
        {
          "shard" : 0,
          "index" : "indexB",
          "node" : "KO3oWO4iSzadYkUgWEjgJw",
          "reason" : {
            "type" : "search_context_missing_exception",
            "reason" : "No search context found for id [7971024]"
          }
        }
        …
      ]
    },
    "hits" : {
      "total" : {
        "value" : 146810,
        "relation" : "eq"
      },
      "max_score" : 24.484806,
      "hits" : [ ]
    }
  }
}
  1. Many failures are listed for the search contexts, which we thought would not happen as often given that the purpose of this feature is for long-running queries. Are there any suggestions for keeping the contexts alive longer? Scroll is not supported. Will increasing our heap help, or is there just a limit to how many can be open before they start getting deleted?
  2. Even though we have a large number of hits, none are actually returned in the response. Sometimes we do see some returned, but not often.
  3. We are never able to see any partial results while the query is running, even though the docs indicate we should be able to.
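
To make responses like the one above easier to compare between runs, here is a minimal Python sketch that counts the shard failures by exception type and by index, and contrasts the reported hit total with the number of hits actually returned. The function name `summarize_shard_failures` and the trimmed `sample` dict are illustrative assumptions; only the field names are taken from the response shown above.

```python
from collections import Counter


def summarize_shard_failures(async_response):
    """Summarize shard failures and hit counts of an async search response dict."""
    response = async_response.get("response", {})
    shards = response.get("_shards", {})
    failures = shards.get("failures", [])
    hits = response.get("hits", {})
    return {
        "total_shards": shards.get("total", 0),
        "failed_shards": shards.get("failed", 0),
        # How many failures of each exception type were listed.
        "by_type": dict(Counter(f["reason"]["type"] for f in failures)),
        # How many listed failures belong to each index.
        "by_index": dict(Counter(f["index"] for f in failures)),
        # Hits the search claims to have matched vs. hits actually returned.
        "reported_hits": hits.get("total", {}).get("value", 0),
        "returned_hits": len(hits.get("hits", [])),
    }


# Trimmed copy of the response above (only the two failures that were shown).
sample = {
    "is_partial": False,
    "is_running": False,
    "response": {
        "_shards": {
            "total": 6026, "successful": 6007, "skipped": 0, "failed": 19,
            "failures": [
                {"shard": 0, "index": "indexA", "node": "KTZdPUuXR7KCm8kbLckYlQ",
                 "reason": {"type": "search_context_missing_exception",
                            "reason": "No search context found for id [65009]"}},
                {"shard": 0, "index": "indexB", "node": "KO3oWO4iSzadYkUgWEjgJw",
                 "reason": {"type": "search_context_missing_exception",
                            "reason": "No search context found for id [7971024]"}},
            ],
        },
        "hits": {"total": {"value": 146810, "relation": "eq"},
                 "max_score": 24.484806, "hits": []},
    },
}

summary = summarize_shard_failures(sample)
print(summary)
```

A large `reported_hits` together with `returned_hits` of 0 is the symptom described in point 2.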

This is a typical query we have been running:

POST <alias targeting hot, warm, and cold/frozen indices>/_async_search?ignore_throttled=false&size=1000&keep_on_completion=true&allow_partial_search_results=true
{
  "sort": {
    "_score": "desc"
  },
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "default_operator": "OR",
            "query": "someSearchTerm",
            "fields": [
              "fieldA",
              "fieldB",
              "fieldC", ...
            ],
            "lenient": true
          }
        }
      ]
    }
  },
  "track_total_hits": true
}

Using aggregations with buckets of size 1 seems to work consistently:

{
  "query": {
     ...
  },
  "aggs": {
    "test": {
      "terms": {
        "field": "someID",
        "order": {"max_score": "desc"}
      },
      "aggs": {
        "hits":{
          "top_hits": {
            "sort": [{
              "_score": {
                "order": "desc"
              }
            }],
            "size": 1
          }
        },
        "max_score": {
          "max": {
            "script": "_score"
          }
        }
      }
    }
  }
}

There are some problems with this approach for us:

  1. The terms aggregation is limited to 1000
  2. We will not be able to easily do paging
  3. The term we are basing this on is not unique (there are a small number of duplicate values), so we might have some unintentionally deduplicated results.
  4. Aggregations are going to make our queries even slower

We've experimented with some of the params listed for async search and regular search (keep_on_completion, allow_partial_search_results, batched_reduce_size, wait_for_completion_timeout), but nothing so far has given us consistently good results besides running the aggregation above. We've also tried limiting our search to only frozen indices instead of a mix of hot, warm, and frozen. This gives us somewhat better results, but we still see the same issues, just less often. Is all of this expected behavior for this feature, or are there some bugs here? If this is expected, are there any recommendations for params or settings that can help us out?
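
For reference, the submit-then-poll flow we are using can be sketched roughly as below. This is a simplified illustration, not our production client: `send` is a hypothetical transport callable (method, path, params, body) that returns the decoded JSON, and `make_fake_send` only simulates a cluster so the sketch runs without one. The two endpoints, POST <index>/_async_search and GET /_async_search/<id>, are the ones from the async search docs.

```python
import time


def run_async_search(send, index, body, params=None,
                     poll_interval=5.0, max_polls=100, sleep=time.sleep):
    """Submit an async search, then poll until it stops running or max_polls is hit."""
    resp = send("POST", f"/{index}/_async_search", dict(params or {}), body)
    polls = 0
    while resp.get("is_running") and polls < max_polls:
        search_id = resp["id"]  # present while the search is still running
        sleep(poll_interval)
        resp = send("GET", f"/_async_search/{search_id}", {}, None)
        polls += 1
    return resp


def make_fake_send():
    """Hypothetical stand-in for a real HTTP client: replays canned responses."""
    states = iter([
        {"id": "abc", "is_running": True, "is_partial": True},
        {"id": "abc", "is_running": True, "is_partial": True},
        {"id": "abc", "is_running": False, "is_partial": False,
         "response": {"hits": {"total": {"value": 3, "relation": "eq"}, "hits": []}}},
    ])
    calls = []

    def send(method, path, params, body):
        calls.append((method, path))
        return next(states)

    return send, calls


send, calls = make_fake_send()
final = run_async_search(
    send,
    "my-alias",  # placeholder alias name
    {"query": {"match_all": {}}},
    params={"keep_on_completion": "true", "ignore_throttled": "false"},
    sleep=lambda seconds: None,  # skip real waiting in this demo
)
print(final["is_running"], calls)
```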

Hi,
I don't think the error you are getting is expected, regardless of whether you are using _search or _async_search.

For long-running queries, async search is definitely the better choice: you don't have to keep the connection open for a long time, and can instead come back later to retrieve the status of the running search, including partial aggregation results.

One thing I noticed is that you are sorting by score. I assume that you are querying time-based indices, and the frozen ones are the oldest, is that correct? If so, have you tried sorting by timestamp instead?

Looking more closely, the "search context not found" error may be because the search context has already expired by the time we execute the fetch phase, hence the top hits cannot be fetched.

Could you try setting the size to 0 in your search request, please?
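
The expiry hypothesis above can be pictured with a toy timeline. This is only an illustrative model in Python, not how Elasticsearch is implemented: it assumes the fetch phase starts once the slowest shard has finished its query phase, and that a context is dropped after sitting idle longer than a keep-alive. The shard names, timings, and the 300-second keep-alive are made-up values.

```python
def expired_contexts(query_done_at, keep_alive):
    """Return the shards whose search context would be gone when fetch starts.

    query_done_at: shard name -> seconds at which its query phase finished.
    keep_alive: seconds an idle context survives (hypothetical value).
    Toy assumption: fetch begins when the slowest shard finishes its query phase.
    """
    fetch_starts_at = max(query_done_at.values())
    return sorted(
        shard
        for shard, done_at in query_done_at.items()
        if fetch_starts_at - done_at > keep_alive
    )


# Made-up timings: fast hot/warm shards, slow frozen shards.
timeline = {"hot-0": 2, "hot-1": 3, "warm-0": 40, "frozen-0": 700, "frozen-1": 848}
lost = expired_contexts(timeline, keep_alive=300)
print(lost)  # ['hot-0', 'hot-1', 'warm-0']
```

In this model the shards that answered first are the ones that lose their context, and with size 0 there is no fetch phase at all, so nothing can go missing.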

Thanks for the quick response!

Our indices are time-based, but they are not logs. We have a field that indicates what time we acquired the doc. We just tried sorting by that field with size 1000 and got 92 search context failures, 68281 hits reported, but no actual results coming back in the hits array.

Running with size 0 didn't give any search context failures, but obviously no hits returned in the array even though we have matches.

Ok, thanks for trying that out. That confirms that the errors are due to the context being expired when documents are fetched. Would it be possible for you to split your search into two: one for aggregations with size 0, and one for the search hits without aggregations?

We really aren't looking for aggregations at all. The only reason we were including them was because it was the only way we could force search results to come back. If we query without aggregations, 95% of the time we get nothing back in the hits array, even though there are matches (the error in this post was the result of running with no aggregations). Our ideal scenario for async would be to send one search query with no aggregations and get results back.

Ok, thanks for the info.

Can we go back to the query you want to run without aggregations in it, with the size that you need? Could you post that one? And could you try again, sorting that one by timestamp descending? Is it any better in that case? This way the frozen indices should come last, so their search contexts are less likely to have expired by the time the fetch phase runs.

Besides seeing this specific query, it would be good to know if you expect top_hits returned by the frozen indices or not.

Cheers
Luca

Request:

POST <hot, warm, cold alias>/_async_search?size=100&keep_on_completion=true&allow_partial_search_results=true&ignore_throttled=false
{
  "sort": [
    {
      "timeField": {
        "order": "desc"
      }
    }
  ],
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "default_operator": "OR",
            "query": "someSearchTerm",
            "fields": [
              "fieldA",
              "fieldB",
              "fieldC",
              ...
            ],
            "lenient": true
          }
        }
      ]
    }
  },
  "track_total_hits": true
}

Response:

{
  "id" : "Fk1lZ2JwLWJuUWd5Q0lkYzY1aUVhZWceN0Nqc2NUUEVTa2k3Qk5tVzNNZlBpQToyMzUyMTY2",
  "is_partial" : false,
  "is_running" : false,
  "start_time_in_millis" : 1600343485361,
  "expiration_time_in_millis" : 1600775485361,
  "response" : {
    "took" : 1019350,
    "timed_out" : false,
    "num_reduce_phases" : 1513,
    "_shards" : {
      "total" : 6050,
      "successful" : 6030,
      "skipped" : 0,
      "failed" : 20,
      "failures" : [
        {
          "shard" : 0,
          "index" : "indexA-2020-09-17",
          "node" : "jkqY2wFPSfq5lfSj1IBDgg",
          "reason" : {
            "type" : "search_context_missing_exception",
            "reason" : "No search context found for id [25513717]"
          }
        },
        {
          "shard" : 4,
          "index" : "indexA-2020-09-17",
          "node" : "C6nD9AW3TJSuGzwikLZtCA",
          "reason" : {
            "type" : "search_context_missing_exception",
            "reason" : "No search context found for id [26398540]"
          }
        },
        {
          "shard" : 1,
          "index" : "indexA-2020-09-17",
          "node" : "jkqY2wFPSfq5lfSj1IBDgg",
          "reason" : {
            "type" : "search_context_missing_exception",
            "reason" : "No search context found for id [25513716]"
          }
        },
        {
          "shard" : 2,
          "index" : "indexA-2020-09-17",
          "node" : "il1OydtoTRiEBeOcLHcfSQ",
          "reason" : {
            "type" : "search_context_missing_exception",
            "reason" : "No search context found for id [25241121]"
          }
        },
        {
          "shard" : 3,
          "index" : "indexA-2020-09-17",
          "node" : "C6nD9AW3TJSuGzwikLZtCA",
          "reason" : {
            "type" : "search_context_missing_exception",
            "reason" : "No search context found for id [26398541]"
          }
        },
        {
          "shard" : 1,
          "index" : "indexB-2020-09-17",
          "node" : "C6nD9AW3TJSuGzwikLZtCA",
          "reason" : {
            "type" : "search_context_missing_exception",
            "reason" : "No search context found for id [26398543]"
          }
        },
        {
          "shard" : 2,
          "index" : "indexB-2020-09-17",
          "node" : "jkqY2wFPSfq5lfSj1IBDgg",
          "reason" : {
            "type" : "search_context_missing_exception",
            "reason" : "No search context found for id [25513720]"
          }
        },
        {
          "shard" : 3,
          "index" : "indexB-2020-09-17",
          "node" : "C6nD9AW3TJSuGzwikLZtCA",
          "reason" : {
            "type" : "search_context_missing_exception",
            "reason" : "No search context found for id [26398545]"
          }
        },
        {
          "shard" : 0,
          "index" : "indexB-2020-09-17",
          "node" : "il1OydtoTRiEBeOcLHcfSQ",
          "reason" : {
            "type" : "search_context_missing_exception",
            "reason" : "No search context found for id [25241122]"
          }
        },
        {
          "shard" : 4,
          "index" : "indexB-2020-09-17",
          "node" : "il1OydtoTRiEBeOcLHcfSQ",
          "reason" : {
            "type" : "search_context_missing_exception",
            "reason" : "No search context found for id [25241125]"
          }
        },
        {
          "shard" : 1,
          "index" : "indexC-2020-09-17",
          "node" : "jkqY2wFPSfq5lfSj1IBDgg",
          "reason" : {
            "type" : "search_context_missing_exception",
            "reason" : "No search context found for id [25513719]"
          }
        },
        {
          "shard" : 3,
          "index" : "indexA-2020-09-16",
          "node" : "C6nD9AW3TJSuGzwikLZtCA",
          "reason" : {
            "type" : "search_context_missing_exception",
            "reason" : "No search context found for id [26398548]"
          }
        },
        {
          "shard" : 0,
          "index" : "indexA-2020-09-16",
          "node" : "jkqY2wFPSfq5lfSj1IBDgg",
          "reason" : {
            "type" : "search_context_missing_exception",
            "reason" : "No search context found for id [25513722]"
          }
        },
        {
          "shard" : 2,
          "index" : "indexA-2020-09-16",
          "node" : "il1OydtoTRiEBeOcLHcfSQ",
          "reason" : {
            "type" : "search_context_missing_exception",
            "reason" : "No search context found for id [25241127]"
          }
        },
        {
          "shard" : 4,
          "index" : "indexA-2020-09-16",
          "node" : "C6nD9AW3TJSuGzwikLZtCA",
          "reason" : {
            "type" : "search_context_missing_exception",
            "reason" : "No search context found for id [26398549]"
          }
        },
        {
          "shard" : 1,
          "index" : "indexA-2020-09-16",
          "node" : "jkqY2wFPSfq5lfSj1IBDgg",
          "reason" : {
            "type" : "search_context_missing_exception",
            "reason" : "No search context found for id [25513723]"
          }
        },
        {
          "shard" : 1,
          "index" : "indexB-2020-09-16",
          "node" : "il1OydtoTRiEBeOcLHcfSQ",
          "reason" : {
            "type" : "search_context_missing_exception",
            "reason" : "No search context found for id [25241128]"
          }
        },
        {
          "shard" : 2,
          "index" : "indexB-2020-09-16",
          "node" : "jkqY2wFPSfq5lfSj1IBDgg",
          "reason" : {
            "type" : "search_context_missing_exception",
            "reason" : "No search context found for id [25513724]"
          }
        },
        {
          "shard" : 3,
          "index" : "indexB-2020-09-16",
          "node" : "C6nD9AW3TJSuGzwikLZtCA",
          "reason" : {
            "type" : "search_context_missing_exception",
            "reason" : "No search context found for id [26398550]"
          }
        },
        {
          "shard" : 4,
          "index" : "indexB-2020-09-16",
          "node" : "C6nD9AW3TJSuGzwikLZtCA",
          "reason" : {
            "type" : "search_context_missing_exception",
            "reason" : "No search context found for id [26398551]"
          }
        }
      ]
    },
    "hits" : {
      "total" : {
        "value" : 48868,
        "relation" : "eq"
      },
      "max_score" : null,
      "hits" : [ ]
    }
  }
}

I also tried it ascending by date and got the same response except the indices/shards listed in the failures were our frozen ones. We do expect that some search results will be in the frozen indices.

Hi,
thanks for the details. Could you try removing track_total_hits? When omitted, it defaults to 10000. You could also try values lower than 10000 and see whether that makes a difference.

Thanks

We tried removing track_total_hits and also sorting by a date, and we got results returned. We are guessing this is because ES is able to target docs with specific dates and does not continue searching once it hits the size needed. We also never see any partial results coming back, even for these queries that end up being successful. Are partial results only expected for aggregations?

Removing track_total_hits when sorting by _score gave us the same sorts of errors though. For our use case, we really need to be able to sort on _score to get the most relevant results. Is async search not intended to support sorting by _score? Is this a known type of query that it will not support?

Like you figured, track_total_hits set to true means that all shards are going to execute the query. Sorting by timestamp desc makes sure that the frozen indices are left last, so that if enough hits are found before we get to them, they won't even execute the query, or maybe some will but not all of them.
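
As a rough illustration of that last point, here is a toy Python model, not Elasticsearch's actual implementation: shards are visited newest-first by their maximum timestamp, and a shard is skipped once the top `size` hits are already filled with newer documents, unless an exact total is requested. The function name, shard names, and timestamps are all made up for the example.

```python
import heapq


def shards_executed(shards, size, track_total_hits):
    """Return the names of the shards that have to run the query (toy model).

    shards: list of (name, non-empty list of matching doc timestamps).
    The sort is by timestamp descending, so only the newest `size` docs matter.
    """
    ordered = sorted(shards, key=lambda s: max(s[1]), reverse=True)
    top = []       # min-heap holding the newest `size` timestamps seen so far
    executed = []
    for name, stamps in ordered:
        top_is_full = len(top) >= size
        if top_is_full and not track_total_hits and max(stamps) <= top[0]:
            continue  # nothing in this shard can enter the top `size`
        executed.append(name)
        for ts in stamps:
            if len(top) < size:
                heapq.heappush(top, ts)
            elif ts > top[0]:
                heapq.heapreplace(top, ts)
    return executed


shards = [
    ("frozen-2019", [1, 2, 3]),
    ("hot-2020-09", [100, 101, 102]),
    ("warm-2020-06", [50, 51]),
    ("frozen-2018", [0]),
]
skipping = shards_executed(shards, size=3, track_total_hits=False)
counting = shards_executed(shards, size=3, track_total_hits=True)
print(skipping)  # ['hot-2020-09']
print(counting)  # ['hot-2020-09', 'warm-2020-06', 'frozen-2019', 'frozen-2018']
```

With an exact total requested, every shard has to run, which matches what you observed with track_total_hits set to true.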

If you sort by score, frozen indices are no longer last, hence the slowdown is kind of expected, but we need to work on improving this; we are planning to execute query and fetch in the same roundtrip in this case so that the context does not expire between the execution of the query phase and fetch phase on the same shard.

Partial results don't hold hits at the moment in async search, only aggregations results. We are planning to add support for partial search hits soon.

Let me know if you have any further questions.

Cheers
Luca

Thank you. Are there any issue numbers related to some of these future improvements that we can track?