ScrollAll Timeout

I'm running into timeouts using ScrollAll; the code is below. I have a large index (160 million documents) and need to plan for substantially larger ones. We will routinely need to run scroll searches that return millions of documents at a time. ScrollAll seems to fit the bill, but large requests keep timing out after about 15 minutes. From what I've read, Elasticsearch seems to divide the query timeout by the number of nodes (why??), which would explain my timeout in a 4-node environment (60m query timeout / 4 nodes = 15 min timeout). But even when I set the timeout value to "(nodeCount * 60)m", I still get timeouts at around 15 minutes. Am I misunderstanding how to set the timeout for this type of query, or is there something else going on that I haven't found yet?

var scrollObserver = _elasticClient.ScrollAll<MyMessageType>("1m", numberOfSlices, s => s
    .MaxDegreeOfParallelism(numberOfSlices)
    .Search(search => search
        .Index(new[] { indexName })
        .Timeout("60m")
        .Size(1000)
        .Source(msg => msg.Includes(inc => inc.Field(f => f.MessageId)))
        .Query(q => MyQueryBuilder(q, data.SearchRequest))
    )
).Wait(TimeSpan.FromMinutes(60), async (r) =>
{
    var documents = r.SearchResponse.Documents.Select(d => d.MessageId);
    await AddToDbQueue(documents);
});

Any help available?

I believe the timeout is happening at the scroll/slice level, with a single scroll pointer timing out, and I think it has something to do with too many scrolls being open at once. Reducing the scroll timeout from "1m" to something like "10s" seems to be more reliable, but the failure still happens sometimes. This might be borne out by this entry I'm seeing in the logs (thank you, Filebeat!):

2019-08-27 16:57:38.809  [2019-08-27T21:57:38,782][DEBUG][o.e.a.s.TransportSearchScrollAction] [data-0] [845767] Failed to execute query phase  
Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [845767]

(long stack trace after this, omitted for brevity)

Setting the timeout on the search query itself (i.e. search.Timeout("60m")) doesn't seem to have any effect.

Would you be able to answer the following:

  1. What version of NEST are you using?

  2. Do you have an example (query JSON is fine) of the query built by MyQueryBuilder(...)?

  3. What is numberOfSlices set to?

  4. What is AddToDbQueue(...) doing? How long typically does an invocation take to complete?

Passing an async delegate to Wait(...) will be an issue. The Action<IScrollAllResponse<T>> is not awaited (or synchronously waited) by the invocation of the void OnNext(IScrollAllResponse<T>) on ScrollAllObserver<T>, meaning the number of scroll requests concurrently issued can become unbounded.

Thanks for the info, @forloop. I'll answer your concerns in order:

  1. NEST version is 6.6.0, aligned with the version of Elasticsearch we're currently running. I'm a one-man dev on this project, so a major version upgrade is beyond what I'd like to attempt at this time.
  2. I'll add the REST request body below my reply.
  3. numberOfSlices is set to the minimum of (totalDocuments / 1000) and the total shard count, which is 17 primaries for this index.
  4. AddToDbQueue() just adds the retrieved message ids (IEnumerable<int>, <= 1000 count) to an Azure storage queue. It's a fast operation that should never take more than 50 ms or so.

If I simply change the call to AddToDbQueue(documents).GetAwaiter().GetResult(), would that solve the issue you're describing with await? If not, do you have a suggestion for a proper implementation?

MyQueryBuilder() result (couldn't get the text formatter to play ball...sorry). I need to limit each result set to 1000 to prevent size issues with the storage queue message (6k limit, 1000 ints fits nicely):
{ "query": { "bool": { "filter": [ { "match": { "subject": { "operator": "and", "query": "enron" } } } ] } }, "size": 1000, "slice": { "id": 0, "max": 14 }, "sort": [ { "_doc": {} } ], "_source": { "includes": [ "messageId" ] }, "timeout": "240m" }

I would expect it to, provided you also remove async/await from the delegate. Just be aware that blocking on the result of an async operation can deadlock in certain synchronization contexts. That may not be an issue here, but I'm raising it for awareness.
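
To illustrate the difference being discussed, here is a minimal standalone sketch that does not use NEST at all. It only imitates a caller that invokes a void Action<T> callback and moves on, which is how the thread describes OnNext behaving. FakeQueueWriteAsync is a hypothetical stand-in for AddToDbQueue, and the 50 ms delay is an assumption taken from the estimate given above.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class CallbackDemo
{
    // Hypothetical stand-in for AddToDbQueue: about 50 ms of simulated async I/O.
    private static async Task FakeQueueWriteAsync(int[] completed)
    {
        await Task.Delay(50);
        Interlocked.Increment(ref completed[0]);
    }

    // An async lambda assigned to Action<int> compiles to "async void":
    // the caller gets control back at the first await and cannot wait for the rest.
    public static int FinishedAfterAsyncLambda(int pages)
    {
        var completed = new int[1];
        Action<int> onNext = async page => await FakeQueueWriteAsync(completed);
        for (var page = 0; page < pages; page++)
        {
            onNext(page); // returns almost immediately, write still in flight
        }
        return Volatile.Read(ref completed[0]);
    }

    // A non-async lambda that blocks on the task keeps the caller in step:
    // each call returns only after its write has finished.
    public static int FinishedAfterBlockingLambda(int pages)
    {
        var completed = new int[1];
        Action<int> onNext = page => FakeQueueWriteAsync(completed).GetAwaiter().GetResult();
        for (var page = 0; page < pages; page++)
        {
            onNext(page); // blocks until this write completes
        }
        return Volatile.Read(ref completed[0]);
    }

    public static void Main()
    {
        Console.WriteLine($"async lambda:    {FinishedAfterAsyncLambda(5)} of 5 writes finished when the loop returned");
        Console.WriteLine($"blocking lambda: {FinishedAfterBlockingLambda(5)} of 5 writes finished when the loop returned");
    }
}
```

With the async lambda, the loop usually finishes before any write has completed, so nothing slows the caller down. With the blocking lambda, all five writes are done by the time the loop returns. This sketch runs as a console application, where there is no synchronization context; the deadlock caveat above applies to environments that do have one.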
