A couple of exceptions from stress testing scrolling

I was doing some stress testing and ran into a couple of interesting
errors when I switched from sorting by _id to sorting on _uid. This
has only happened once; rerunning the query over and over doesn't
seem to trigger it again. Please let me know if you need any more
info.
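
For reference, here is roughly what the query looks like, sketched
against the plain REST API rather than rubberband (the index name
"test", the batch size, and the scroll timeout are made up for
illustration):

    require 'net/http'
    require 'json'
    require 'uri'

    ES = 'http://127.0.0.1:9200'

    # Open a scroll over all documents, sorted on _uid. The index name
    # "test", the size, and the 5m scroll timeout are placeholders.
    uri  = URI("#{ES}/test/_search?scroll=5m&size=1000")
    body = { query: { match_all: {} }, sort: ['_uid'] }.to_json
    resp = Net::HTTP.post(uri, body, 'Content-Type' => 'application/json')
    result    = JSON.parse(resp.body)
    scroll_id = result['_scroll_id']
    puts "total hits: #{result['hits']['total']}"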

[2011-12-12 14:44:05,372][DEBUG][action.search.type ] [Bishop] Failed to execute fetch phase
java.lang.ArrayIndexOutOfBoundsException: 198
at org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:132)
at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:434)
at org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteFetch(SearchServiceTransportAction.java:314)
at org.elasticsearch.action.search.type.TransportSearchScrollQueryThenFetchAction$AsyncAction.executeFetchPhase(TransportSearchScrollQueryThenFetchAction.java:206)
at org.elasticsearch.action.search.type.TransportSearchScrollQueryThenFetchAction$AsyncAction.access$1000(TransportSearchScrollQueryThenFetchAction.java:80)
at org.elasticsearch.action.search.type.TransportSearchScrollQueryThenFetchAction$AsyncAction$3.onResult(TransportSearchScrollQueryThenFetchAction.java:174)
at org.elasticsearch.action.search.type.TransportSearchScrollQueryThenFetchAction$AsyncAction$3.onResult(TransportSearchScrollQueryThenFetchAction.java:170)
at org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteQuery(SearchServiceTransportAction.java:195)
at org.elasticsearch.action.search.type.TransportSearchScrollQueryThenFetchAction$AsyncAction.executeQueryPhase(TransportSearchScrollQueryThenFetchAction.java:170)
at org.elasticsearch.action.search.type.TransportSearchScrollQueryThenFetchAction$AsyncAction.access$700(TransportSearchScrollQueryThenFetchAction.java:80)
at org.elasticsearch.action.search.type.TransportSearchScrollQueryThenFetchAction$AsyncAction$2.run(TransportSearchScrollQueryThenFetchAction.java:157)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:680)
[2011-12-12 14:44:06,963][DEBUG][action.search.type ] [Bishop] Failed to execute fetch phase
org.elasticsearch.ElasticSearchParseException: failed to parse / load source
at org.elasticsearch.search.lookup.SourceLookup.loadSourceIfNeeded(SourceLookup.java:68)
at org.elasticsearch.search.lookup.SourceLookup.extractValue(SourceLookup.java:111)
at org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:186)
at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:434)
at org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteFetch(SearchServiceTransportAction.java:314)
at org.elasticsearch.action.search.type.TransportSearchScrollQueryThenFetchAction$AsyncAction.executeFetchPhase(TransportSearchScrollQueryThenFetchAction.java:206)
at org.elasticsearch.action.search.type.TransportSearchScrollQueryThenFetchAction$AsyncAction.access$1000(TransportSearchScrollQueryThenFetchAction.java:80)
at org.elasticsearch.action.search.type.TransportSearchScrollQueryThenFetchAction$AsyncAction$3.onResult(TransportSearchScrollQueryThenFetchAction.java:174)
at org.elasticsearch.action.search.type.TransportSearchScrollQueryThenFetchAction$AsyncAction$3.onResult(TransportSearchScrollQueryThenFetchAction.java:170)
at org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteQuery(SearchServiceTransportAction.java:195)
at org.elasticsearch.action.search.type.TransportSearchScrollQueryThenFetchAction$AsyncAction.executeQueryPhase(TransportSearchScrollQueryThenFetchAction.java:170)
at org.elasticsearch.action.search.type.TransportSearchScrollQueryThenFetchAction$AsyncAction.access$700(TransportSearchScrollQueryThenFetchAction.java:80)
at org.elasticsearch.action.search.type.TransportSearchScrollQueryThenFetchAction$AsyncAction$2.run(TransportSearchScrollQueryThenFetchAction.java:157)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:680)
Caused by: java.lang.IllegalArgumentException: docID must be >= 0 and < maxDoc=3257 (got docID=-1)
at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:475)
at org.elasticsearch.search.lookup.SourceLookup.loadSourceIfNeeded(SourceLookup.java:60)
... 15 more
[2011-12-12 14:44:07,209][DEBUG][action.search.type ] [Bishop] Failed to execute fetch phase
org.elasticsearch.ElasticSearchParseException: failed to parse / load source
... (same stack trace as above)
Caused by: java.lang.IllegalArgumentException: docID must be >= 0 and < maxDoc=3607 (got docID=-1)
at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:475)
at org.elasticsearch.search.lookup.SourceLookup.loadSourceIfNeeded(SourceLookup.java:60)
... 15 more
[2011-12-12 14:44:08,358][DEBUG][action.search.type ] [Bishop] Failed to execute fetch phase
java.lang.ArrayIndexOutOfBoundsException: 191
... (same stack trace as the first exception above)

On Mon, Dec 12, 2011 at 2:50 PM, Erick Tryzelaar
erick.tryzelaar@gmail.com wrote:

[snip]

I just reindexed my test dataset, and got the exception again:
https://gist.github.com/1469788.

Even stranger, I'm now not getting all the docs returned. Out of 8849
documents (according to response.hits.total), I have gotten 6000,
7000, and 6849 documents returned. There's no error in the log (except
the exception I mentioned previously, which I only got once). I believe
that my ruby code is correct; you can find it here:
https://gist.github.com/1469802. I'm using the rubberband API.
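
For context, the scroll loop boils down to something like the
following (a sketch continuing from the request above, not the actual
gist code; the endpoint shape matches the old GET /_search/scroll
API):

    # Walk the scroll, counting what actually comes back.
    count = 0
    loop do
      uri   = URI("#{ES}/_search/scroll?scroll=5m&scroll_id=#{scroll_id}")
      batch = JSON.parse(Net::HTTP.get(uri))
      hits  = batch['hits']['hits']
      break if hits.empty?
      count    += hits.size
      scroll_id = batch['_scroll_id']  # ES hands back the same id each time
    end
    puts "got #{count} of #{result['hits']['total']} docs"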

On Mon, Dec 12, 2011 at 4:49 PM, Erick Tryzelaar
erick.tryzelaar@gmail.com wrote:

[snip]

To answer my own question about the missing docs, I tracked it down to
rubberband timing out when fetching the next batch of documents.
Occasionally it'd take longer than the default 5 seconds; raising the
timeout to 60 seconds fixes it.
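
The workaround, sketched with a raw Net::HTTP client since
rubberband's actual option name may differ; the knob that matters is
the read timeout:

    require 'net/http'
    require 'uri'

    # Raise the per-request read timeout well past the 5-second default
    # so slow scroll batches aren't abandoned mid-stream. SCROLL_ID is a
    # placeholder for a real scroll id.
    uri  = URI('http://127.0.0.1:9200/_search/scroll?scroll=5m&scroll_id=SCROLL_ID')
    http = Net::HTTP.new(uri.host, uri.port)
    http.read_timeout = 60  # seconds; 60 was enough in this test run
    resp = http.get(uri.request_uri)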

Is there any way to recover from a problem like a timeout reading the
next batch of scrolled documents? ES keeps returning the same
scroll_id, so I'm guessing grabbing the next scroll set is not
idempotent. Do I just need to abort the scroll when this situation is
detected?

Just to double check: you mean that when it times out (on a shard), ES
misbehaves? I will try to recreate it. Now that it doesn't time out
because you increased the timeout, is it OK?

On Tue, Dec 13, 2011 at 2:55 AM, Erick Tryzelaar
erick.tryzelaar@gmail.com wrote:

[snip]

Hi Shay,

No problem, I'm sorry I was unclear. Here's what happened:

  1. I connect to 127.0.0.1:9200.
  2. Rubberband (my ruby client library) transparently fetches the other
    machines in the cluster. I only have one node running on localhost,
    but ES returns the external IP address, so the client thinks there
    are two ES servers.
  3. I submit a scrolling query request.
  4. I receive a reply which contains records 1-1000 of 10,000, plus a
    scroll_id.
  5. I repeat steps 3 and 4 a couple of times.
  6. At some point I submit a request for records 6000-7000 and the
    reply takes longer than 5 seconds.
  7. Rubberband gives up, closes the connection, and retries the same
    query with the same scroll_id against the other address, which
    happens to be my machine's external IP address.
  8. ES returns records 7000-8000. Records 6000 through 7000 were
    skipped.
  9. I repeat until I get an empty set of hits.

At the moment, the only way to detect that this happened is to keep a
running count of the records I receive and compare it to the total in
response.hits.total. If the counts don't match, I must have missed some
records, so I have to restart from the beginning.
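
In code, the detection amounts to something like this (a sketch
picking up the count from the scroll loop earlier; restart_scroll is a
hypothetical helper that re-runs the initial query):

    # A shortfall against hits.total means a batch was silently skipped
    # and the whole scroll has to be redone from the start.
    expected = result['hits']['total']
    if count < expected
      warn "only got #{count}/#{expected} docs, restarting scroll"
      restart_scroll  # hypothetical helper: re-opens the scroll from scratch
    end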

On Tue, Dec 13, 2011 at 2:32 PM, Shay Banon kimchy@gmail.com wrote:

[snip]

Ahh, now I understand. The scroll_id does not include the "from". You
executed the scroll request to get 6000-7000 and the connection timed
out, but the request still executed on the elasticsearch side, so the
next request with the "same" scroll id advanced to the next set (the
"from" is stored in a search context on each shard).

Potentially, we could improve that by storing the "from" in the scroll
id, so that repeating a request with the same scroll id would return
the same results. But that could open a can of other problems.
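
In other words, two requests with the same scroll_id return
consecutive batches, not the same batch twice. A sketch against the
raw REST API (reusing the scroll_id from the earlier sketches)
illustrates the hazard:

    require 'net/http'
    require 'json'
    require 'uri'

    # Each call with the *same* scroll_id advances the per-shard "from"
    # held in the server-side search context.
    uri    = URI("http://127.0.0.1:9200/_search/scroll?scroll=5m&scroll_id=#{scroll_id}")
    first  = JSON.parse(Net::HTTP.get(uri))['hits']['hits']
    second = JSON.parse(Net::HTTP.get(uri))['hits']['hits']
    # first and second are different slices of the result set, so a retry
    # after a client-side timeout silently skips whatever the timed-out
    # request consumed.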

On Wed, Dec 14, 2011 at 2:40 AM, Erick Tryzelaar
erick.tryzelaar@gmail.com wrote:

[snip]