A couple of exceptions from stress testing scrolling

I was doing some stress testing and ran into a couple of interesting
errors when I switched from sorting by _id to sorting on _uid. This
has only happened once; rerunning the query over and over doesn't
seem to trigger it again. Please let me know if you need any more
info.
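
For reference, here is roughly what the query looks like, sketched
against the plain REST API rather than rubberband (the index name
"test", the batch size, and the scroll timeout are made up for
illustration):

    require 'net/http'
    require 'json'
    require 'uri'

    ES = 'http://127.0.0.1:9200'

    # Open a scroll over all documents, sorted on _uid. The index name
    # "test", the size, and the 5m scroll timeout are placeholders.
    uri  = URI("#{ES}/test/_search?scroll=5m&size=1000")
    body = { query: { match_all: {} }, sort: ['_uid'] }.to_json
    resp = Net::HTTP.post(uri, body, 'Content-Type' => 'application/json')
    result    = JSON.parse(resp.body)
    scroll_id = result['_scroll_id']
    puts "total hits: #{result['hits']['total']}"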

[2011-12-12 14:44:05,372][DEBUG][action.search.type ] [Bishop] Failed to execute fetch phase
java.lang.ArrayIndexOutOfBoundsException: 198
at org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:132)
at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:434)
at org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteFetch(SearchServiceTransportAction.java:314)
at org.elasticsearch.action.search.type.TransportSearchScrollQueryThenFetchAction$AsyncAction.executeFetchPhase(TransportSearchScrollQueryThenFetchAction.java:206)
at org.elasticsearch.action.search.type.TransportSearchScrollQueryThenFetchAction$AsyncAction.access$1000(TransportSearchScrollQueryThenFetchAction.java:80)
at org.elasticsearch.action.search.type.TransportSearchScrollQueryThenFetchAction$AsyncAction$3.onResult(TransportSearchScrollQueryThenFetchAction.java:174)
at org.elasticsearch.action.search.type.TransportSearchScrollQueryThenFetchAction$AsyncAction$3.onResult(TransportSearchScrollQueryThenFetchAction.java:170)
at org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteQuery(SearchServiceTransportAction.java:195)
at org.elasticsearch.action.search.type.TransportSearchScrollQueryThenFetchAction$AsyncAction.executeQueryPhase(TransportSearchScrollQueryThenFetchAction.java:170)
at org.elasticsearch.action.search.type.TransportSearchScrollQueryThenFetchAction$AsyncAction.access$700(TransportSearchScrollQueryThenFetchAction.java:80)
at org.elasticsearch.action.search.type.TransportSearchScrollQueryThenFetchAction$AsyncAction$2.run(TransportSearchScrollQueryThenFetchAction.java:157)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:680)
[2011-12-12 14:44:06,963][DEBUG][action.search.type ] [Bishop] Failed to execute fetch phase
org.elasticsearch.ElasticSearchParseException: failed to parse / load source
at org.elasticsearch.search.lookup.SourceLookup.loadSourceIfNeeded(SourceLookup.java:68)
at org.elasticsearch.search.lookup.SourceLookup.extractValue(SourceLookup.java:111)
at org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:186)
at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:434)
at org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteFetch(SearchServiceTransportAction.java:314)
at org.elasticsearch.action.search.type.TransportSearchScrollQueryThenFetchAction$AsyncAction.executeFetchPhase(TransportSearchScrollQueryThenFetchAction.java:206)
at org.elasticsearch.action.search.type.TransportSearchScrollQueryThenFetchAction$AsyncAction.access$1000(TransportSearchScrollQueryThenFetchAction.java:80)
at org.elasticsearch.action.search.type.TransportSearchScrollQueryThenFetchAction$AsyncAction$3.onResult(TransportSearchScrollQueryThenFetchAction.java:174)
at org.elasticsearch.action.search.type.TransportSearchScrollQueryThenFetchAction$AsyncAction$3.onResult(TransportSearchScrollQueryThenFetchAction.java:170)
at org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteQuery(SearchServiceTransportAction.java:195)
at org.elasticsearch.action.search.type.TransportSearchScrollQueryThenFetchAction$AsyncAction.executeQueryPhase(TransportSearchScrollQueryThenFetchAction.java:170)
at org.elasticsearch.action.search.type.TransportSearchScrollQueryThenFetchAction$AsyncAction.access$700(TransportSearchScrollQueryThenFetchAction.java:80)
at org.elasticsearch.action.search.type.TransportSearchScrollQueryThenFetchAction$AsyncAction$2.run(TransportSearchScrollQueryThenFetchAction.java:157)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:680)
Caused by: java.lang.IllegalArgumentException: docID must be >= 0 and < maxDoc=3257 (got docID=-1)
at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:475)
at org.elasticsearch.search.lookup.SourceLookup.loadSourceIfNeeded(SourceLookup.java:60)
... 15 more
[2011-12-12 14:44:07,209][DEBUG][action.search.type ] [Bishop] Failed to execute fetch phase
org.elasticsearch.ElasticSearchParseException: failed to parse / load source
... (same stack trace as above)
Caused by: java.lang.IllegalArgumentException: docID must be >= 0 and < maxDoc=3607 (got docID=-1)
at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:475)
at org.elasticsearch.search.lookup.SourceLookup.loadSourceIfNeeded(SourceLookup.java:60)
... 15 more
[2011-12-12 14:44:08,358][DEBUG][action.search.type ] [Bishop] Failed to execute fetch phase
java.lang.ArrayIndexOutOfBoundsException: 191
... (same stack trace as the first exception above)

On Mon, Dec 12, 2011 at 2:50 PM, Erick Tryzelaar
erick.tryzelaar@gmail.com wrote:

[snip]

I just reindexed my test dataset, and got the exception again:
https://gist.github.com/1469788.

Even stranger, I'm now not getting all the docs returned. Out of 8849
documents (according to response.hits.total), I have gotten 6000,
7000, and 6849 documents returned. There's no error in the log (except
the exception I mentioned previously, which I only got once). I believe
that my ruby code is correct; you can find it here:
https://gist.github.com/1469802. I'm using the rubberband API.
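
For context, the scroll loop boils down to something like the
following (a sketch continuing from the request above, not the actual
gist code; the endpoint shape matches the old GET /_search/scroll
API):

    # Walk the scroll, counting what actually comes back.
    count = 0
    loop do
      uri   = URI("#{ES}/_search/scroll?scroll=5m&scroll_id=#{scroll_id}")
      batch = JSON.parse(Net::HTTP.get(uri))
      hits  = batch['hits']['hits']
      break if hits.empty?
      count    += hits.size
      scroll_id = batch['_scroll_id']  # ES hands back the same id each time
    end
    puts "got #{count} of #{result['hits']['total']} docs"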

On Mon, Dec 12, 2011 at 4:49 PM, Erick Tryzelaar
erick.tryzelaar@gmail.com wrote:

[snip]

To answer my own question about the missing docs, I tracked it down to
rubberband timing out when fetching the next batch of documents.
Occasionally it'd take longer than the default 5 seconds; raising the
timeout to 60 seconds fixes it.
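
The workaround, sketched with a raw Net::HTTP client since
rubberband's actual option name may differ; the knob that matters is
the read timeout:

    require 'net/http'
    require 'uri'

    # Raise the per-request read timeout well past the 5-second default
    # so slow scroll batches aren't abandoned mid-stream. SCROLL_ID is a
    # placeholder for a real scroll id.
    uri  = URI('http://127.0.0.1:9200/_search/scroll?scroll=5m&scroll_id=SCROLL_ID')
    http = Net::HTTP.new(uri.host, uri.port)
    http.read_timeout = 60  # seconds; 60 was enough in this test run
    resp = http.get(uri.request_uri)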

Is there any way to recover from a problem like a timeout reading the
next batch of scrolled documents? ES keeps returning the same
scroll_id, so I'm guessing grabbing the next scroll set is not
idempotent. Do I just need to abort the scroll when this situation is
detected?

Just to double check: you mean that when it times out (on a shard), ES
misbehaves? I will try to recreate it. Now that it doesn't time out
because you increased the timeout, is it OK?

On Tue, Dec 13, 2011 at 2:55 AM, Erick Tryzelaar
erick.tryzelaar@gmail.com wrote:

[snip]

Hi Shay,

No problem, I'm sorry I was unclear. Here's what happened:

  1. I connect to 127.0.0.1:9200.
  2. Rubberband (my ruby client library) transparently fetches the other
    machines in the cluster. I only have one node running on localhost,
    but ES returns the external IP address, so the client thinks there
    are two ES servers.
  3. I submit a scrolling query request.
  4. I receive a reply which contains records 1-1000 of 10,000, plus a
    scroll_id.
  5. I repeat steps 3 and 4 a couple of times.
  6. At some point I submit a request for records 6000-7000 and the
    reply takes longer than 5 seconds.
  7. Rubberband gives up, closes the connection, and retries the same
    query with the same scroll_id against the other address, which
    happens to be my machine's external IP address.
  8. ES returns records 7000-8000. Records 6000 through 7000 were
    skipped.
  9. I repeat until I get an empty set of hits.

At the moment, the only way to detect that this happened is to keep a
running count of the records I receive and compare it to the total in
response.hits.total. If the counts don't match, I must have missed some
records, so I have to restart from the beginning.
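
In code, the detection amounts to something like this (a sketch
picking up the count from the scroll loop earlier; restart_scroll is a
hypothetical helper that re-runs the initial query):

    # A shortfall against hits.total means a batch was silently skipped
    # and the whole scroll has to be redone from the start.
    expected = result['hits']['total']
    if count < expected
      warn "only got #{count}/#{expected} docs, restarting scroll"
      restart_scroll  # hypothetical helper: re-opens the scroll from scratch
    end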

On Tue, Dec 13, 2011 at 2:32 PM, Shay Banon kimchy@gmail.com wrote:

[snip]

Ahh, now I understand. The scroll_id does not include the "from". You
executed the scroll request to get 6000-7000 and the connection timed
out, but the request still executed on the elasticsearch side, so the
next request with the "same" scroll id advanced to the next set (the
"from" is stored in a search context on each shard).

Potentially, we could improve that by storing the "from" in the scroll
id, so that repeating a request with the same scroll id would return
the same results. But that could open a can of other problems.
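
In other words, two requests with the same scroll_id return
consecutive batches, not the same batch twice. A sketch against the
raw REST API (reusing the scroll_id from the earlier sketches)
illustrates the hazard:

    require 'net/http'
    require 'json'
    require 'uri'

    # Each call with the *same* scroll_id advances the per-shard "from"
    # held in the server-side search context.
    uri    = URI("http://127.0.0.1:9200/_search/scroll?scroll=5m&scroll_id=#{scroll_id}")
    first  = JSON.parse(Net::HTTP.get(uri))['hits']['hits']
    second = JSON.parse(Net::HTTP.get(uri))['hits']['hits']
    # first and second are different slices of the result set, so a retry
    # after a client-side timeout silently skips whatever the timed-out
    # request consumed.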

On Wed, Dec 14, 2011 at 2:40 AM, Erick Tryzelaar
erick.tryzelaar@gmail.com wrote:

[snip]