Random search errors from one node of three on a previously working cluster


(ian_clark) #1

Hi,

I'm getting some random errors that started happening today. This is
one error:-

{"error":"ReduceSearchPhaseException[Failed to execute phase [fetch],
[reduce] ]; nested: ArrayIndexOutOfBoundsException[1]; ","status":500}

and this is another:-

{"error":"ReduceSearchPhaseException[Failed to execute phase [fetch],
[reduce] ; shardFailures {SearchContextMissingException[No search
context found for id [36813]]}{RemoteTransportException[[Aardwolf]
[inet[/10.206.38.97:9300]][search/phase/fetch/id]]; nested:
SearchContextMissingException[No search context found for id
[55197]]; }]; nested: IndexOutOfBoundsException[index (0) must be less
than size (0)]; ","status":500}

I am load testing my cluster by sending pseudo-random generated search
requests. I can take the search requests that fail, and replay them
manually and they work. Strange, as this test was working for the past
few days.... I have found exceptions in my logs on only one node, I
have copied them here:-

Any ideas what could be going wrong? Shall I raise a bug?

cheers

Ian


(Clinton Gormley) #2

Hi Ian

I am load testing my cluster by sending pseudo-random generated search
requests. I can take the search requests that fail, and replay them
manually and they work. Strange, as this test was working for the past
few days.... I have found exceptions in my logs on only one node, I
have copied them here:-

https://gist.github.com/1066838

I see that the string of errors start with this one:

Caught exception while handling client http traffic, closing connection
java.lang.IllegalStateException: cannot send more responses than
requests

I see this error a few times a day, and it starts a run of bad
responses, eg I search index A and the results contain docs from index A
and index B, or the wrong types are returned etc

It's like the responses from each node are being mixed up.

Do you have any way of replicating this? even infrequently?

It would help greatly in the debugging process

thanks

Clint


(ian_clark) #3

After a period of getting those responses, I restarted the test to
capture responses to file, and damn it has not happened since. :frowning:

This was the first test run after I restarted the nodes one by one
without issuing any shutdowns, maybe that could cause that same state?
I will try that again.

Ian

On Jul 6, 10:04 am, Clinton Gormley clin...@iannounce.co.uk wrote:

Hi Ian

I am load testing my cluster by sending pseudo-random generated search
requests. I can take the search requests that fail, and replay them
manually and they work. Strange, as this test was working for the past
few days.... I have found exceptions in my logs on only one node, I
have copied them here:-

https://gist.github.com/1066838

I see that the string of errors start with this one:

Caught exception while handling client http traffic, closing connection
java.lang.IllegalStateException: cannot send more responses than
requests

I see this error a few times a day, and it starts a run of bad
responses, eg I search index A and the results contain docs from index A
and index B, or the wrong types are returned etc

It's like the responses from each node are being mixed up.

Do you have any way of replicating this? even infrequently?

It would help greatly in the debugging process

thanks

Clint


(Shay Banon) #4

If there is a way to recreate it, I would love to try and nail it...

On Wednesday, July 6, 2011 at 12:39 PM, Ian wrote:

After a period of getting those responses, I restarted the test to
capture responses to file, and damn it has not happened since. :frowning:

This was the first test run after I restarted the nodes one by one
without issuing any shutdowns, maybe that could cause that same state?
I will try that again.

Ian

On Jul 6, 10:04 am, Clinton Gormley <clin...@iannounce.co.uk (http://iannounce.co.uk)> wrote:

Hi Ian

I am load testing my cluster by sending pseudo-random generated search
requests. I can take the search requests that fail, and replay them
manually and they work. Strange, as this test was working for the past
few days.... I have found exceptions in my logs on only one node, I
have copied them here:-

https://gist.github.com/1066838

I see that the string of errors start with this one:

Caught exception while handling client http traffic, closing connection
java.lang.IllegalStateException: cannot send more responses than
requests

I see this error a few times a day, and it starts a run of bad
responses, eg I search index A and the results contain docs from index A
and index B, or the wrong types are returned etc

It's like the responses from each node are being mixed up.

Do you have any way of replicating this? even infrequently?

It would help greatly in the debugging process

thanks

Clint


(ian_clark) #5

Hi Shay,

I'm still trying to reproduce this reliably, I repeated the same
steps, process kill the nodes and bring them back, and I haven't been
able to reliably reproduce it. I will keep trying as I need to get to
the bottom of this.

It has happened once today however, is there any useful log settings I
could add? (I'll try setting them all to debug)

Ian

On Jul 6, 5:08 pm, Shay Banon shay.ba...@elasticsearch.com wrote:

If there is a way to recreate it, I would love to try and nail it...

On Wednesday, July 6, 2011 at 12:39 PM, Ian wrote:

After a period of getting those responses, I restarted the test to
capture responses to file, and damn it has not happened since. :frowning:

This was the first test run after I restarted the nodes one by one
without issuing any shutdowns, maybe that could cause that same state?
I will try that again.

Ian

On Jul 6, 10:04 am, Clinton Gormley <clin...@iannounce.co.uk (http://iannounce.co.uk)> wrote:

Hi Ian

I am load testing my cluster by sending pseudo-random generated search
requests. I can take the search requests that fail, and replay them
manually and they work. Strange, as this test was working for the past
few days.... I have found exceptions in my logs on only one node, I
have copied them here:-

https://gist.github.com/1066838

I see that the string of errors start with this one:

Caught exception while handling client http traffic, closing connection
java.lang.IllegalStateException: cannot send more responses than
requests

I see this error a few times a day, and it starts a run of bad
responses, eg I search index A and the results contain docs from index A
and index B, or the wrong types are returned etc

It's like the responses from each node are being mixed up.

Do you have any way of replicating this? even infrequently?

It would help greatly in the debugging process

thanks

Clint


(Shay Banon) #6

Ian, nothing specific... . Which client lib to talk to ES are you using?

On Thursday, July 7, 2011 at 1:59 PM, Ian wrote:

Hi Shay,

I'm still trying to reproduce this reliably, I repeated the same
steps, process kill the nodes and bring them back, and I haven't been
able to reliably reproduce it. I will keep trying as I need to get to
the bottom of this.

It has happened once today however, is there any useful log settings I
could add? (I'll try setting them all to debug)

Ian

On Jul 6, 5:08 pm, Shay Banon <shay.ba...@elasticsearch.com (http://elasticsearch.com)> wrote:

If there is a way to recreate it, I would love to try and nail it...

On Wednesday, July 6, 2011 at 12:39 PM, Ian wrote:

After a period of getting those responses, I restarted the test to
capture responses to file, and damn it has not happened since. :frowning:

This was the first test run after I restarted the nodes one by one
without issuing any shutdowns, maybe that could cause that same state?
I will try that again.

Ian

On Jul 6, 10:04 am, Clinton Gormley <clin...@iannounce.co.uk (http://iannounce.co.uk)> wrote:

Hi Ian

I am load testing my cluster by sending pseudo-random generated search
requests. I can take the search requests that fail, and replay them
manually and they work. Strange, as this test was working for the past
few days.... I have found exceptions in my logs on only one node, I
have copied them here:-

https://gist.github.com/1066838

I see that the string of errors start with this one:

Caught exception while handling client http traffic, closing connection
java.lang.IllegalStateException: cannot send more responses than
requests

I see this error a few times a day, and it starts a run of bad
responses, eg I search index A and the results contain docs from index A
and index B, or the wrong types are returned etc

It's like the responses from each node are being mixed up.

Do you have any way of replicating this? even infrequently?

It would help greatly in the debugging process

thanks

Clint


(ian_clark) #7

I'm just load testing it at the moment, using jmeter. I can seem to
reproduce it with a trivial 3 node cluster, and if I forcibly kill a
node whilst throwing lot's of searches at the cluster. That wasn't my
original scenario, then the cluster was healthy and I was resuming
tests only, but I get a similar stack trace:-

Caused by: java.lang.IndexOutOfBoundsException: index (0) must be less
than size (0)
at
org.elasticsearch.common.base.Preconditions.checkElementIndex(Preconditions.java:
301)
at
org.elasticsearch.common.base.Preconditions.checkElementIndex(Preconditions.java:
280)
at org.elasticsearch.common.collect.Iterables.get(Iterables.java:649)
at
org.elasticsearch.search.controller.SearchPhaseController.merge(SearchPhaseController.java:
259)
at
org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction
$AsyncAction.innerFinishHim(TransportSearchQueryThenFetchAction.java:
179)
at
org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction
$AsyncAction.finishHim(TransportSearchQueryThenFetchAction.java:164)

Could this exception be causing valid search contexts to be released?

https://github.com/elasticsearch/elasticsearch/blob/master/modules/elasticsearch/src/main/java/org/elasticsearch/action/search/type/TransportSearchQueryThenFetchAction.java#L172

(A wild guess, I'm not even sure what the search context is!!)

IC

On Jul 7, 3:42 pm, Shay Banon shay.ba...@elasticsearch.com wrote:

Ian, nothing specific... . Which client lib to talk to ES are you using?

On Thursday, July 7, 2011 at 1:59 PM, Ian wrote:

Hi Shay,

I'm still trying to reproduce this reliably, I repeated the same
steps, process kill the nodes and bring them back, and I haven't been
able to reliably reproduce it. I will keep trying as I need to get to
the bottom of this.

It has happened once today however, is there any useful log settings I
could add? (I'll try setting them all to debug)

Ian

On Jul 6, 5:08 pm, Shay Banon <shay.ba...@elasticsearch.com (http://elasticsearch.com)> wrote:

If there is a way to recreate it, I would love to try and nail it...

On Wednesday, July 6, 2011 at 12:39 PM, Ian wrote:

After a period of getting those responses, I restarted the test to
capture responses to file, and damn it has not happened since. :frowning:

This was the first test run after I restarted the nodes one by one
without issuing any shutdowns, maybe that could cause that same state?
I will try that again.

Ian

On Jul 6, 10:04 am, Clinton Gormley <clin...@iannounce.co.uk (http://iannounce.co.uk)> wrote:

Hi Ian

I am load testing my cluster by sending pseudo-random generated search
requests. I can take the search requests that fail, and replay them
manually and they work. Strange, as this test was working for the past
few days.... I have found exceptions in my logs on only one node, I
have copied them here:-

https://gist.github.com/1066838

I see that the string of errors start with this one:

Caught exception while handling client http traffic, closing connection
java.lang.IllegalStateException: cannot send more responses than
requests

I see this error a few times a day, and it starts a run of bad
responses, eg I search index A and the results contain docs from index A
and index B, or the wrong types are returned etc

It's like the responses from each node are being mixed up.

Do you have any way of replicating this? even infrequently?

It would help greatly in the debugging process

thanks

Clint


(Shay Banon) #8

This failure has been fixed in master, but it is "guarded" in terms of things being properly released because of that.

On Thursday, July 7, 2011 at 7:45 PM, Ian wrote:

I'm just load testing it at the moment, using jmeter. I can seem to
reproduce it with a trivial 3 node cluster, and if I forcibly kill a
node whilst throwing lot's of searches at the cluster. That wasn't my
original scenario, then the cluster was healthy and I was resuming
tests only, but I get a similar stack trace:-

Caused by: java.lang.IndexOutOfBoundsException: index (0) must be less
than size (0)
at
org.elasticsearch.common.base.Preconditions.checkElementIndex(Preconditions.java:
301)
at
org.elasticsearch.common.base.Preconditions.checkElementIndex(Preconditions.java:
280)
at org.elasticsearch.common.collect.Iterables.get(Iterables.java:649)
at
org.elasticsearch.search.controller.SearchPhaseController.merge(SearchPhaseController.java:
259)
at
org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction
$AsyncAction.innerFinishHim(TransportSearchQueryThenFetchAction.java:
179)
at
org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction
$AsyncAction.finishHim(TransportSearchQueryThenFetchAction.java:164)

Could this exception be causing valid search contexts to be released?

https://github.com/elasticsearch/elasticsearch/blob/master/modules/elasticsearch/src/main/java/org/elasticsearch/action/search/type/TransportSearchQueryThenFetchAction.java#L172

(A wild guess, I'm not even sure what the search context is!!)

IC

On Jul 7, 3:42 pm, Shay Banon <shay.ba...@elasticsearch.com (http://elasticsearch.com)> wrote:

Ian, nothing specific... . Which client lib to talk to ES are you using?

On Thursday, July 7, 2011 at 1:59 PM, Ian wrote:

Hi Shay,

I'm still trying to reproduce this reliably, I repeated the same
steps, process kill the nodes and bring them back, and I haven't been
able to reliably reproduce it. I will keep trying as I need to get to
the bottom of this.

It has happened once today however, is there any useful log settings I
could add? (I'll try setting them all to debug)

Ian

On Jul 6, 5:08 pm, Shay Banon <shay.ba...@elasticsearch.com (http://elasticsearch.com) (http://elasticsearch.com)> wrote:

If there is a way to recreate it, I would love to try and nail it...

On Wednesday, July 6, 2011 at 12:39 PM, Ian wrote:

After a period of getting those responses, I restarted the test to
capture responses to file, and damn it has not happened since. :frowning:

This was the first test run after I restarted the nodes one by one
without issuing any shutdowns, maybe that could cause that same state?
I will try that again.

Ian

On Jul 6, 10:04 am, Clinton Gormley <clin...@iannounce.co.uk (http://iannounce.co.uk) (http://iannounce.co.uk)> wrote:

Hi Ian

I am load testing my cluster by sending pseudo-random generated search
requests. I can take the search requests that fail, and replay them
manually and they work. Strange, as this test was working for the past
few days.... I have found exceptions in my logs on only one node, I
have copied them here:-

https://gist.github.com/1066838

I see that the string of errors start with this one:

Caught exception while handling client http traffic, closing connection
java.lang.IllegalStateException: cannot send more responses than
requests

I see this error a few times a day, and it starts a run of bad
responses, eg I search index A and the results contain docs from index A
and index B, or the wrong types are returned etc

It's like the responses from each node are being mixed up.

Do you have any way of replicating this? even infrequently?

It would help greatly in the debugging process

thanks

Clint


(ian_clark) #9

What does "guarded" mean? Shall I test with master or keep trying to
reliably reproduce using 0.16.2?

IC

On Jul 7, 6:24 pm, Shay Banon shay.ba...@elasticsearch.com wrote:

This failure has been fixed in master, but it is "guarded" in terms of things being properly released because of that.

On Thursday, July 7, 2011 at 7:45 PM, Ian wrote:

I'm just load testing it at the moment, using jmeter. I can seem to
reproduce it with a trivial 3 node cluster, and if I forcibly kill a
node whilst throwing lot's of searches at the cluster. That wasn't my
original scenario, then the cluster was healthy and I was resuming
tests only, but I get a similar stack trace:-

Caused by: java.lang.IndexOutOfBoundsException: index (0) must be less
than size (0)
at
org.elasticsearch.common.base.Preconditions.checkElementIndex(Preconditions .java:
301)
at
org.elasticsearch.common.base.Preconditions.checkElementIndex(Preconditions .java:
280)
at org.elasticsearch.common.collect.Iterables.get(Iterables.java:649)
at
org.elasticsearch.search.controller.SearchPhaseController.merge(SearchPhase Controller.java:
259)
at
org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction
$AsyncAction.innerFinishHim(TransportSearchQueryThenFetchAction.java:
179)
at
org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction
$AsyncAction.finishHim(TransportSearchQueryThenFetchAction.java:164)

Could this exception be causing valid search contexts to be released?

https://github.com/elasticsearch/elasticsearch/blob/master/modules/el...

(A wild guess, I'm not even sure what the search context is!!)

IC

On Jul 7, 3:42 pm, Shay Banon <shay.ba...@elasticsearch.com (http://elasticsearch.com)> wrote:

Ian, nothing specific... . Which client lib to talk to ES are you using?

On Thursday, July 7, 2011 at 1:59 PM, Ian wrote:

Hi Shay,

I'm still trying to reproduce this reliably, I repeated the same
steps, process kill the nodes and bring them back, and I haven't been
able to reliably reproduce it. I will keep trying as I need to get to
the bottom of this.

It has happened once today however, is there any useful log settings I
could add? (I'll try setting them all to debug)

Ian

On Jul 6, 5:08 pm, Shay Banon <shay.ba...@elasticsearch.com (http://elasticsearch.com) (http://elasticsearch.com)> wrote:

If there is a way to recreate it, I would love to try and nail it...

On Wednesday, July 6, 2011 at 12:39 PM, Ian wrote:

After a period of getting those responses, I restarted the test to
capture responses to file, and damn it has not happened since. :frowning:

This was the first test run after I restarted the nodes one by one
without issuing any shutdowns, maybe that could cause that same state?
I will try that again.

Ian

On Jul 6, 10:04 am, Clinton Gormley <clin...@iannounce.co.uk (http://iannounce.co.uk) (http://iannounce.co.uk)> wrote:

Hi Ian

I am load testing my cluster by sending pseudo-random generated search
requests. I can take the search requests that fail, and replay them
manually and they work. Strange, as this test was working for the past
few days.... I have found exceptions in my logs on only one node, I
have copied them here:-

https://gist.github.com/1066838

I see that the string of errors start with this one:

Caught exception while handling client http traffic, closing connection
java.lang.IllegalStateException: cannot send more responses than
requests

I see this error a few times a day, and it starts a run of bad
responses, eg I search index A and the results contain docs from index A
and index B, or the wrong types are returned etc

It's like the responses from each node are being mixed up.

Do you have any way of replicating this? even infrequently?

It would help greatly in the debugging process

thanks

Clint


(Shay Banon) #10

Sorry, by guarded I mean that its properly handled. Either master or 0.16.2 are fine, if you manage to recreate it. I can run the relevant version once you do reproduce it. Thanks!

On Friday, July 8, 2011 at 1:54 PM, Ian wrote:

What does "guarded" mean? Shall I test with master or keep trying to
reliably reproduce using 0.16.2?

IC

On Jul 7, 6:24 pm, Shay Banon <shay.ba...@elasticsearch.com (http://elasticsearch.com)> wrote:

This failure has been fixed in master, but it is "guarded" in terms of things being properly released because of that.

On Thursday, July 7, 2011 at 7:45 PM, Ian wrote:

I'm just load testing it at the moment, using jmeter. I can seem to
reproduce it with a trivial 3 node cluster, and if I forcibly kill a
node whilst throwing lot's of searches at the cluster. That wasn't my
original scenario, then the cluster was healthy and I was resuming
tests only, but I get a similar stack trace:-

Caused by: java.lang.IndexOutOfBoundsException: index (0) must be less
than size (0)
at
org.elasticsearch.common.base.Preconditions.checkElementIndex(Preconditions .java:
301)
at
org.elasticsearch.common.base.Preconditions.checkElementIndex(Preconditions .java:
280)
at org.elasticsearch.common.collect.Iterables.get(Iterables.java:649)
at
org.elasticsearch.search.controller.SearchPhaseController.merge(SearchPhase Controller.java:
259)
at
org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction
$AsyncAction.innerFinishHim(TransportSearchQueryThenFetchAction.java:
179)
at
org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction
$AsyncAction.finishHim(TransportSearchQueryThenFetchAction.java:164)

Could this exception be causing valid search contexts to be released?

https://github.com/elasticsearch/elasticsearch/blob/master/modules/el...

(A wild guess, I'm not even sure what the search context is!!)

IC

On Jul 7, 3:42 pm, Shay Banon <shay.ba...@elasticsearch.com (http://elasticsearch.com)> wrote:

Ian, nothing specific... . Which client lib to talk to ES are you using?

On Thursday, July 7, 2011 at 1:59 PM, Ian wrote:

Hi Shay,

I'm still trying to reproduce this reliably, I repeated the same
steps, process kill the nodes and bring them back, and I haven't been
able to reliably reproduce it. I will keep trying as I need to get to
the bottom of this.

It has happened once today however, is there any useful log settings I
could add? (I'll try setting them all to debug)

Ian

On Jul 6, 5:08 pm, Shay Banon <shay.ba...@elasticsearch.com (http://elasticsearch.com) (http://elasticsearch.com)> wrote:

If there is a way to recreate it, I would love to try and nail it...

On Wednesday, July 6, 2011 at 12:39 PM, Ian wrote:

After a period of getting those responses, I restarted the test to
capture responses to file, and damn it has not happened since. :frowning:

This was the first test run after I restarted the nodes one by one
without issuing any shutdowns, maybe that could cause that same state?
I will try that again.

Ian

On Jul 6, 10:04 am, Clinton Gormley <clin...@iannounce.co.uk (http://iannounce.co.uk) (http://iannounce.co.uk)> wrote:

Hi Ian

I am load testing my cluster by sending pseudo-random generated search
requests. I can take the search requests that fail, and replay them
manually and they work. Strange, as this test was working for the past
few days.... I have found exceptions in my logs on only one node, I
have copied them here:-

https://gist.github.com/1066838

I see that the string of errors start with this one:

Caught exception while handling client http traffic, closing connection
java.lang.IllegalStateException: cannot send more responses than
requests

I see this error a few times a day, and it starts a run of bad
responses, eg I search index A and the results contain docs from index A
and index B, or the wrong types are returned etc

It's like the responses from each node are being mixed up.

Do you have any way of replicating this? even infrequently?

It would help greatly in the debugging process

thanks

Clint


(system) #11