String index out of range: -1 Exception

Hi,
I'm running Elasticsearch (0.16.2) on a dedicated cluster with 17 nodes (100
shards, currently 0 replicas). After indexing about 130 million documents,
I'm using Hadoop MapReduce to execute about 100,000 queries. For my use case
I need to fetch all hits for each query (most of them are text or span_near
queries, and some yield more than 20 million IDs), so I'm using
SearchType.SCAN with scrolls in order to avoid sorting all the results.

In order to avoid overhead (and reduce the initialization time), I'm using
TransportClient instead of client-only nodes.
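
For reference, the scan-and-scroll loop described above follows a simple pattern: start a scan, then keep fetching pages with the latest scroll id until a page comes back empty. A minimal, self-contained sketch (in Python rather than the Java client used here; `fetch_scroll_page` and the toy page format are stand-ins for the real client call, not the actual API):

```python
def collect_all_hits(fetch_scroll_page, initial_scroll_id):
    """Drain a scan/scroll cursor: keep fetching pages until one is empty.

    `fetch_scroll_page` stands in for the real client call; it takes a
    scroll id and returns (next_scroll_id, hits) for that page.
    """
    all_hits = []
    scroll_id = initial_scroll_id
    while True:
        scroll_id, hits = fetch_scroll_page(scroll_id)
        if not hits:  # an empty page signals the end of the scroll
            break
        all_hits.extend(hits)
    return all_hits

# Toy stand-in for the cluster: three pages of ids, then an empty page.
_pages = {"s0": ("s1", ["doc1", "doc2"]),
          "s1": ("s2", ["doc3"]),
          "s2": ("s3", [])}

def fake_fetch(scroll_id):
    return _pages[scroll_id]
```

The important detail is that each response carries the scroll id to use for the *next* request, so the id must be passed through untouched.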

Occasionally a weird error occurs; the stack trace is at:
https://gist.github.com/1073836

As the error doesn't occur frequently and the stack trace doesn't look like
a 'normal' network error, I think it may be a bug in Elasticsearch - I
couldn't find any information about the stack trace on Google.

It would be great if any of you could review the stack trace - many thanks
in advance!

Best regards,
Uli

Hey,

Strange, it seems like the scroll id is malformed in some way. I have pushed an improvement to include the scroll id itself in the failure, for easier debugging in the future. Is there a chance that the scroll id passed is munged in your code?

-shay.banon

On Saturday, July 9, 2011 at 9:46 PM, Uli Köhler wrote:

Hi Shay,
Thanks for your fast reply!

I don't think my code modifies the Scroll ID in any way - here's the snippet
scrolling over the result set:
https://gist.github.com/1074510

Actually I'm getting another, similar exception:
https://gist.github.com/1074512

I set the maximum number of retries to 8 for each MapReduce job. Some of my
queries take 3 attempts to succeed, but in the end each one succeeds, so this
doesn't seem to be a problem related to specific queries.
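
The retry behaviour described above (up to 8 attempts, each query eventually succeeding) can be sketched as a small wrapper. This is purely illustrative; `run_query` and the plain `Exception` catch are placeholders, not the real client API:

```python
def with_retries(run_query, max_retries=8):
    """Call `run_query` until it succeeds or `max_retries` attempts are used.

    Returns (result, attempts_used); re-raises the last error if all fail.
    """
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return run_query(), attempt
        except Exception as e:  # real code should catch the client's exception type
            last_error = e
    raise last_error

# Toy query that fails twice before succeeding, like the jobs described above.
_calls = {"n": 0}
def flaky_query():
    _calls["n"] += 1
    if _calls["n"] < 3:
        raise RuntimeError("malformed scroll id")
    return "hits"
```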

Best regards,
Uli

2011/7/10 Shay Banon shay.banon@elasticsearch.com

Hi,

I just had another idea: could it be that some of my queries timed
out?
My timeout is set to 10 minutes (600,000 milliseconds). Is the timeout
counted between two scroll requests, or between the original scan
request (returning the first scroll ID) and the last scroll fetch? On the
other hand, I have successful queries that took more than 10 minutes.

Best regards, Uli

2011/7/10 Uli Köhler ulikoehler.dev@googlemail.com

The timeout applies between scroll requests within the same scrolling "process". A timeout should not cause this failure, though...
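
In other words, the keep-alive window is reset on every scroll request, so a job can run far longer than the timeout in total as long as no single gap between consecutive requests exceeds it. A toy model with explicit timestamps (the 600-second value comes from the question above):

```python
def scroll_survives(request_times, keepalive_seconds=600):
    """Return True if no gap between consecutive scroll requests exceeds
    the keep-alive window. Total duration is irrelevant; only gaps matter."""
    gaps = (b - a for a, b in zip(request_times, request_times[1:]))
    return all(gap <= keepalive_seconds for gap in gaps)

# A 30-minute job with a request every 5 minutes keeps the scroll alive,
# even though the whole job takes three times the 10-minute timeout.
times = [0, 300, 600, 900, 1200, 1500, 1800]
```

This is consistent with the observation above that queries taking more than 10 minutes overall still succeed.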

On Sunday, July 10, 2011 at 4:13 PM, Uli Köhler wrote:
