Hi,
I'm running ElasticSearch (0.16.2) on a dedicated cluster with 17 nodes (100
shards, currently 0 replicas). After indexing about 130 million documents,
I'm using Hadoop MapReduce to execute about 100000 queries. For my use case I
need to fetch all hits for each query (most of them are text or span_near
queries and some yield more than 20 million IDs), so I'm using
SearchType.SCAN and scrolls in order to avoid sorting all the results.
In order to avoid overhead (and reduce the initialization time) I'm using
TransportClient instead of client-only-nodes.
As the error doesn't occur frequently and the stack trace doesn't look like
a 'normal' network error, I think it may be a bug in ElasticSearch - I
didn't find any information on the stack trace on Google.
It would be great if any of you could review the stack trace - many thanks
in advance!
Strange, it seems like the scroll id is malformed in some way. I have pushed an improvement that includes the scroll id itself in the failure in this case, for easier debugging in the future. Is there a chance that the scroll id you pass gets munged in your code?
-shay.banon
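One way to rule out a munged scroll id on the client side is to inspect it right before each scroll request. The sketch below is only an illustration: `ScrollIdCheck` and `findProblems` are hypothetical names, not part of the Elasticsearch API, and the check assumes that a scroll id is an opaque string of printable ASCII characters without whitespace. If that assumption does not hold for your version, adjust the character test accordingly.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of the Elasticsearch API. */
final class ScrollIdCheck {

    /** Returns human-readable problems; an empty list means no obvious damage was found. */
    static List<String> findProblems(String scrollId) {
        List<String> problems = new ArrayList<>();
        if (scrollId == null) {
            problems.add("scroll id is null");
            return problems;
        }
        if (scrollId.isEmpty()) {
            problems.add("scroll id is empty");
            return problems;
        }
        for (int i = 0; i < scrollId.length(); i++) {
            char c = scrollId.charAt(i);
            if (Character.isWhitespace(c)) {
                // Typical symptom of an id that was read back from a text file or log line.
                problems.add("whitespace at index " + i);
            } else if (c < 0x21 || c > 0x7e) {
                // Assumption: a healthy id contains printable ASCII only.
                problems.add("non-printable or non-ASCII character at index " + i);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // Made-up ids for illustration only; real ids come from the search response.
        System.out.println(findProblems("c2Nhbjs1OzE6YWJj"));
        System.out.println(findProblems("c2Nhbjs1OzE6YWJj\n"));
    }
}
```

Logging the result of such a check together with the id whenever a scroll request fails would show whether the id changed between the response that returned it and the request that sent it back.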
On Saturday, July 9, 2011 at 9:46 PM, Uli Köhler wrote:
Hi,
I'm running Elasticsearch (0.16.2) on a dedicated cluster with 17 nodes (100 shards, currently 0 replicas). After indexing about 130 million documents, I'm using Hadoop MapReduce to execute about 100000 queries. For my use case I need to fetch all hits for each query (most of them are text or span_near queries and some yield more than 20 million IDs), so I'm using SearchType.SCAN and scrolls in order to avoid sorting all the results.
In order to avoid overhead (and reduce the initialization time) I'm using TransportClient instead of client-only-nodes.
As the error doesn't occur frequently and the stack trace doesn't look like a 'normal' network error, I think it may be a bug in Elasticsearch - I didn't find any information on the stack trace on Google.
It would be great if any of you could review the stack trace - many thanks in advance!
I set the maximum number of retries to 8 for each MapReduce job. Some of my
queries take 3 attempts to succeed, but in the end every one of them succeeds,
so this doesn't seem to be a problem related to specific queries.
I just got another idea: could it be that some of my queries timed
out?
My timeout is set to 10 minutes (600000 milliseconds). Is the timeout
counted between two scroll requests, or between the original search
request (the one returning the first scroll ID) and the last scroll fetch?
On the other hand, I have successful queries that took more than 10 minutes.
The timeout applies between scroll requests within the same scrolling "process". A timeout should not cause this failure though...
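The following toy model illustrates that answer: the countdown restarts on every scroll request, so a scan can run far longer than the timeout as long as no single gap between two requests exceeds it. `ScrollKeepAliveModel` is a hypothetical class written for this illustration and is not Elasticsearch code; timestamps are passed in explicitly so the behaviour can be followed without waiting.

```java
/** Toy model of a scroll keep-alive; hypothetical, not Elasticsearch code. */
final class ScrollKeepAliveModel {
    private final long keepAliveMillis;
    private long lastAccessMillis;

    ScrollKeepAliveModel(long keepAliveMillis, long startMillis) {
        this.keepAliveMillis = keepAliveMillis;
        this.lastAccessMillis = startMillis;
    }

    /** Records a scroll request at nowMillis; returns false if the context had already expired. */
    boolean request(long nowMillis) {
        if (nowMillis - lastAccessMillis > keepAliveMillis) {
            return false; // too long since the previous request
        }
        lastAccessMillis = nowMillis; // each request restarts the countdown
        return true;
    }

    public static void main(String[] args) {
        long keepAlive = 600_000L; // the 10 minute timeout from the thread
        ScrollKeepAliveModel scroll = new ScrollKeepAliveModel(keepAlive, 0L);
        // Three requests 8 minutes apart: 24 minutes in total, never more than 10 between two.
        System.out.println(scroll.request(480_000L));
        System.out.println(scroll.request(960_000L));
        System.out.println(scroll.request(1_440_000L));
        // A 12 minute pause before the next request exceeds the keep-alive.
        System.out.println(scroll.request(2_160_000L));
    }
}
```

Under this model, queries that took more than 10 minutes overall can still succeed, which matches what was observed in the thread.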
On Sunday, July 10, 2011 at 4:13 PM, Uli Köhler wrote:
Hi,
I just got another idea: Could it be possible that any of my queries timed out?
My timeout is set to 10 minutes (600000 milliseconds). Is the timeout counted between two scroll requests, or between the original search request (the one returning the first scroll ID) and the last scroll fetch? On the other hand, I have successful queries that took more than 10 minutes.