Shard failure, RemoteTransportException [search/phase/fetch/id]... IllegalArgumentException

Hi -

When doing searches against an index, we get shard failures as shown in the
query result below.

{"took" : 118,
"timed_out" : false,
"_shards" : {
"total" : 32,
"successful" : 31,
"failed" : 1,
"failures" : [ {
"status" : 500,
"reason" :
"RemoteTransportException[[Surge][inet[/...:9300]][search/phase/fetch/id]];
nested: IllegalArgumentException[docID must be >= 0 and < maxDoc=1 (got
docID=2147330367)]; "
} ]
},
"hits" : {
"total" : 15
...

==
The stack trace is the following:

2013-01-08 19:11:02,089 [elasticsearch[Neptune][transport_client_worker][T#2]{New I/O worker #2}] DEBUG [type] - [Neptune] [55784508] Failed to execute fetch phase
org.elasticsearch.transport.RemoteTransportException: [Ringer][inet[/...:9300]][search/phase/fetch/id]
Caused by: java.lang.IllegalArgumentException: docID must be >= 0 and < maxDoc=1 (got docID=2147335046)
    at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:490)
    at org.apache.lucene.index.DirectoryReader.document(DirectoryReader.java:568)
    at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:264)
    at org.elasticsearch.search.fetch.FetchPhase.loadDocument(FetchPhase.java:288)
    at org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:159)
    at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:430)
    at org.elasticsearch.search.action.SearchServiceTransportAction$SearchFetchByIdTransportHandler.messageReceived(SearchServiceTransportAction.java:602)
    at org.elasticsearch.search.action.SearchServiceTransportAction$SearchFetchByIdTransportHandler.messageReceived(SearchServiceTransportAction.java:591)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:268)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)

=====

We've run into this problem before with another query, eventually becoming
unable to retrieve any documents at all. Running _optimize helped, but after a
few days we start getting these errors again.

Some more details:
Version: 0.19.11
Replication factor: 2
Health: {"cluster_name" : "...",

"status" : "green",
"timed_out" : false,
"number_of_nodes" : 27,
"number_of_data_nodes" : 3,
"active_primary_shards" : 96,
"active_shards" : 192,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0
}

I'd appreciate any help/pointers.

-Ignacio

--

Are you using custom_scorer in your queries?

On Tuesday, January 8, 2013 3:13:32 PM UTC-5, Ignacio wrote:


--

No, not per query, but we do have our own scorer. We also have custom
analyzers, which we likewise don't specify per query, only per field.

Thanks.

On Tue, Jan 8, 2013 at 4:28 PM, Igor Motov imotov@gmail.com wrote:

Are you using custom_scorer in your queries?


--

Is it possible that your scorer returns NaN for some documents?

On Tuesday, January 8, 2013 7:56:27 PM UTC-5, Ignacio wrote:


--

We'll look into the scoring, Igor, but it doesn't seem to be the cause, given
that I've just finished running _optimize on the affected index and that did
fix the problem. However, I'm afraid it may pop up again.

Some more context:
We have thousands of writes per hour (we have 3 indices; the one usually
affected has very large documents, on the order of kilobytes each*). Lots of
segments have deleted docs. When I run _optimize with max_num_segments=1, it
never seems to complete, but the number of deleted documents does get reduced.

  • We've recently optimized the size of these documents and set compression
    at 5KB, but compression is probably not the cause, since we had these
    problems before turning compression on.

Thanks!

On Tue, Jan 8, 2013 at 4:56 PM, Ignacio Zendejas <ignacio.zendejas.cs@gmail.com> wrote:


--

I am just trying to find the simplest possible explanation. I know for sure
that this script (nan_score.sh, on GitHub) causes exactly the same error, and
you are using your own scorer, so to me it sounds like something that should be
checked first.

When you run _optimize, do you specify only_expunge_deletes=true? If not, try
running it with this flag to see whether it changes the situation.

Do you see any warnings in the log file saying "failed to merge"?
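
In case it helps, here is a rough Java sketch (using nothing but HttpURLConnection) of
what those two _optimize calls look like against the REST API. The host, port, and
index name "myindex" are placeholders; adjust them to your cluster, and expect the
single-segment merge to be I/O heavy on a large index.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch only: POSTs to the _optimize endpoint of an Elasticsearch 0.19-era node.
// "localhost:9200" and the index name "myindex" are placeholders.
public class OptimizeSketch {
    public static void main(String[] args) throws Exception {
        // Ask Elasticsearch to expunge segments that contain deleted docs.
        post("http://localhost:9200/myindex/_optimize?only_expunge_deletes=true");
        // Or merge everything down to a single segment, which drops all deleted docs:
        // post("http://localhost:9200/myindex/_optimize?max_num_segments=1");
    }

    static void post(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("POST");
        try (InputStream in = conn.getInputStream()) {
            byte[] buf = new byte[4096];
            for (int n; (n = in.read(buf)) != -1; ) {
                System.out.write(buf, 0, n); // echo the JSON response from the node
            }
        }
        System.out.println();
    }
}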

On Tuesday, January 8, 2013 8:13:53 PM UTC-5, Ignacio wrote:


--

I ran with the max_num_segments flag today, so I can't say whether
only_expunge_deletes=true would do the trick. I did not find a "failed to
merge" message in the logs over the past 4 days.

I was able to reproduce the exact same stack trace (apart from a different
docID) with the gist you provided. Since this is a production index, I'll have
to keep an eye on it to see if the exception is thrown again -- but we're
trying to reproduce it in our staging environment (where we haven't run
_optimize recently).

All along, my suspicion has been corrupted segments, but I'll also
investigate the possibility of deleted documents screwing up the scores.
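
For reference, while we chase the corrupted-segments theory, here is a rough sketch of
how one could run Lucene 3.x's CheckIndex against a copy of a shard's index directory.
The path below is a placeholder, and this should only ever be pointed at a copy, never
at a live shard.

import java.io.File;
import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Sketch only: verifies a copied shard index with Lucene's CheckIndex tool.
// The directory path is a placeholder.
public class CheckShardSketch {
    public static void main(String[] args) throws Exception {
        File shardCopy = new File("/path/to/copy/of/shard/index");
        Directory dir = FSDirectory.open(shardCopy);
        CheckIndex checker = new CheckIndex(dir);
        checker.setInfoStream(System.out); // print per-segment diagnostics
        CheckIndex.Status status = checker.checkIndex();
        System.out.println(status.clean ? "index is clean" : "index has problems");
        dir.close();
    }
}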

On Tue, Jan 8, 2013 at 5:37 PM, Igor Motov imotov@gmail.com wrote:


--

Igor, thanks for the pointer. It did turn out to be a problem with our scorer,
and we do indeed get a NaN. The problem is that our scorer does not work
properly in the presence of deleted docs, so running _optimize at least
temporarily fixes this because it removes them.

For the interested:

  • I was able to find a query that broke on our staging environment, so I
    downloaded the index data and ran that same query locally.
  • only_expunge_deletes=true did not remove all deleted documents (and thus
    did not fix our problem), so I ran _optimize with max_num_segments=1 (which
    removes all deleted docs) and compared the lists of retrieved documents.
  • After identifying a document that was not retrieved prior to optimizing, I
    ran an explain* with the same query on that document and found that it had
    a field which evaluated to NaN (log(0) * 0), leading to an overall score of
    NaN; see the sketch below.

*The Elasticsearch explain API (see the Elasticsearch documentation).
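
For anyone curious, here is a tiny self-contained illustration (not our actual scorer;
the numbers are made up) of how a log(0) * 0 term becomes NaN and then poisons every
later computation and comparison, plus the kind of guard one might add in a custom
scorer:

// Sketch only: shows how log(0) * 0 yields NaN under IEEE 754 arithmetic.
public class NanScoreDemo {
    public static void main(String[] args) {
        double term = Math.log(0) * 0;     // Math.log(0) is -Infinity; -Infinity * 0 is NaN
        double score = 1.5 + 2.0 * term;   // NaN propagates through further arithmetic
        System.out.println(term);          // NaN
        System.out.println(score);         // NaN
        System.out.println(score > 0.0);   // false
        System.out.println(score < 0.0);   // false -- NaN breaks ordering entirely

        // A defensive guard one might add in a custom scorer (hypothetical):
        float safe = Double.isNaN(score) ? 0f : (float) score;
        System.out.println(safe);          // 0.0
    }
}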

By the way, scoring documents with NaN will also lead to fetching duplicate
documents when paging through results, since the reported number of hits will
be greater than the number of retrievable documents.

Thanks again!

On Tue, Jan 8, 2013 at 6:33 PM, Ignacio Zendejas <ignacio.zendejas.cs@gmail.com> wrote:


--

Correction: paging through results will not lead to retrieval of duplicate
documents; you simply won't be able to retrieve the documents with NaN scores.

On Wed, Jan 9, 2013 at 3:31 PM, Ignacio Zendejas <ignacio.zendejas.cs@gmail.com> wrote:


--

Here was the root problem -- readers beware:

We didn't know Lucene's docFreq(term) would include deleted documents in
the count! This is terrible and I hope Lucene addresses this.

If your custom scorer relies on docFreq, there is an article that explains this
along with a workaround -- which I'm not necessarily vouching for. We found our
own workaround (which is specific to our scorer).
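
To make the pitfall concrete, here is a minimal sketch (again, not our actual scorer;
the IDF-style formula is only an illustration) of the mismatch: in Lucene 3.x,
IndexReader.docFreq(term) still counts deleted documents while numDocs() does not, so a
hand-rolled statistic can go non-finite on an index with many deletes unless the result
is guarded:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

// Sketch only: an illustrative IDF-like value computed from statistics that
// disagree about deleted docs, with a guard so NaN/Infinity never reaches the score.
public final class DocFreqGuard {
    private DocFreqGuard() {}

    public static float guardedIdf(IndexReader reader, Term term) throws IOException {
        int docFreq = reader.docFreq(term); // includes deleted docs
        int numDocs = reader.numDocs();     // excludes deleted docs
        double idf = Math.log((double) numDocs / (docFreq + 1)) + 1.0;
        if (Double.isNaN(idf) || Double.isInfinite(idf)) {
            return 0f;                      // never let a non-finite value poison the score
        }
        return (float) idf;
    }
}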

And scratch the last post: having NaNs DOES lead to duplicate documents being
fetched.

On Wed, Jan 9, 2013 at 3:37 PM, Ignacio Zendejas <ignacio.zendejas.cs@gmail.com> wrote:
