Shard failure, RemoteTransportException [search/phase/fetch/id]... IllegalArgumentException

Ignacio_Zendejas · January 8, 2013, 8:13pm

Hi -

When doing searches against an index, we get shard failures as shown in the
query result below.

{"took" : 118,
"timed_out" : false,
"_shards" : {
"total" : 32,
"successful" : 31,
"failed" : 1,
"failures" : [ {
"status" : 500,
"reason" :
"RemoteTransportException[[Surge][inet[/...:9300]][search/phase/fetch/id]];
nested: IllegalArgumentException[docID must be >= 0 and < maxDoc=1 (got
docID=2147330367)]; "
} ]
},
"hits" : {
"total" : 15
...

==
The stack trace is the following:

2013-01-08 19:11:02,089
[elasticsearch[Neptune][transport_client_worker][T#2]{New I/O worker #2}]
DEBUG [type] - [Neptune] [55784508] Failed to execute fetch phase
org.elasticsearch.transport.RemoteTransportException:
[Ringer][inet[/...:9300]][search/phase/fetch/id]
Caused by: java.lang.IllegalArgumentException: docID must be >= 0 and <
maxDoc=1 (got docID=2147335046)
at
org.apache.lucene.index.SegmentReader.document(SegmentReader.java:490)
at
org.apache.lucene.index.DirectoryReader.document(DirectoryReader.java:568)
at
org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:264)
at
org.elasticsearch.search.fetch.FetchPhase.loadDocument(FetchPhase.java:288)
at
org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:159)
at
org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:430)
at
org.elasticsearch.search.action.SearchServiceTransportAction$SearchFetchByIdTransportHandler.messageReceived(SearchServiceTransportAction.java:602)
at
org.elasticsearch.search.action.SearchServiceTransportAction$SearchFetchByIdTransportHandler.messageReceived(SearchServiceTransportAction.java:591)
at
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:268)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)

=====

We've run into this problem before with another query, eventually being
unable to retrieve any documents whatsoever. Running _optimize helped, but
after a few days we get these errors again.

Some more details:
Version: 0.19.11
Replication factor: 2
Health: {"cluster_name" : "...",

"status" : "green",
"timed_out" : false,
"number_of_nodes" : 27,
"number_of_data_nodes" : 3,
"active_primary_shards" : 96,
"active_shards" : 192,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0
}

I'll appreciate any help/pointers.

-Ignacio

--

Igor_Motov · January 9, 2013, 12:28am

Are you using custom_scorer in your queries?

On Tuesday, January 8, 2013 3:13:32 PM UTC-5, Ignacio wrote:

Hi -

When doing searches against an index, we get shard failures as shown in
the query result below.

{"took" : 118,
"timed_out" : false,
"_shards" : {
"total" : 32,
"successful" : 31,
"failed" : 1,
"failures" : [ {
"status" : 500,
"reason" :
"RemoteTransportException[[Surge][inet[/...:9300]][search/phase/fetch/id]];
nested: IllegalArgumentException[docID must be >= 0 and < maxDoc=1 (got
docID=2147330367)]; "
} ]
},
"hits" : {
"total" : 15
...

==
The stack trace is the following:

2013-01-08 19:11:02,089
[elasticsearch[Neptune][transport_client_worker][T#2]{New I/O worker #2}]
DEBUG [type] - [Neptune] [55784508] Failed to execute fetch phase
org.elasticsearch.transport.RemoteTransportException:
[Ringer][inet[/...:9300]][search/phase/fetch/id]
Caused by: java.lang.IllegalArgumentException: docID must be >= 0 and <
maxDoc=1 (got docID=2147335046)
at
org.apache.lucene.index.SegmentReader.document(SegmentReader.java:490)
at
org.apache.lucene.index.DirectoryReader.document(DirectoryReader.java:568)
at
org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:264)
at
org.elasticsearch.search.fetch.FetchPhase.loadDocument(FetchPhase.java:288)
at
org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:159)
at
org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:430)
at
org.elasticsearch.search.action.SearchServiceTransportAction$SearchFetchByIdTransportHandler.messageReceived(SearchServiceTransportAction.java:602)
at
org.elasticsearch.search.action.SearchServiceTransportAction$SearchFetchByIdTransportHandler.messageReceived(SearchServiceTransportAction.java:591)
at
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:268)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)

=====

We've run into this problem before with another query, eventually being
unable to retrieve any documents whatsoever. Running _optimize helped, but
after a few days we get these errors again.

Some more details:
Version: 0.19.11
Replication factor: 2
Health: {"cluster_name" : "...",

"status" : "green",
"timed_out" : false,
"number_of_nodes" : 27,
"number_of_data_nodes" : 3,
"active_primary_shards" : 96,
"active_shards" : 192,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0
}

I'll appreciate any help/pointers.

-Ignacio

--

Ignacio_Zendejas · January 9, 2013, 12:56am

No, not per query, but we do have our own scorer. We also have custom
analyzers, which we also don't specify per query, only per field.

Thanks.

On Tue, Jan 8, 2013 at 4:28 PM, Igor Motov imotov@gmail.com wrote:

Are you using custom_scorer in your queries?

On Tuesday, January 8, 2013 3:13:32 PM UTC-5, Ignacio wrote:

Hi -

When doing searches against an index, we get shard failures as shown in
the query result below.

{"took" : 118,
"timed_out" : false,
"_shards" : {
"total" : 32,
"successful" : 31,
"failed" : 1,
"failures" : [ {
"status" : 500,
"reason" : "RemoteTransportException[[**Surge][inet[/...:9300]][**search/phase/fetch/id]];
nested: IllegalArgumentException[docID must be >= 0 and < maxDoc=1 (got
docID=2147330367)]; "
} ]
},
"hits" : {
"total" : 15
...

==
The stack trace is the following:

2013-01-08 19:11:02,089 [elasticsearch[Neptune][**
transport_client_worker][T#2]{**New I/O worker #2}] DEBUG [type] -
[Neptune] [55784508] Failed to execute fetch phase
org.elasticsearch.transport.**RemoteTransportException:
[Ringer][inet[/...:9300]][**search/phase/fetch/id]
Caused by: java.lang.**IllegalArgumentException: docID must be >= 0 and
< maxDoc=1 (got docID=2147335046)
at org.apache.lucene.index.SegmentReader.document(
SegmentReader.java:490)
at org.apache.lucene.index.DirectoryReader.document(
DirectoryReader.java:568)
at org.apache.lucene.search.IndexSearcher.doc(
IndexSearcher.java:264)
at org.elasticsearch.search.fetch.FetchPhase.loadDocument(
FetchPhase.java:288)
at org.elasticsearch.search.fetch.FetchPhase.execute(
FetchPhase.java:159)
at org.elasticsearch.search.**SearchService.*executeFetchPhase(
*SearchService.java:430)
at org.elasticsearch.search.action.
SearchServiceTransportAction$SearchFetchByIdTransportHandle
r.messageReceived(**SearchServiceTransportAction.**java:602)
at org.elasticsearch.search.action.
SearchServiceTransportAction$SearchFetchByIdTransportHandle
r.messageReceived(**SearchServiceTransportAction.**java:591)
at org.elasticsearch.transport.netty.MessageChannelHandler$
RequestHandler.run(**MessageChannelHandler.java:**268)
at java.util.concurrent.ThreadPoolExecutor.runWorker(
ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(
ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.**java:722)

=====

We've run into this problem before with another query, eventually being
unable to retrieve any documents whatsoever. Running _optimize helped, but
after a few days we get these errors again.

Some more details:
Version: 0.19.11
Replication factor: 2
Health: {"cluster_name" : "...",

"status" : "green",
"timed_out" : false,
"number_of_nodes" : 27,
"number_of_data_nodes" : 3,
"active_primary_shards" : 96,
"active_shards" : 192,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0
}

I'll appreciate any help/pointers.

-Ignacio

--

--

Igor_Motov · January 9, 2013, 1:05am

Is it possible that your scorer returns NaN for some documents?

On Tuesday, January 8, 2013 7:56:27 PM UTC-5, Ignacio wrote:

No, not per query, but we do have our own scorer. We also have custom
analyzers, which we also don't specify per query, only per field.

Thanks.

On Tue, Jan 8, 2013 at 4:28 PM, Igor Motov <imo...@gmail.com <javascript:>

wrote:

Are you using custom_scorer in your queries?

On Tuesday, January 8, 2013 3:13:32 PM UTC-5, Ignacio wrote:

Hi -

When doing searches against an index, we get shard failures as shown in
the query result below.

{"took" : 118,
"timed_out" : false,
"_shards" : {
"total" : 32,
"successful" : 31,
"failed" : 1,
"failures" : [ {
"status" : 500,
"reason" : "RemoteTransportException[[**Surge][inet[/...:9300]][**search/phase/fetch/id]];
nested: IllegalArgumentException[docID must be >= 0 and < maxDoc=1 (got
docID=2147330367)]; "
} ]
},
"hits" : {
"total" : 15
...

==
The stack trace is the following:

2013-01-08 19:11:02,089 [elasticsearch[Neptune][**
transport_client_worker][T#2]{**New I/O worker #2}] DEBUG [type] -
[Neptune] [55784508] Failed to execute fetch phase
org.elasticsearch.transport.**RemoteTransportException:
[Ringer][inet[/...:9300]][**search/phase/fetch/id]
Caused by: java.lang.**IllegalArgumentException: docID must be >= 0 and
< maxDoc=1 (got docID=2147335046)
at org.apache.lucene.index.SegmentReader.document(
SegmentReader.java:490)
at org.apache.lucene.index.DirectoryReader.document(
DirectoryReader.java:568)
at org.apache.lucene.search.IndexSearcher.doc(
IndexSearcher.java:264)
at org.elasticsearch.search.fetch.FetchPhase.loadDocument(
FetchPhase.java:288)
at org.elasticsearch.search.fetch.FetchPhase.execute(
FetchPhase.java:159)
at org.elasticsearch.search.**SearchService.**executeFetchPhase(
**SearchService.java:430)
at org.elasticsearch.search.action.
SearchServiceTransportAction$SearchFetchByIdTransportHandle
r.messageReceived(**SearchServiceTransportAction.**java:602)
at org.elasticsearch.search.action.
SearchServiceTransportAction$SearchFetchByIdTransportHandle
r.messageReceived(**SearchServiceTransportAction.**java:591)
at org.elasticsearch.transport.netty.MessageChannelHandler$
RequestHandler.run(**MessageChannelHandler.java:**268)
at java.util.concurrent.ThreadPoolExecutor.runWorker(
ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(
ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.**java:722)

=====

We've run into this problem before with another query, eventually being
unable to retrieve any documents whatsoever. Running _optimize helped, but
after a few days we get these errors again.

Some more details:
Version: 0.19.11
Replication factor: 2
Health: {"cluster_name" : "...",

"status" : "green",
"timed_out" : false,
"number_of_nodes" : 27,
"number_of_data_nodes" : 3,
"active_primary_shards" : 96,
"active_shards" : 192,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0
}

I'll appreciate any help/pointers.

-Ignacio

--

--

Ignacio_Zendejas · January 9, 2013, 1:13am

We'll look into the scoring, Igor, but it seems to not be the problem given
that I've finished running _optimize on the affected index and that indeed
fixed the problem. However, I'm afraid it may pop up again.

Some more context:
We have thousands of writes/hour (we have 3 indices, the one usu. affected
has very large documents--in the order of KBs*). Lots of segments have
deleted docs. When I run _optimize with max_num_segments=1, it never seems
to complete, but the number of deleted documents gets reduced.

We've recently optimized the size of these documents, and set compression
at 5KB, but compression may not be a problem because we had these problems
prior to turning on compression.

Thanks!

On Tue, Jan 8, 2013 at 4:56 PM, Ignacio Zendejas <
ignacio.zendejas.cs@gmail.com> wrote:

No, not per query, but we do have our own scorer. We also have custom
analyzers, which we also don't specify per query, only per field.

Thanks.

On Tue, Jan 8, 2013 at 4:28 PM, Igor Motov imotov@gmail.com wrote:

Are you using custom_scorer in your queries?

On Tuesday, January 8, 2013 3:13:32 PM UTC-5, Ignacio wrote:

Hi -

When doing searches against an index, we get shard failures as shown in
the query result below.

{"took" : 118,
"timed_out" : false,
"_shards" : {
"total" : 32,
"successful" : 31,
"failed" : 1,
"failures" : [ {
"status" : 500,
"reason" : "RemoteTransportException[[**Surge][inet[/...:9300]][**search/phase/fetch/id]];
nested: IllegalArgumentException[docID must be >= 0 and < maxDoc=1 (got
docID=2147330367)]; "
} ]
},
"hits" : {
"total" : 15
...

==
The stack trace is the following:

2013-01-08 19:11:02,089 [elasticsearch[Neptune][**
transport_client_worker][T#2]{**New I/O worker #2}] DEBUG [type] -
[Neptune] [55784508] Failed to execute fetch phase
org.elasticsearch.transport.**RemoteTransportException:
[Ringer][inet[/...:9300]][**search/phase/fetch/id]
Caused by: java.lang.**IllegalArgumentException: docID must be >= 0 and
< maxDoc=1 (got docID=2147335046)
at org.apache.lucene.index.SegmentReader.document(
SegmentReader.java:490)
at org.apache.lucene.index.DirectoryReader.document(
DirectoryReader.java:568)
at org.apache.lucene.search.IndexSearcher.doc(
IndexSearcher.java:264)
at org.elasticsearch.search.fetch.FetchPhase.loadDocument(
FetchPhase.java:288)
at org.elasticsearch.search.fetch.FetchPhase.execute(
FetchPhase.java:159)
at org.elasticsearch.search.**SearchService.**executeFetchPhase(
**SearchService.java:430)
at org.elasticsearch.search.action.
SearchServiceTransportAction$SearchFetchByIdTransportHandle
r.messageReceived(**SearchServiceTransportAction.**java:602)
at org.elasticsearch.search.action.
SearchServiceTransportAction$SearchFetchByIdTransportHandle
r.messageReceived(**SearchServiceTransportAction.**java:591)
at org.elasticsearch.transport.netty.MessageChannelHandler$
RequestHandler.run(**MessageChannelHandler.java:**268)
at java.util.concurrent.ThreadPoolExecutor.runWorker(
ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(
ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.**java:722)

=====

We've run into this problem before with another query, eventually being
unable to retrieve any documents whatsoever. Running _optimize helped, but
after a few days we get these errors again.

Some more details:
Version: 0.19.11
Replication factor: 2
Health: {"cluster_name" : "...",

"status" : "green",
"timed_out" : false,
"number_of_nodes" : 27,
"number_of_data_nodes" : 3,
"active_primary_shards" : 96,
"active_shards" : 192,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0
}

I'll appreciate any help/pointers.

-Ignacio

--

--

Igor_Motov · January 9, 2013, 1:37am

I am just trying to find the simplest possible explanation. I know for sure
that this script nan_score.sh · GitHub is causing exactly the
same error, you are using your own scorer, so to me it sounds like
something that should be checked first.

When you run _optimize, do you specify only_expunge_deletes=true? If not,
try running it with this flag, to see if it will change the situation.

Do you see any warnings in the log file saying "failed to merge"?

On Tuesday, January 8, 2013 8:13:53 PM UTC-5, Ignacio wrote:

We'll look into the scoring, Igor, but it seems to not be the problem
given that I've finished running _optimize on the affected index and that
indeed fixed the problem. However, I'm afraid it may pop up again.

Some more context:
We have thousands of writes/hour (we have 3 indices, the one usu. affected
has very large documents--in the order of KBs*). Lots of segments have
deleted docs. When I run _optimize with max_num_segments=1, it never seems
to complete, but the number of deleted documents gets reduced.

We've recently optimized the size of these documents, and set
compression at 5KB, but compression may not be a problem because we had
these problems prior to turning on compression.

Thanks!

On Tue, Jan 8, 2013 at 4:56 PM, Ignacio Zendejas <ignacio.z...@gmail.com<javascript:>

wrote:

No, not per query, but we do have our own scorer. We also have custom
analyzers, which we also don't specify per query, only per field.

Thanks.

On Tue, Jan 8, 2013 at 4:28 PM, Igor Motov <imo...@gmail.com<javascript:>

wrote:

Are you using custom_scorer in your queries?

On Tuesday, January 8, 2013 3:13:32 PM UTC-5, Ignacio wrote:

Hi -

When doing searches against an index, we get shard failures as shown in
the query result below.

{"took" : 118,
"timed_out" : false,
"_shards" : {
"total" : 32,
"successful" : 31,
"failed" : 1,
"failures" : [ {
"status" : 500,
"reason" : "RemoteTransportException[[*Surge][inet[/...:9300]][
*search/phase/fetch/id]]; nested: IllegalArgumentException[docID must
be >= 0 and < maxDoc=1 (got docID=2147330367)]; "
} ]
},
"hits" : {
"total" : 15
...

==
The stack trace is the following:

2013-01-08 19:11:02,089 [elasticsearch[Neptune][**
transport_client_worker][T#2]{**New I/O worker #2}] DEBUG [type] -
[Neptune] [55784508] Failed to execute fetch phase
org.elasticsearch.transport.**RemoteTransportException:
[Ringer][inet[/...:9300]][**search/phase/fetch/id]
Caused by: java.lang.**IllegalArgumentException: docID must be >= 0
and < maxDoc=1 (got docID=2147335046)
at org.apache.lucene.index.SegmentReader.document(
SegmentReader.java:490)
at org.apache.lucene.index.DirectoryReader.document(
DirectoryReader.java:568)
at org.apache.lucene.search.IndexSearcher.doc(
IndexSearcher.java:264)
at org.elasticsearch.search.fetch.FetchPhase.loadDocument(
FetchPhase.java:288)
at org.elasticsearch.search.fetch.FetchPhase.execute(
FetchPhase.java:159)
at org.elasticsearch.search.SearchService.
executeFetchPhase(**SearchService.java:430)
at org.elasticsearch.search.action.
SearchServiceTransportAction$SearchFetchByIdTransportHandle
r.messageReceived(**SearchServiceTransportAction.**java:602)
at org.elasticsearch.search.action.
SearchServiceTransportAction$SearchFetchByIdTransportHandle
r.messageReceived(**SearchServiceTransportAction.**java:591)
at org.elasticsearch.transport.netty.MessageChannelHandler$
RequestHandler.run(**MessageChannelHandler.java:**268)
at java.util.concurrent.ThreadPoolExecutor.runWorker(
ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(
ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.**java:722)

=====

We've run into this problem before with another query, eventually being
unable to retrieve any documents whatsoever. Running _optimize helped, but
after a few days we get these errors again.

Some more details:
Version: 0.19.11
Replication factor: 2
Health: {"cluster_name" : "...",

"status" : "green",
"timed_out" : false,
"number_of_nodes" : 27,
"number_of_data_nodes" : 3,
"active_primary_shards" : 96,
"active_shards" : 192,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0
}

I'll appreciate any help/pointers.

-Ignacio

--

--

Ignacio_Zendejas · January 9, 2013, 2:33am

I ran with max_num_segments flag today, can't say if
only_expunge_deletes=true would do the trick. I did not find a "failed to
merge" message in the logs over the past 4 days.

I was definitely able to reproduce the same exact stack trace, minus a
different docId with the gist you provided. Since this is a
production-based index, I'll have to keep an eye on it to see if the
Exception is thrown again -- but we're trying to see if we can reproduce it
on our staging environment (where we haven't run _optimize recently).

All along, my suspicion has been corrupted segments, but I'll also
investigate the possibility of deleted documents screwing up the scores.

On Tue, Jan 8, 2013 at 5:37 PM, Igor Motov imotov@gmail.com wrote:

I am just trying to find the simplest possible explanation. I know for
sure that this script nan_score.sh · GitHub is
causing exactly the same error, you are using your own scorer, so to me it
sounds like something that should be checked first.

When you run _optimize, do you specify only_expunge_deletes=true? If not,
try running it with this flag, to see if it will change the situation.

Do you see any warnings in the log file saying "failed to merge"?

On Tuesday, January 8, 2013 8:13:53 PM UTC-5, Ignacio wrote:

We'll look into the scoring, Igor, but it seems to not be the problem
given that I've finished running _optimize on the affected index and that
indeed fixed the problem. However, I'm afraid it may pop up again.

Some more context:
We have thousands of writes/hour (we have 3 indices, the one usu.
affected has very large documents--in the order of KBs*). Lots of segments
have deleted docs. When I run _optimize with max_num_segments=1, it never
seems to complete, but the number of deleted documents gets reduced.

We've recently optimized the size of these documents, and set
compression at 5KB, but compression may not be a problem because we had
these problems prior to turning on compression.

Thanks!

On Tue, Jan 8, 2013 at 4:56 PM, Ignacio Zendejas <ignacio.z...@gmail.com*
*> wrote:

No, not per query, but we do have our own scorer. We also have custom
analyzers, which we also don't specify per query, only per field.

Thanks.

On Tue, Jan 8, 2013 at 4:28 PM, Igor Motov imo...@gmail.com wrote:

Are you using custom_scorer in your queries?

On Tuesday, January 8, 2013 3:13:32 PM UTC-5, Ignacio wrote:

Hi -

When doing searches against an index, we get shard failures as shown
in the query result below.

{"took" : 118,
"timed_out" : false,
"_shards" : {
"total" : 32,
"successful" : 31,
"failed" : 1,
"failures" : [ {
"status" : 500,
"reason" : "RemoteTransportException[[Sur
ge][inet[/...:9300]][**search/**phase/fetch/id]]; nested:
IllegalArgumentException[docID must be >= 0 and < maxDoc=1 (got docID=
2147330367)]; "
} ]
},
"hits" : {
"total" : 15
...

==
The stack trace is the following:

2013-01-08 19:11:02,089 [elasticsearch[Neptune][transp
ort_client_worker][T#2]{New I/O worker #2}] DEBUG [type] -
[Neptune] [55784508] Failed to execute fetch phase
org.elasticsearch.transport.RemoteTransportException:
[Ringer][inet[/...:9300]][search/phase/fetch/id]
Caused by: java.lang.IllegalArgumentException: docID must be >= 0
and < maxDoc=1 (got docID=2147335046)
at org.apache.lucene.index.SegmentReader.document(
SegmentReader**.java:490)
at org.apache.lucene.index.DirectoryReader.document(**
DirectoryReader.java:568)
at org.apache.lucene.search.IndexSearcher.doc(
IndexSearcher.java:264)
at org.elasticsearch.search.fetch.FetchPhase.loadDocument(
FetchPhase.java:288)
at org.elasticsearch.search.fetch.FetchPhase.execute(
FetchPhase**.java:159)
at org.elasticsearch.search.SearchService.**
executeFetchPhase(SearchService.java:430)
at org.elasticsearch.search.action.**
SearchServiceTransportAction**$SearchFetchByIdTransportHandle**
r.messageReceived(SearchServiceTransportAction.java:602)
at org.elasticsearch.search.action.
SearchServiceTransportAction**$SearchFetchByIdTransportHandle**
r.messageReceived(SearchServiceTransportAction.java:591)
at org.elasticsearch.transport.ne
tty.MessageChannelHandler$RequestHandler.run(MessageChannelH
andler.java:268)
at java.util.concurrent.ThreadPoolExecutor.runWorker(
ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(**
ThreadPoo**lExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)

=====

We've run into this problem before with another query, eventually
being unable to retrieve any documents whatsoever. Running _optimize
helped, but after a few days we get these errors again.

Some more details:
Version: 0.19.11
Replication factor: 2
Health: {"cluster_name" : "...",

"status" : "green",
"timed_out" : false,
"number_of_nodes" : 27,
"number_of_data_nodes" : 3,
"active_primary_shards" : 96,
"active_shards" : 192,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0
}

I'll appreciate any help/pointers.

-Ignacio

--

--

--

Ignacio_Zendejas · January 9, 2013, 11:31pm

Igor, thanks for the pointer. It definitely turned out to be a problem with
our scorer and we do, in effect, get an NaN. The problem is that our
scorer will not work properly with the existence of deleted docs,
therefore, running optimize will at least temporarily fix this because it
removes them.

For the interested:

I was able to find a query that broke on our staging environment, so I
downloaded the index data and ran that same query locally.
only_expunge_deletes=true did not remove all deleted documents (and thus
did not fix our problem), so I ran _optimize with max_num_segments=1 (which
removes all deleted docs) and compared the list of retrieved documents
After identifying a document that was not retrieved prior to optimizing,
I ran an explain* with the same query on that document and found that it
had a field which returned an NaN (log(0) * 0) leading to an overall score
of NaN.

*Elasticsearch Platform — Find real-time answers at scale | Elastic

By the way, scoring documents with NaN will also lead to fetching duplicate
documents when paging through results, since the reported number of hits
will be greater that the retrievable documents.

Thanks again!

On Tue, Jan 8, 2013 at 6:33 PM, Ignacio Zendejas <
ignacio.zendejas.cs@gmail.com> wrote:

I ran with max_num_segments flag today, can't say if
only_expunge_deletes=true would do the trick. I did not find a "failed to
merge" message in the logs over the past 4 days.

I was definitely able to reproduce the same exact stack trace, minus a
different docId with the gist you provided. Since this is a
production-based index, I'll have to keep an eye on it to see if the
Exception is thrown again -- but we're trying to see if we can reproduce it
on our staging environment (where we haven't run _optimize recently).

All along, my suspicion has been corrupted segments, but I'll also
investigate the possibility of deleted documents screwing up the scores.

On Tue, Jan 8, 2013 at 5:37 PM, Igor Motov imotov@gmail.com wrote:

I am just trying to find the simplest possible explanation. I know for
sure that this script nan_score.sh · GitHub is
causing exactly the same error, you are using your own scorer, so to me it
sounds like something that should be checked first.

When you run _optimize, do you specify only_expunge_deletes=true? If not,
try running it with this flag, to see if it will change the situation.

Do you see any warnings in the log file saying "failed to merge"?

On Tuesday, January 8, 2013 8:13:53 PM UTC-5, Ignacio wrote:

We'll look into the scoring, Igor, but it seems to not be the problem
given that I've finished running _optimize on the affected index and that
indeed fixed the problem. However, I'm afraid it may pop up again.

Some more context:
We have thousands of writes/hour (we have 3 indices, the one usu.
affected has very large documents--in the order of KBs*). Lots of segments
have deleted docs. When I run _optimize with max_num_segments=1, it never
seems to complete, but the number of deleted documents gets reduced.

We've recently optimized the size of these documents, and set
compression at 5KB, but compression may not be a problem because we had
these problems prior to turning on compression.

Thanks!

On Tue, Jan 8, 2013 at 4:56 PM, Ignacio Zendejas <ignacio.z...@gmail.com
**> wrote:

No, not per query, but we do have our own scorer. We also have custom
analyzers, which we also don't specify per query, only per field.

Thanks.

On Tue, Jan 8, 2013 at 4:28 PM, Igor Motov imo...@gmail.com wrote:

Are you using custom_scorer in your queries?

On Tuesday, January 8, 2013 3:13:32 PM UTC-5, Ignacio wrote:

Hi -

When doing searches against an index, we get shard failures as shown
in the query result below.

{"took" : 118,
"timed_out" : false,
"_shards" : {
"total" : 32,
"successful" : 31,
"failed" : 1,
"failures" : [ {
"status" : 500,
"reason" : "RemoteTransportException[[Sur
ge][inet[/...:9300]][**search/**phase/fetch/id]]; nested:
IllegalArgumentException[docID must be >= 0 and < maxDoc=1 (got docID=
2147330367)]; "
} ]
},
"hits" : {
"total" : 15
...

==
The stack trace is the following:

2013-01-08 19:11:02,089 [elasticsearch[Neptune][transp
ort_client_worker][T#2]{New I/O worker #2}] DEBUG [type] -
[Neptune] [55784508] Failed to execute fetch phase
org.elasticsearch.transport.RemoteTransportException:
[Ringer][inet[/...:9300]][search/phase/fetch/id]
Caused by: java.lang.IllegalArgumentException: docID must be >=
0 and < maxDoc=1 (got docID=2147335046)
at org.apache.lucene.index.SegmentReader.document(
SegmentReader**.java:490)
at org.apache.lucene.index.DirectoryReader.document(**
DirectoryReader.java:568)
at org.apache.lucene.search.IndexSearcher.doc(
IndexSearcher.java:264)
at org.elasticsearch.search.fetch
.FetchPhase.loadDocument(FetchPhase.java:288)
at org.elasticsearch.search.fetch.FetchPhase.execute(
FetchPhase**.java:159)
at org.elasticsearch.search.SearchService.**
executeFetchPhase(SearchService.java:430)
at org.elasticsearch.search.action.**
SearchServiceTransportAction**$SearchFetchByIdTransportHandle**
r.messageReceived(SearchServiceTransportAction.java:602)
at org.elasticsearch.search.action.
SearchServiceTransportAction**$SearchFetchByIdTransportHandle**
r.messageReceived(SearchServiceTransportAction.java:591)
at org.elasticsearch.transport.ne
tty.MessageChannelHandler$RequestHandler.run(MessageChannelH
andler.java:268)
at java.util.concurrent.ThreadPoolExecutor.runWorker(
ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(**
ThreadPoo**lExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)

=====

We've run into this problem before with another query, eventually
being unable to retrieve any documents whatsoever. Running _optimize
helped, but after a few days we get these errors again.

Some more details:
Version: 0.19.11
Replication factor: 2
Health: {"cluster_name" : "...",

"status" : "green",
"timed_out" : false,
"number_of_nodes" : 27,
"number_of_data_nodes" : 3,
"active_primary_shards" : 96,
"active_shards" : 192,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0
}

I'll appreciate any help/pointers.

-Ignacio

--

--

--

Ignacio_Zendejas · January 9, 2013, 11:37pm

Correction, paging through results will not lead to retrieval of duplicate
documents. You simply won't be able to retrieve those documents with NaN
scores.

On Wed, Jan 9, 2013 at 3:31 PM, Ignacio Zendejas <
ignacio.zendejas.cs@gmail.com> wrote:

Igor, thanks for the pointer. It definitely turned out to be a problem
with our scorer and we do, in effect, get an NaN. The problem is that our
scorer will not work properly with the existence of deleted docs,
therefore, running optimize will at least temporarily fix this because it
removes them.

For the interested:

I was able to find a query that broke on our staging environment, so I
downloaded the index data and ran that same query locally.

only_expunge_deletes=true did not remove all deleted documents (and thus
did not fix our problem), so I ran _optimize with max_num_segments=1 (which
removes all deleted docs) and compared the list of retrieved documents

After identifying a document that was not retrieved prior to optimizing,
I ran an explain* with the same query on that document and found that it
had a field which returned an NaN (log(0) * 0) leading to an overall score
of NaN.

*Elasticsearch Platform — Find real-time answers at scale | Elastic

By the way, scoring documents with NaN will also lead to fetching
duplicate documents when paging through results, since the reported number
of hits will be greater that the retrievable documents.

Thanks again!

On Tue, Jan 8, 2013 at 6:33 PM, Ignacio Zendejas <
ignacio.zendejas.cs@gmail.com> wrote:

I ran with max_num_segments flag today, can't say if
only_expunge_deletes=true would do the trick. I did not find a "failed to
merge" message in the logs over the past 4 days.

I was definitely able to reproduce the same exact stack trace, minus a
different docId with the gist you provided. Since this is a
production-based index, I'll have to keep an eye on it to see if the
Exception is thrown again -- but we're trying to see if we can reproduce it
on our staging environment (where we haven't run _optimize recently).

All along, my suspicion has been corrupted segments, but I'll also
investigate the possibility of deleted documents screwing up the scores.

On Tue, Jan 8, 2013 at 5:37 PM, Igor Motov imotov@gmail.com wrote:

I am just trying to find the simplest possible explanation. I know for
sure that this script nan_score.sh · GitHub is
causing exactly the same error, you are using your own scorer, so to me it
sounds like something that should be checked first.

When you run _optimize, do you specify only_expunge_deletes=true? If
not, try running it with this flag, to see if it will change the situation.

Do you see any warnings in the log file saying "failed to merge"?

On Tuesday, January 8, 2013 8:13:53 PM UTC-5, Ignacio wrote:

We'll look into the scoring, Igor, but it seems to not be the problem
given that I've finished running _optimize on the affected index and that
indeed fixed the problem. However, I'm afraid it may pop up again.

Some more context:
We have thousands of writes/hour (we have 3 indices, the one usu.
affected has very large documents--in the order of KBs*). Lots of segments
have deleted docs. When I run _optimize with max_num_segments=1, it never
seems to complete, but the number of deleted documents gets reduced.

We've recently optimized the size of these documents, and set
compression at 5KB, but compression may not be a problem because we had
these problems prior to turning on compression.

Thanks!

On Tue, Jan 8, 2013 at 4:56 PM, Ignacio Zendejas <
ignacio.z...@gmail.com**> wrote:

No, not per query, but we do have our own scorer. We also have custom
analyzers, which we also don't specify per query, only per field.

Thanks.

On Tue, Jan 8, 2013 at 4:28 PM, Igor Motov imo...@gmail.com wrote:

Are you using custom_scorer in your queries?

On Tuesday, January 8, 2013 3:13:32 PM UTC-5, Ignacio wrote:

Hi -

When doing searches against an index, we get shard failures as shown
in the query result below.

{"took" : 118,
"timed_out" : false,
"_shards" : {
"total" : 32,
"successful" : 31,
"failed" : 1,
"failures" : [ {
"status" : 500,
"reason" : "RemoteTransportException[[Sur
ge][inet[/...:9300]][**search/**phase/fetch/id]]; nested:
IllegalArgumentException[docID must be >= 0 and < maxDoc=1 (got docID=
2147330367)]; "
} ]
},
"hits" : {
"total" : 15
...

==
The stack trace is the following:

2013-01-08 19:11:02,089 [elasticsearch[Neptune][transp
ort_client_worker][T#2]{New I/O worker #2}] DEBUG [type] -
[Neptune] [55784508] Failed to execute fetch phase
org.elasticsearch.transport.RemoteTransportException:
[Ringer][inet[/...:9300]][search/phase/fetch/id]
Caused by: java.lang.IllegalArgumentException: docID must be >=
0 and < maxDoc=1 (got docID=2147335046)
at org.apache.lucene.index.SegmentReader.document(
SegmentReader**.java:490)
at org.apache.lucene.index.DirectoryReader.document(**
DirectoryReader.java:568)
at org.apache.lucene.search.IndexSearcher.doc(
IndexSearcher.java:264)
at org.elasticsearch.search.fetch
.FetchPhase.loadDocument(FetchPhase.java:288)
at org.elasticsearch.search.fetch.FetchPhase.execute(
FetchPhase**.java:159)
at org.elasticsearch.search.SearchService.**
executeFetchPhase(SearchService.java:430)
at org.elasticsearch.search.action.**
SearchServiceTransportAction**$SearchFetchByIdTransportHandle**
r.messageReceived(SearchServiceTransportAction.java:602)
at org.elasticsearch.search.action.
SearchServiceTransportAction**$SearchFetchByIdTransportHandle**
r.messageReceived(SearchServiceTransportAction.java:591)
at org.elasticsearch.transport.ne
tty.MessageChannelHandler$RequestHandler.run(MessageChannelH
andler.java:268)
at java.util.concurrent.ThreadPoolExecutor.runWorker(
ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(**
ThreadPoo**lExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)

=====

We've run into this problem before with another query, eventually
being unable to retrieve any documents whatsoever. Running _optimize
helped, but after a few days we get these errors again.

Some more details:
Version: 0.19.11
Replication factor: 2
Health: {"cluster_name" : "...",

"status" : "green",
"timed_out" : false,
"number_of_nodes" : 27,
"number_of_data_nodes" : 3,
"active_primary_shards" : 96,
"active_shards" : 192,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0
}

I'll appreciate any help/pointers.

-Ignacio

--

--

--

Ignacio_Zendejas · January 10, 2013, 3:31am

Here was the root problem -- readers beware:

We didn't know Lucene's docFreq(term) would include deleted documents in
the count! This is terrible and I hope Lucene addresses this.

If your custom scorer relies on docFreq, here's an article that explains
that along with a workaround -- which am not necessarily vouching for. We
found our own work-around (which is specific to our scorer).

And scratch the last post, having NaN's DOES lead to duplicate documents
being fetched.

On Wed, Jan 9, 2013 at 3:37 PM, Ignacio Zendejas <
ignacio.zendejas.cs@gmail.com> wrote:

Correction, paging through results will not lead to retrieval of duplicate
documents. You simply won't be able to retrieve those documents with NaN
scores.

On Wed, Jan 9, 2013 at 3:31 PM, Ignacio Zendejas <
ignacio.zendejas.cs@gmail.com> wrote:

Igor, thanks for the pointer. It definitely turned out to be a problem
with our scorer and we do, in effect, get an NaN. The problem is that our
scorer will not work properly with the existence of deleted docs,
therefore, running optimize will at least temporarily fix this because it
removes them.

For the interested:

I was able to find a query that broke on our staging environment, so I
downloaded the index data and ran that same query locally.

only_expunge_deletes=true did not remove all deleted documents (and
thus did not fix our problem), so I ran _optimize with max_num_segments=1
(which removes all deleted docs) and compared the list of retrieved
documents

After identifying a document that was not retrieved prior to
optimizing, I ran an explain* with the same query on that document and
found that it had a field which returned an NaN (log(0) * 0) leading to an
overall score of NaN.

*Elasticsearch Platform — Find real-time answers at scale | Elastic

By the way, scoring documents with NaN will also lead to fetching
duplicate documents when paging through results, since the reported number
of hits will be greater that the retrievable documents.

Thanks again!

On Tue, Jan 8, 2013 at 6:33 PM, Ignacio Zendejas <
ignacio.zendejas.cs@gmail.com> wrote:

I ran with max_num_segments flag today, can't say if
only_expunge_deletes=true would do the trick. I did not find a "failed to
merge" message in the logs over the past 4 days.

I was definitely able to reproduce the same exact stack trace, minus a
different docId with the gist you provided. Since this is a
production-based index, I'll have to keep an eye on it to see if the
Exception is thrown again -- but we're trying to see if we can reproduce it
on our staging environment (where we haven't run _optimize recently).

All along, my suspicion has been corrupted segments, but I'll also
investigate the possibility of deleted documents screwing up the scores.

On Tue, Jan 8, 2013 at 5:37 PM, Igor Motov imotov@gmail.com wrote:

I am just trying to find the simplest possible explanation. I know for
sure that this script nan_score.sh · GitHub is
causing exactly the same error, you are using your own scorer, so to me it
sounds like something that should be checked first.

When you run _optimize, do you specify only_expunge_deletes=true? If
not, try running it with this flag, to see if it will change the situation.

Do you see any warnings in the log file saying "failed to merge"?

On Tuesday, January 8, 2013 8:13:53 PM UTC-5, Ignacio wrote:

We'll look into the scoring, Igor, but it seems to not be the problem
given that I've finished running _optimize on the affected index and that
indeed fixed the problem. However, I'm afraid it may pop up again.

Some more context:
We have thousands of writes/hour (we have 3 indices, the one usu.
affected has very large documents--in the order of KBs*). Lots of segments
have deleted docs. When I run _optimize with max_num_segments=1, it never
seems to complete, but the number of deleted documents gets reduced.

We've recently optimized the size of these documents, and set
compression at 5KB, but compression may not be a problem because we had
these problems prior to turning on compression.

Thanks!

On Tue, Jan 8, 2013 at 4:56 PM, Ignacio Zendejas <
ignacio.z...@gmail.com**> wrote:

No, not per query, but we do have our own scorer. We also have custom
analyzers, which we also don't specify per query, only per field.

Thanks.

On Tue, Jan 8, 2013 at 4:28 PM, Igor Motov imo...@gmail.com wrote:

Are you using custom_scorer in your queries?

On Tuesday, January 8, 2013 3:13:32 PM UTC-5, Ignacio wrote:

Hi -

When doing searches against an index, we get shard failures as
shown in the query result below.

{"took" : 118,
"timed_out" : false,
"_shards" : {
"total" : 32,
"successful" : 31,
"failed" : 1,
"failures" : [ {
"status" : 500,
"reason" : "RemoteTransportException[[Sur
ge][inet[/...:9300]][**search/**phase/fetch/id]]; nested:
IllegalArgumentException[docID must be >= 0 and < maxDoc=1 (got docID=
2147330367)]; "
} ]
},
"hits" : {
"total" : 15
...

==
The stack trace is the following:

2013-01-08 19:11:02,089 [elasticsearch[Neptune][transp
ort_client_worker][T#2]{**New I/O worker #2}] DEBUG [type] -
[Neptune] [55784508] Failed to execute fetch phase
org.elasticsearch.transport.RemoteTransportException:
[Ringer][inet[/...:9300]][search/phase/fetch/id]
Caused by: java.lang.IllegalArgumentException: docID must be

= 0 and < maxDoc=1 (got docID=2147335046)
at org.apache.lucene.index.SegmentReader.document(**
SegmentReader**.java:490)
at org.apache.lucene.index.DirectoryReader.document(**
DirectoryReader.java:568)
at org.apache.lucene.search.IndexSearcher.doc(
IndexSearcher.java:264)
at org.elasticsearch.search.fetch
.FetchPhase.loadDocument(FetchPhase.java:288)
at org.elasticsearch.search.fetch.FetchPhase.execute(
FetchPhase**.java:159)
at org.elasticsearch.search.SearchService.**
executeFetchPhase(SearchService.java:430)
at org.elasticsearch.search.action.**
SearchServiceTransportAction**$SearchFetchByIdTransportHandle**
r.messageReceived(SearchServiceTransportAction.java:602)
at org.elasticsearch.search.action.
SearchServiceTransportAction**$SearchFetchByIdTransportHandle**
r.messageReceived(SearchServiceTransportAction.java:591)
at org.elasticsearch.transport.ne
tty.MessageChannelHandler$RequestHandler.run(MessageChannelH
*andler.java:268)
at java.util.concurrent.ThreadPoolExecutor.runWorker(
ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(
*ThreadPoo**lExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)

=====

We've run into this problem before with another query, eventually
being unable to retrieve any documents whatsoever. Running _optimize
helped, but after a few days we get these errors again.

Some more details:
Version: 0.19.11
Replication factor: 2
Health: {"cluster_name" : "...",

"status" : "green",
"timed_out" : false,
"number_of_nodes" : 27,
"number_of_data_nodes" : 3,
"active_primary_shards" : 96,
"active_shards" : 192,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0
}

I'll appreciate any help/pointers.

-Ignacio

--

--

--

Topic		Replies	Views
Two different shard exceptions Elasticsearch	16	546	July 6, 2017
RemoteTransportException Elasticsearch	4	1637	July 6, 2017
Failed shards + lost of kibana data Elasticsearch	13	2861	July 5, 2017
Getting Exception while running Aggregation Queries Elasticsearch	2	3938	July 5, 2017
RemoteTransportException Elasticsearch	12	693	February 28, 2023

Shard failure, RemoteTransportException [search/phase/fetch/id]... IllegalArgumentException

Related topics