2013-01-08 19:11:02,089 [elasticsearch[Neptune][transport_client_worker][T#2]{New I/O worker #2}] DEBUG [type] - [Neptune] [55784508] Failed to execute fetch phase
org.elasticsearch.transport.RemoteTransportException: [Ringer][inet[/...:9300]][search/phase/fetch/id]
Caused by: java.lang.IllegalArgumentException: docID must be >= 0 and < maxDoc=1 (got docID=2147335046)
    at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:490)
    at org.apache.lucene.index.DirectoryReader.document(DirectoryReader.java:568)
    at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:264)
    at org.elasticsearch.search.fetch.FetchPhase.loadDocument(FetchPhase.java:288)
    at org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:159)
    at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:430)
    at org.elasticsearch.search.action.SearchServiceTransportAction$SearchFetchByIdTransportHandler.messageReceived(SearchServiceTransportAction.java:602)
    at org.elasticsearch.search.action.SearchServiceTransportAction$SearchFetchByIdTransportHandler.messageReceived(SearchServiceTransportAction.java:591)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:268)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)
=====
We've run into this problem before with another query, and eventually we were unable to retrieve any documents whatsoever. Running _optimize helped, but after a few days we get these errors again.
Some more details:
Version: 0.19.11
Replication factor: 2
Health: {"cluster_name" : "...",
We'll look into the scoring, Igor, but it doesn't seem to be the problem, given that I've just finished running _optimize on the affected index and that did indeed fix the problem. However, I'm afraid it may pop up again.
Some more context:
We have thousands of writes per hour (we have 3 indices; the one usually affected has very large documents, on the order of KBs). Lots of segments have deleted docs. When I run _optimize with max_num_segments=1 it never seems to complete, but the number of deleted documents does get reduced.
We've recently optimized the size of these documents and set compression at 5KB, but compression may not be the problem, because we had these errors before turning compression on.
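(For readers following along: forcing a merge down to a single segment from the Java transport client, which the log above shows we use, looks roughly like the sketch below. This is not our actual code; the cluster name, host, and index name are placeholders, and the REST equivalent is POST /{index}/_optimize?max_num_segments=1.)

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class ForceMergeToOneSegment {
    public static void main(String[] args) {
        // Cluster name, host, and index name below are placeholders.
        Settings settings = ImmutableSettings.settingsBuilder()
                .put("cluster.name", "mycluster")
                .build();
        TransportClient client = new TransportClient(settings)
                .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));
        try {
            // Merge the index down to a single segment; as a side effect this
            // removes all deleted documents from the index.
            client.admin().indices().prepareOptimize("myindex")
                    .setMaxNumSegments(1)
                    .execute().actionGet();
        } finally {
            client.close();
        }
    }
}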
I am just trying to find the simplest possible explanation. I know for sure that this script (nan_score.sh · GitHub) causes exactly the same error. You are using your own scorer, so to me that sounds like the thing to check first.
When you run _optimize, do you specify only_expunge_deletes=true? If not, try running it with this flag to see whether it changes the situation.
Do you see any warnings in the log file saying "failed to merge"?
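(For reference, that flag maps onto the same optimize call; a sketch against the 0.19-era Java API, where the pre-built client and the index name "myindex" are assumptions rather than taken from this thread. The REST form is POST /{index}/_optimize?only_expunge_deletes=true.)

import org.elasticsearch.client.Client;

public class ExpungeDeletesOnly {
    // Merge only the segments that contain deletes, instead of forcing the
    // whole index into one segment. "client" is an already-connected client.
    static void expungeDeletes(Client client) {
        client.admin().indices().prepareOptimize("myindex")
                .setOnlyExpungeDeletes(true)
                .execute().actionGet();
    }
}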
I ran with the max_num_segments flag today, so I can't say whether only_expunge_deletes=true would do the trick. I did not find a "failed to merge" message in the logs from the past 4 days.
With the gist you provided I was definitely able to reproduce the exact same stack trace, apart from a different docID. Since this is a production index, I'll have to keep an eye on it to see whether the exception is thrown again, but we're also trying to reproduce it in our staging environment (where we haven't run _optimize recently).
All along my suspicion has been corrupted segments, but I'll also investigate the possibility of deleted documents screwing up the scores.
Igor, thanks for the pointer. It definitely turned out to be a problem with our scorer: we do, in effect, get a NaN. The problem is that our scorer does not work properly in the presence of deleted docs, so running _optimize at least temporarily fixes this because it removes them.
For the interested:
I was able to find a query that broke on our staging environment, so I downloaded the index data and ran that same query locally.
only_expunge_deletes=true did not remove all deleted documents (and thus did not fix our problem), so I ran _optimize with max_num_segments=1 (which removes all deleted docs) and compared the lists of retrieved documents.
After identifying a document that was not retrieved prior to optimizing, I ran an explain with the same query on that document and found that one of its fields produced a NaN (log(0) * 0), leading to an overall score of NaN.
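(For anyone curious about the arithmetic: log(0) evaluates to negative infinity, and multiplying that by zero yields NaN under IEEE 754; a NaN in any term then poisons the whole score. A tiny, hypothetical illustration, not our scorer, with one possible guard:)

public class NanScoreDemo {
    // Hypothetical stand-in for one per-field term of a custom score.
    static double fieldTerm(double value, double weight) {
        return Math.log(value) * weight; // log(0) is -Infinity; -Infinity * 0 is NaN
    }

    public static void main(String[] args) {
        double term = fieldTerm(0.0, 0.0);
        System.out.println(term);         // NaN
        System.out.println(1.0 + term);   // NaN propagates into the overall score

        // One possible guard: drop non-finite terms before combining them.
        double safe = (Double.isNaN(term) || Double.isInfinite(term)) ? 0.0 : term;
        System.out.println(1.0 + safe);   // 1.0
    }
}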
By the way, scoring documents with NaN will also lead to fetching duplicate documents when paging through results, since the reported number of hits will be greater than the number of retrievable documents.
Correction: paging through results will not lead to retrieval of duplicate documents. You simply won't be able to retrieve the documents with NaN scores.
We didn't know Lucene's docFreq(term) would include deleted documents in the count! This is terrible and I hope Lucene addresses it.
If your custom scorer relies on docFreq, here's an article that explains that, along with a workaround -- which I'm not necessarily vouching for. We found our own work-around (which is specific to our scorer).
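(If you want to see this directly, here is an illustrative check against the Lucene 3.x API that ships with Elasticsearch 0.19; the index path and the term are made up:)

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;

import java.io.File;

public class DocFreqIncludesDeletes {
    public static void main(String[] args) throws Exception {
        // Open an existing index (the path is a placeholder).
        IndexReader reader = IndexReader.open(FSDirectory.open(new File("/path/to/index")));
        try {
            // docFreq counts every document that ever contained the term,
            // including documents that were deleted but not yet merged away.
            int df = reader.docFreq(new Term("field", "value"));
            System.out.println("docFreq=" + df
                    + " numDocs=" + reader.numDocs()            // live docs only
                    + " deleted=" + reader.numDeletedDocs());   // still counted by docFreq
        } finally {
            reader.close();
        }
    }
}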
And scratch the last post: having NaNs DOES lead to duplicate documents being fetched.