Shard index gone bad, anyone know how to fix this: java.io.EOFException: read past EOF: NIOFSIndexInput

We're running version 0.19.9 with 6 servers and 6 shards. A few
days ago, shard number 2 seems to have gone goofy (it looks like
a corruption in the index), causing the following exception to appear
constantly in the server logs:

org.elasticsearch.transport.RemoteTransportException:
[cardano][inet[/xx.xxx.xx.xxx:9300]][search/phase/query]

Caused by: org.elasticsearch.search.query.QueryPhaseExecutionException:
[theindex][2]: query[filtered(+activityObject.content:"Some query term"
+sourceInfo.publisher:Some Name
-sourceInfo.dataSource:directPooling)->cache(_type:socialmedia)],from[0],size[1],sort[<custom:"sortDate":
org.elasticsearch.index.field.data.longs.LongFieldDataType$1@24f99e97>!]:
Query Failed [Failed to execute main query]

    at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:182)
    at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:234)
    at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:497)
    at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:486)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:268)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.EOFException: read past EOF: NIOFSIndexInput(path="/var/data/elasticsearch/nodes/0/indices/theindex/2/index/_161lvl.tis")
    at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:264)
    at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:40)
    at org.apache.lucene.store.DataInput.readVInt(DataInput.java:107)
    at org.apache.lucene.store.BufferedIndexInput.readVInt(BufferedIndexInput.java:217)

What I've tried:

  • Changed replication to 0
  • Closed/Opened the index (to force rebalancing)
  • Restarted the node containing shard 2
    with index.shard.check_on_startup: true
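
Roughly, the first two steps correspond to calls like these (a sketch only: the index name theindex is taken from the stack trace above, and the host/port are placeholders, so adjust to your setup):

    # drop replicas to 0 (number_of_replicas is a dynamic index setting)
    curl -XPUT 'http://localhost:9200/theindex/_settings' -d '{
      "index" : { "number_of_replicas" : 0 }
    }'

    # close and reopen the index
    curl -XPOST 'http://localhost:9200/theindex/_close'
    curl -XPOST 'http://localhost:9200/theindex/_open'

    # and in elasticsearch.yml on the node holding shard 2, before restarting it:
    #   index.shard.check_on_startup: true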

This seems to point to something having gone bad with the Lucene index of this
shard, but the check_on_startup setting didn't seem to solve the problem.
Does anyone know how to get around this?

Much appreciated.
Dimitry.

--

Hi Dimitry,

Read past EOF? I never got that. But here's my braindump, nevertheless:

If there's a problem with shard 2 on all nodes, I don't know what you can
do to recover the data, other than reindex or restore from backup.

If you get this issue only on one server, then I'd try something like this:

  • shut down the problematic node
  • reduce the number of replicas by 1
  • move /var/data/elasticsearch/nodes/0/indices/theindex/2/ to some backup
    location
  • start the node again
  • increase the number of replicas back to force replication
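
As rough commands (the node name, replica counts, and host/port are placeholders here, so adjust them to your setup):

    # 1. shut down the problematic node
    curl -XPOST 'http://localhost:9200/_cluster/nodes/<problem-node>/_shutdown'

    # 2. reduce the number of replicas by 1 (example: from 1 down to 0)
    curl -XPUT 'http://localhost:9200/theindex/_settings' -d '{
      "index" : { "number_of_replicas" : 0 }
    }'

    # 3. on the stopped node, move the shard data out of the way
    mv /var/data/elasticsearch/nodes/0/indices/theindex/2 /path/to/backup/location/

    # 4. start the node again, then put the replica count back to force replication
    curl -XPUT 'http://localhost:9200/theindex/_settings' -d '{
      "index" : { "number_of_replicas" : 1 }
    }'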

I would assume that if this doesn't fix it, it's either:

  • a problem with the hardware on that node. You can check memory & HDD, or
    try to reproduce on another machine to confirm/deny
  • a problem with shard 2 of that index across all the nodes, in which case
    you'd be back to reindexing/restoring from backup
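
For the disk side, something like this gives a quick indication (the device name is a placeholder; a proper memory check such as memtest86+ needs a reboot):

    # SMART health summary of the disk holding /var/data
    smartctl -H /dev/sdX

    # look for I/O errors in the kernel log
    dmesg | grep -i error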

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene

--

Running

    java -cp lucene-core-3.6.1.jar -ea:org.apache.lucene... \
      org.apache.lucene.index.CheckIndex \
      /home/es/data/production/nodes/0/indices/users/1/index/ -fix

did the trick, as reported in this post by Marcin Dojwa:
https://groups.google.com/forum/?fromgroups=#!topic/elasticsearch/xprRlA8RQ90
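
For anyone else hitting this, a rough sketch of applying the same thing to the shard from the stack trace above (the backup location is just an example; note that -fix drops any segments CheckIndex cannot read, so the documents in them are lost):

    # stop Elasticsearch on the node holding the broken shard copy,
    # then back up the shard's Lucene index before letting CheckIndex touch it
    cp -a /var/data/elasticsearch/nodes/0/indices/theindex/2/index /path/to/backup/theindex-2-index

    # run CheckIndex with Lucene assertions enabled;
    # -fix rewrites the segments file without the unreadable segments
    java -cp lucene-core-3.6.1.jar -ea:org.apache.lucene... \
      org.apache.lucene.index.CheckIndex \
      /var/data/elasticsearch/nodes/0/indices/theindex/2/index -fix

    # start Elasticsearch on the node again afterwards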

Thanks to all.
