Background merge hit exception


(Christian Pesch) #1

Hi,

I'm running ElasticSearch 0.11.0 with three machines and about 30G of
disk usage. Yesterday, this dropped to 12G and I'm seeing errors:

java.io.IOException: directory '/mnt/elasticsearch/work/warp/nodes/0/indices/mauritius-comments/3/index' exists and is a directory, but cannot be listed: list() returned null
  at org.elasticsearch.index.store.fs.NioFsStore.<init>(NioFsStore.java:50)
  while locating org.elasticsearch.index.store.fs.NioFsStore
  while locating org.elasticsearch.index.store.Store
    for parameter 5 at org.elasticsearch.index.gateway.IndexShardGatewayService.<init>(IndexShardGatewayService.java:76)
  while locating org.elasticsearch.index.gateway.IndexShardGatewayService
Caused by: java.io.IOException: directory '/mnt/elasticsearch/work/warp/nodes/0/indices/mauritius-comments/3/index' exists and is a directory, but cannot be listed: list() returned null
  at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:234)
  at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:245)
  at org.elasticsearch.index.store.support.AbstractStore$StoreDirectory.<init>(AbstractStore.java:122)

I suspect that the open-files limit, which is 16k now, is too low:

java.io.IOException: Too many open files
  at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
  at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:163)
  at org.elasticsearch.common.netty.channel.socket.nio.NioServerSocketPipelineSink$Boss.run(NioServerSocketPipelineSink.java:245)
  at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
  at org.elasticsearch.common.netty.util.internal.IoWorkerRunnable.run(IoWorkerRunnable.java:46)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
  at java.lang.Thread.run(Thread.java:636)
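As a quick sanity check, the limit can be compared against the descriptors a node actually holds. This is a minimal sketch assuming a Linux box; the process name "elasticsearch" passed to pgrep is an assumption about how the node was launched:

```shell
# Show the soft limit on open files for the current shell
ulimit -n

# Count descriptors held by a running node, if one is found
# (matching on the name "elasticsearch" is an assumption)
pid=$(pgrep -f elasticsearch | head -n 1)
[ -n "$pid" ] && ls "/proc/$pid/fd" | wc -l

# Try to raise the soft limit for this shell; if the hard limit is
# lower, it must first be raised (e.g. in /etc/security/limits.conf)
ulimit -n 32768 2>/dev/null || echo 'hard limit too low; raise it as root'
```

Note that sockets count against the same limit as files, so a busy node needs headroom beyond its index files alone.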

Health seems to be fine:

$ curl -XGET 'http://localhost:9300/_cluster/health?pretty=true'
{
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 8,
  "active_primary_shards" : 10,
  "active_shards" : 20,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0
}

My problem is: how do I recover the data? It seems to still be on the
machines but is mostly inaccessible. Optimizing shows the following
interesting error:

[2010-10-26 10:49:58,314][DEBUG][action.admin.indices.optimize] [Nova-Prime] [mauritius-comments][1], node[14f7b134-368e-4ad5-9c27-b1cdcd15fba8], [R], s[STARTED]: Failed to execute [org.elasticsearch.action.admin.indices.optimize.OptimizeRequest@195bed9d]
org.elasticsearch.transport.RemoteTransportException: [Headknocker][inet[/10.235.38.207:9301]][indices/optimize/shard]
Caused by: org.elasticsearch.index.engine.OptimizeFailedEngineException: [mauritius-comments][1] Optimize failed
  at org.elasticsearch.index.engine.robin.RobinEngine.optimize(RobinEngine.java:461)
  at org.elasticsearch.index.shard.service.InternalIndexShard.optimize(InternalIndexShard.java:386)
  at org.elasticsearch.action.admin.indices.optimize.TransportOptimizeAction.shardOperation(TransportOptimizeAction.java:108)
  at org.elasticsearch.action.admin.indices.optimize.TransportOptimizeAction.shardOperation(TransportOptimizeAction.java:50)
  at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:399)
  at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:392)
  at org.elasticsearch.transport.netty.MessageChannelHandler$3.run(MessageChannelHandler.java:195)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
  at java.lang.Thread.run(Thread.java:636)
Caused by: java.io.IOException: background merge hit exception: _6i8:C66898 _kpg:C105747 _aw9:C17593 _kpe:C151833 _kxw:C20226 into _kxy [optimize] [mergeDocStores]
  at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2359)
  at org.elasticsearch.index.engine.robin.RobinEngine.optimize(RobinEngine.java:455)
  ... 9 more
Caused by: java.lang.ArrayIndexOutOfBoundsException

And that's all. Any help is greatly appreciated.


(Michael McCandless) #2

This normally should not lead to corruption of the Lucene index.

But, it's possible this bug:

https://issues.apache.org/jira/browse/LUCENE-2593

is causing the corruption. That case was disk full, but I suspect running
out of file descriptors could also lead to this.
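One way to check whether the shard really is corrupted at the Lucene level is the CheckIndex tool that ships with Lucene. This is a sketch under assumptions (the lucene-core jar name and the shard path are taken from this thread, not verified against your installation); stop the node and take a copy of the index directory before running it:

```shell
# Inspect a possibly-corrupted shard with Lucene's CheckIndex tool.
# Read-only unless -fix is given; -fix drops unreadable segments,
# losing the documents in them.
JAR=lucene-core-3.0.2.jar
SHARD=/mnt/elasticsearch/work/warp/nodes/0/indices/mauritius-comments/3/index
if [ -f "$JAR" ] && [ -d "$SHARD" ]; then
  java -cp "$JAR" org.apache.lucene.index.CheckIndex "$SHARD"
else
  echo "adjust JAR and SHARD to your installation"
fi
```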

Does ElasticSearch use 3.0.1 or the tip of Lucene's 3.0.x branch...?

Mike

On Tue, Oct 26, 2010 at 7:49 AM, Christian Pesch cpesch@gmail.com wrote:



(Lukáš Vlček) #3

It uses 3.0.2
http://github.com/elasticsearch/elasticsearch/blob/master/modules/elasticsearch/build.gradle#L38

Lukas

On Tue, Oct 26, 2010 at 7:01 PM, Michael McCandless mail@mikemccandless.com wrote:



(Michael McCandless) #4

OK, 3.0.2 does not have the fix for LUCENE-2593 (that fix landed after 3.0.2
was released).

But the fix was back-ported to the 3.0.x branch tip...

Mike

On Tue, Oct 26, 2010 at 2:47 PM, Lukáš Vlček lukas.vlcek@gmail.com wrote:



(Shay Banon) #5

Hi,

Mike, nice to have you on the list! I feel safer already :).

Christian: can you try restarting the node in question? If that doesn't help, the
whole cluster? This still might be recoverable from another shard replica.
Which gateway are you using, local or shared (fs/...)?

Regarding the open files: this is a change I made in the elasticsearch
defaults, so that by default the index does not use the Lucene compound
file format. While this leads to more open file descriptors, its benefits
are vast (mostly IO, but also CPU). This is how I would run elasticsearch in
production, but it does require setting the open file descriptor limit to a
higher value (it depends on the number of shards that end up on each node, but
usually 32k is a good starting number; note that file descriptors are not
only files, but also sockets and so on).
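In concrete terms, on a Linux system using PAM limits the persistent descriptor limit would be set roughly like this (a sketch; the user name "elasticsearch" is an assumption about your setup):

```
# /etc/security/limits.conf -- allow the elasticsearch user 32k descriptors
elasticsearch  soft  nofile  32768
elasticsearch  hard  nofile  32768
```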

You can change that back by setting index.merge.policy.use_compound_file to
true.
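If descriptor pressure is the immediate problem, the setting above can go in the node configuration; this fragment assumes elasticsearch.yml as the config file:

```
# elasticsearch.yml -- go back to Lucene's compound file format,
# trading some IO/CPU benefit for far fewer open descriptors
index.merge.policy.use_compound_file: true
```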

-shay.banon

On Tue, Oct 26, 2010 at 8:59 PM, Michael McCandless mail@mikemccandless.com wrote:



(Christian Pesch) #6

Hi Shay,

I've restarted the node in question - nothing changed. Then the whole
cluster - nothing changed. Then I tried to optimize, to somehow let the
data emerge, but it didn't show up, although the disk usage indicates
that it might still be there. Optimize with max_num_segments=1 then
showed the exceptions.
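For reference, the optimize call described above looks roughly like this. The index name comes from this thread; port 9200, the default HTTP port, is an assumption about the setup:

```shell
# Optimize the index down to a single segment via the HTTP API;
# prints a note instead of failing when no node is listening
out=$(curl -s -XPOST \
  'http://localhost:9200/mauritius-comments/_optimize?max_num_segments=1' \
  || echo 'no node listening on localhost:9200')
echo "$out"
```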

I'm using the fs gateway.

Kind regards
Christian

On 26 Oct, 21:24, Shay Banon shay.ba...@elasticsearch.com wrote:



(Michael McCandless) #7

On Tue, Oct 26, 2010 at 3:24 PM, Shay Banon
shay.banon@elasticsearch.com wrote:

Mike, nice to have you on the list!, I feel safer already :).

Glad to be here! This is cool stuff you all are doing w/ ElasticSearch :)

Mike


(Shay Banon) #8

Is there a chance you still have the log files from when it happened? I would
like to see what might have caused the index to get into this state, and maybe
be able to improve it. I think you will need to reindex the data...
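For this vintage of ElasticSearch there was no built-in reindex API, so reindexing meant reading documents back out (ideally from the original source of truth, or from a still-searchable index) and writing them into a fresh index. A minimal hypothetical sketch over the HTTP API; the host/port, index names, and helper names here are assumptions, and pagination and error handling are omitted:

```python
import json
import urllib.request

BASE = "http://localhost:9200"  # assumed HTTP endpoint; adjust to your cluster


def doc_url(base, index, doc_type, doc_id):
    """Build the per-document URL used to re-PUT a document."""
    return f"{base}/{index}/{doc_type}/{doc_id}"


def fetch_docs(index, size=100):
    """Pull up to `size` documents back out via the search API."""
    with urllib.request.urlopen(f"{BASE}/{index}/_search?size={size}") as resp:
        hits = json.load(resp)["hits"]["hits"]
    return [(h["_type"], h["_id"], h["_source"]) for h in hits]


def reindex(src_index, dst_index):
    """Re-PUT every retrievable document into a fresh index."""
    for doc_type, doc_id, source in fetch_docs(src_index):
        req = urllib.request.Request(
            doc_url(BASE, dst_index, doc_type, doc_id),
            data=json.dumps(source).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="PUT",
        )
        urllib.request.urlopen(req).close()
```

Note this only recovers documents that are still searchable on a surviving shard or replica; anything living solely in the corrupted shard is gone unless the original data source can be replayed.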

On Tue, Oct 26, 2010 at 10:22 PM, Christian Pesch cpesch@gmail.com wrote:

Hi Shay,

I've restarted the node in question - nothing changed. Then the whole
cluster - nothing changed. Then I tried to optimize to somehow let the
data emerge, but the data didn't show up, although the disk usage
indicates that it might still be there. Optimize with max_num_segments=1
then showed the exceptions.

I'm using the fs gateway.

Kind regards
Christian



(Christian Pesch) #9

I'm sorry, I don't have access to the logs anymore. The cause was the
classic too-many-open-files exception. If it happens again, I'll update
this thread.
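The standard mitigation is raising the per-process open-files limit for the user running ElasticSearch (e.g. `ulimit -n` in the startup shell, or `/etc/security/limits.conf` on Linux) and then verifying what the process actually got. A small illustrative check, Python here purely for illustration; the 32k threshold echoes the starting point suggested earlier in this thread, not a universal constant:

```python
import resource  # Unix-only module

# RLIMIT_NOFILE is the per-process cap on open file descriptors;
# note that it counts sockets as well as plain files.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open files: soft limit {soft}, hard limit {hard}")

SUGGESTED = 32000  # starting point suggested in this thread for non-compound indices
if soft < SUGGESTED:
    print(f"soft limit {soft} is below {SUGGESTED}; "
          "expect 'Too many open files' under merge-heavy load")
```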



#10

On Wed, Oct 27, 2010 at 5:07 AM, Christian Pesch cpesch@gmail.com wrote:

I'm sorry, I don't have access to the logs anymore. The cause was the
classical
too-many-open-files exception. If it happens again, I'm updating this
thread.

But this shouldn't corrupt your index!

I think I agree with Mike that it would be better for ElasticSearch
to use http://svn.apache.org/repos/asf/lucene/java/branches/lucene_3_0/
than Lucene 3.0.2: we think/hope this is treated like a disk-full
exception and really is the LUCENE-2593 bug, and if that's the case, by
defaulting to the non-compound file format, ElasticSearch users will be
more prone to hitting it.

These release branches are really stable, bugfixes-only, you can see
the changes here:
http://svn.apache.org/repos/asf/lucene/java/branches/lucene_3_0/CHANGES.txt

Separately, I'll see if we can beef up our unit tests to simulate this
error when a file is opened (since I think we only mock the disk-full
case happening in IndexOutput.writeBytes, and maybe there really is a
bug).
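The kind of fault injection described above — failing the *open* path rather than only writes — can be sketched outside Lucene too. A toy Python illustration (not Lucene's actual MockDirectory-style test machinery) that makes file opens start failing once a budget is exhausted, so a test can assert the code under test degrades cleanly instead of corrupting its data:

```python
import builtins


class OpenFailureInjector:
    """Temporarily wrap builtins.open so that every open() beyond a
    budget raises IOError, simulating file-descriptor exhaustion."""

    def __init__(self, allowed_opens):
        self.allowed = allowed_opens
        self.count = 0
        self._real_open = builtins.open

    def _failing_open(self, *args, **kwargs):
        self.count += 1
        if self.count > self.allowed:
            # Mirrors the errno message seen in the logs above.
            raise IOError("Too many open files (injected)")
        return self._real_open(*args, **kwargs)

    def __enter__(self):
        builtins.open = self._failing_open
        return self

    def __exit__(self, *exc_info):
        builtins.open = self._real_open
        return False  # let any injected exception propagate
```

Used as `with OpenFailureInjector(allowed_opens=1): ...`, the first open() succeeds and the second raises — roughly the shape of mid-merge failure that the LUCENE-2593 scenario needs the index-writing code to survive.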


(system) #11