NullPointerException and unresponsive cluster afterwards


(Michel Conrad) #1

Hi,
I got the following NullPointerException while running branch
0.16 (commit ff90873a).
Afterwards the cluster no longer responds.

Exception in thread "elasticsearch[Poison]gateway-pool-17-thread-77" java.lang.NullPointerException
at org.elasticsearch.gateway.local.LocalGateway$1.run(LocalGateway.java:231)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

On that node I am getting the following log entries afterwards:

[2011-06-14 09:49:07,333][WARN ][cluster.action.shard ] [Poison] received shard failed for [searchtest][0], node[cFpVDyPkSKqVqP8sOlTsPQ], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[searchtest][0] failed recovery]; nested: EngineCreationFailureException[[searchtest][0] Failed to create engine]; nested: LockReleaseFailedException[Cannot forcefully unlock a NativeFSLock which is held by another indexer component: /hd1/elasticsearch_data/search/nodes/0/indices/searchtest/0/index/write.lock]; ]]
[2011-06-14 09:49:07,381][WARN ][index.engine.robin ] [Poison] [searchtest][0] shard is locked, releasing lock
[2011-06-14 09:49:07,381][WARN ][indices.cluster ] [Poison] [searchtest][0] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [searchtest][0] failed recovery
at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:194)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: org.elasticsearch.index.engine.EngineCreationFailureException: [searchtest][0] Failed to create engine
at org.elasticsearch.index.engine.robin.RobinEngine.start(RobinEngine.java:226)
at org.elasticsearch.index.shard.service.InternalIndexShard.start(InternalIndexShard.java:243)
at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:113)
at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:144)
... 3 more
Caused by: org.apache.lucene.store.LockReleaseFailedException: Cannot forcefully unlock a NativeFSLock which is held by another indexer component: /hd1/elasticsearch_data/search/nodes/0/indices/searchtest/0/index/write.lock
at org.apache.lucene.store.NativeFSLock.release(NativeFSLockFactory.java:294)
at org.apache.lucene.index.IndexWriter.unlock(IndexWriter.java:4277)
at org.elasticsearch.index.engine.robin.RobinEngine.createWriter(RobinEngine.java:986)
at org.elasticsearch.index.engine.robin.RobinEngine.start(RobinEngine.java:224)
... 6 more

When I try to search on another node I am getting no response and the
following log entry:

[2011-06-14 09:58:13,325][WARN ][netty.channel.socket.nio.NioServerSocketPipelineSink] Failed to accept a connection.
java.io.IOException: Too many open files
at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:145)
at org.elasticsearch.common.netty.channel.socket.nio.NioServerSocketPipelineSink$Boss.run(NioServerSocketPipelineSink.java:244)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:44)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

Best regards,
Michel


(Shay Banon) #2

You need to increase the max number of open files in the OS.
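
For reference, here is a minimal Python sketch of how a process can inspect its own open-file limit and raise the soft limit up to the hard limit. It is only an illustration: the helper names `nofile_limits` and `raise_soft_limit_to_hard` are made up for this example, it assumes a Unix-like OS where the standard `resource` module is available, and it affects only the Python process that runs it. Elasticsearch itself inherits its limit from the shell or service that starts it, so the limit would still have to be raised there.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_limit_to_hard():
    """Try to raise the soft limit to the hard limit; return the resulting pair.

    An unprivileged process may raise its soft limit only up to the hard limit.
    If the OS rejects the request, the limits are left unchanged.
    """
    soft, hard = nofile_limits()
    if soft != hard:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
        except (ValueError, OSError):
            pass  # request refused by the OS; keep the current limits
    return nofile_limits()


if __name__ == "__main__":
    print("before:", nofile_limits())
    print("after: ", raise_soft_limit_to_hard())
```

Raising the hard limit itself requires privileges and is normally done in the OS configuration rather than from inside the process.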


(Michel Conrad) #3

Hi,
I forgot to mention that my ulimit is already set to 32768.

/software/elasticsearch/bin/elasticsearch -f -Des.max-open-files=true
[2011-06-14 11:01:52,895][INFO ][bootstrap ] max_open_files [32626]

Best,
Michel


(Shay Banon) #4

Seems like you are still running out of file descriptors. Do you start elasticsearch the same way you run it below? How many indices / shards do you create?


(Michel Conrad) #5

I start elasticsearch exactly the same way. I currently have 318
shards, and some indices are created programmatically on demand.


(Shay Banon) #6

You seem to be running out of file descriptors, so I would suggest using lsof to check the number of open files used. Note that a socket is also a "file descriptor", so maybe also check the number of open sockets.
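
As a rough illustration of that check, the sketch below counts a process's open descriptors by kind, similar in spirit to what `lsof -p <pid>` reports. It assumes Linux, where each process exposes its descriptors under `/proc/<pid>/fd`; the function name `count_descriptors` is invented for this example, and reading another user's process usually needs matching permissions.

```python
import os
import socket
from collections import Counter


def count_descriptors(pid="self"):
    """Count open descriptors of a process by kind, using Linux /proc/<pid>/fd."""
    fd_dir = f"/proc/{pid}/fd"
    kinds = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed between listdir and readlink
        if target.startswith("socket:"):
            kinds["socket"] += 1
        elif target.startswith("pipe:"):
            kinds["pipe"] += 1
        elif target.startswith("anon_inode:"):
            kinds["anon_inode"] += 1
        else:
            kinds["file"] += 1
    return kinds


if __name__ == "__main__":
    # Opening a socket consumes a descriptor just like opening a file does.
    s = socket.socket()
    print(dict(count_descriptors()))
    s.close()
```

Passing the Elasticsearch process ID instead of the default `"self"` and comparing the total against the reported `max_open_files` would show how close the node is to its limit, and whether files or sockets dominate.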


(Michel Conrad) #7

Hi, I will further investigate the issue if it reoccurs.
Thanks for the quick answer.

