I tried to index about a million documents on a two-server cluster and ran into the stock Fedora open-files limit of 1024. Should I be bumping up the limit, or could ElasticSearch be closing some of its file descriptors? I'll post whatever I can about my setup below.
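For what it's worth, this is roughly how I'm checking what limit the process actually sees; a minimal Python sketch (the resource module is standard library, and raising the soft limit this way only sticks for the one process, so presumably the real fix is system-wide):

import resource

# Current soft/hard limits on open file descriptors for this process:
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft=%d hard=%d" % (soft, hard))   # stock Fedora gives me soft=1024

# The soft limit can be raised as far as the hard limit without root:
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))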
elasticsearch.yml
cluster:
  name: cluster_name
network:
  publishHost: 192.168.x.x
  bindHost: 192.168.x.x
index:
  numberOfShards: 1
  numberOfReplicas: 0
  store:
    type: niofs
The content I'm indexing is news stories, broken up into one-week indices (index names look like 201014 for the 14th week of 2010). It looks like the server that hit the ulimit first was holding 88 indices (some rough fd math follows the first trace below). That server started logging stacks like:
[00:00:27,735][WARN ][netty.lib.channel.socket.nio.NioServerSocketPipelineSink] Failed to accept a connection.
java.io.IOException: Too many open files
    at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
    at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:145)
    at org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink$Boss.run(NioServerSocketPipelineSink.java:227)
    at org.jboss.netty.util.internal.IoWorkerRunnable.run(IoWorkerRunnable.java:46)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)
And the other node started logging:
[16:17:34,176][WARN ][cluster.action.shard ] [Carnage] Received shard failed for [200844][0], Node[localhost-8066], Relocating [localhost-38937], [P], S[INITIALIZING], reason [Failed to start shard, message [RecoveryFailedException[Index Shard [200844][0]: Recovery failed from [Carnage][localhost-38937][data][inet[/192.168.3.36:9300]] into [Maestro][localhost-8066][data][inet[192.168.3.37/192.168.3.37:9300]]];
  nested: RemoteTransportException[[Carnage][inet[/192.168.3.36:9300]][200844/0/recovery/start]];
  nested: RecoveryEngineException[[200844][0] Phase[1] Execution failed];
  nested: RecoverFilesRecoveryException[[200844][0] Failed to transfer [1] files with total size of [32b]];
  nested: RemoteTransportException[[Maestro][inet[/192.168.3.37:9300]][200844/0/recovery/fileChunk]];
  nested: FileNotFoundException[/usr/local/elasticsearch-0.6.0-20100406/work/clipsyndicate_development/indices/localhost-8066/200844/0/index/segments_1 (Too many open files)]; ]]
The node that didn't hit its ulimit reports this from /_cluster/health (the other one won't respond at all):
{"status":"green","timed_out":false,"active_primary_shards":170,"active_shards":170,"relocating_shards":1}
Here is the lsof output for the ElasticSearch process on each box:
http://www.divshare.com/download/10994958-cee
http://www.divshare.com/download/10994960-351
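If anyone wants to spot-check without grabbing the dumps, counting the entries in /proc/<pid>/fd gives a close approximation of what lsof shows; a quick sketch (Linux-only, pass the elasticsearch java pid):

import os
import sys

pid = int(sys.argv[1])                 # pid of the elasticsearch java process
fds = os.listdir("/proc/%d/fd" % pid)  # one entry per open descriptor
print("%d open file descriptors" % len(fds))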
This was running on master from GitHub, updated yesterday morning (April 6).