Too Many Open Files


(Matt-2) #1

I tried to index about a million documents on a cluster of two servers
and ran into my stock Fedora open-files limit of 1024. Should I be
bumping up the limit, or should Elasticsearch be closing some file
descriptors? I'll post whatever I can about my setup below.

elasticsearch.yml

cluster:
  name: cluster_name

network:
  publishHost: 192.168.x.x
  bindHost: 192.168.x.x

index:
  numberOfShards: 1
  numberOfReplicas: 0
  store:
    type: niofs

The content I'm indexing is news stories and I broke them up into one-
week indices (index names will be something like 201014 for the 14th
week in 2010). It looks like the server that hit the ulimit first was
holding 88 indices. That server started logging stacks like:

[00:00:27,735][WARN ][netty.lib.channel.socket.nio.NioServerSocketPipelineSink] Failed to accept a connection.
java.io.IOException: Too many open files
    at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
    at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:145)
    at org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink$Boss.run(NioServerSocketPipelineSink.java:227)
    at org.jboss.netty.util.internal.IoWorkerRunnable.run(IoWorkerRunnable.java:46)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)

And the other node started logging:

[16:17:34,176][WARN ][cluster.action.shard ] [Carnage] Received shard failed for [200844][0], Node[localhost-8066], Relocating [localhost-38937], [P], S[INITIALIZING], reason [Failed to start shard, message [RecoveryFailedException[Index Shard [200844][0]: Recovery failed from [Carnage][localhost-38937][data][inet[/192.168.3.36:9300]] into [Maestro][localhost-8066][data][inet[192.168.3.37/192.168.3.37:9300]]]; nested: RemoteTransportException[[Carnage][inet[/192.168.3.36:9300]][200844/0/recovery/start]]; nested: RecoveryEngineException[[200844][0] Phase[1] Execution failed]; nested: RecoverFilesRecoveryException[[200844][0] Failed to transfer [1] files with total size of [32b]]; nested: RemoteTransportException[[Maestro][inet[/192.168.3.37:9300]][200844/0/recovery/fileChunk]]; nested: FileNotFoundException[/usr/local/elasticsearch-0.6.0-20100406/work/clipsyndicate_development/indices/localhost-8066/200844/0/index/segments_1 (Too many open files)]; ]]

The node that didn't hit its ulimit reports this from /_cluster/health:

{"status":"green","timed_out":false,"active_primary_shards":170,"active_shards":170,"relocating_shards":1}

...and the other one won't respond. Here is the lsof output for
the Elasticsearch process on each box:

http://www.divshare.com/download/10994958-cee
http://www.divshare.com/download/10994960-351

This was running on the master from GitHub, updated yesterday morning,
April 6.


(Shay Banon) #2

1024 is too low. With all the sockets and files flying around (elasticsearch
does close the ones that are no longer needed), it should be at least 8k,
preferably 16k.
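
For instance, on Linux that could be set along these lines (the service user name and values here are illustrative, not a prescribed configuration):

```shell
# /etc/security/limits.conf -- picked up by PAM on new login sessions:
#   elasticsearch  soft  nofile  16384
#   elasticsearch  hard  nofile  16384

# Or raise it just for the current shell before starting the node
# (raising the hard limit generally requires root):
ulimit -n 16384 2>/dev/null || echo "could not raise limit (need root?)"

# Verify what child processes started from this shell will inherit:
ulimit -n
```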

-shay.banon



(Matt-2) #3

OK. I'll up the limit and try the test again.



(hunter) #4

Hi, we're running on the Amazon Linux AMI (CentOS-based) with Elasticsearch 0.15 and seeing "too many open files" errors.
I've already set the open-files ulimit to 64k for the elasticsearch user in /etc/security/limits.conf, and verified it with:

su - elasticsearch; ulimit -a

output:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 59722
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 64000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 1024
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
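
One thing worth checking: limits.conf is applied by PAM at login, so a daemon launched from an init script may not inherit the new limit even though `su - elasticsearch; ulimit -a` shows 64000. The limit the running process actually has can be read from /proc (the pgrep pattern is illustrative; adjust for how the service is started):

```shell
# Find the running Elasticsearch JVM (pattern is a guess; adjust as needed)
pid=$(pgrep -f elasticsearch | head -n1)

# Fall back to the current shell just to demonstrate the technique
[ -n "$pid" ] || pid=$$

# "Max open files" shows the soft and hard limits the process inherited
grep 'Max open files' "/proc/$pid/limits"
```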

The errors in elasticsearch.log look like this, repeating every one to two seconds:


[2011-02-23 17:50:27,332][WARN ][index.shard.service ] [Madam Slay] [visibiz-test][2] Failed to perform scheduled engine refresh
org.elasticsearch.index.engine.RefreshFailedEngineException: [visibiz-test][2] Refresh failed
at org.elasticsearch.index.engine.robin.RobinEngine.refresh(RobinEngine.java:629)
at org.elasticsearch.index.shard.service.InternalIndexShard$EngineRefresher$1.run(InternalIndexShard.java:620)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.FileNotFoundException: /ebs/var/lib/elasticsearch/elasticsearch/nodes/0/indices/visibiz-test/2/index/_iyc.fdt (Too many open files)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.&lt;init&gt;(RandomAccessFile.java:212)
at org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput$Descriptor.&lt;init&gt;(SimpleFSDirectory.java:76)
at org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput.&lt;init&gt;(SimpleFSDirectory.java:97)
at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.&lt;init&gt;(NIOFSDirectory.java:98)
at org.apache.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:78)
at org.elasticsearch.index.store.support.AbstractStore$StoreDirectory.openInput(AbstractStore.java:268)
at org.apache.lucene.index.FieldsReader.&lt;init&gt;(FieldsReader.java:108)
at org.apache.lucene.index.SegmentReader$CoreReaders.openDocStores(SegmentReader.java:292)
at org.apache.lucene.index.SegmentReader.openDocStores(SegmentReader.java:627)
at org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:630)
at org.apache.lucene.index.IndexWriter$ReaderPool.getReadOnlyClone(IndexWriter.java:580)
at org.apache.lucene.index.DirectoryReader.&lt;init&gt;(DirectoryReader.java:154)
at org.apache.lucene.index.ReadOnlyDirectoryReader.&lt;init&gt;(ReadOnlyDirectoryReader.java:36)
at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:411)
at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:376)
at org.apache.lucene.index.DirectoryReader.doReopenFromWriter(DirectoryReader.java:385)
at org.apache.lucene.index.DirectoryReader.doReopen(DirectoryReader.java:396)
at org.apache.lucene.index.DirectoryReader.reopen(DirectoryReader.java:363)
at org.elasticsearch.index.engine.robin.RobinEngine.refresh(RobinEngine.java:613)
... 4 more
[2011-02-23 17:50:27,751][WARN ][netty.channel.socket.nio.NioServerSocketPipelineSink] Failed to accept a connection.
java.io.IOException: Too many open files
at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:152)
at org.elasticsearch.common.netty.channel.socket.nio.NioServerSocketPipelineSink$Boss.run(NioServerSocketPipelineSink.java:244)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:44)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[2011-02-23 17:50:27,851][WARN ][index.shard.service ] [Madam Slay] [visibiz-test][3] Failed to perform scheduled engine refresh
org.elasticsearch.index.engine.RefreshFailedEngineException: [visibiz-test][3] Refresh failed
at org.elasticsearch.index.engine.robin.RobinEngine.refresh(RobinEngine.java:629)
at org.elasticsearch.index.shard.service.InternalIndexShard$EngineRefresher$1.run(InternalIndexShard.java:620)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.FileNotFoundException: /ebs/var/lib/elasticsearch/elasticsearch/nodes/0/indices/visibiz-test/3/index/_ix0.fdt (Too many open files)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.&lt;init&gt;(RandomAccessFile.java:212)
at org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput$Descriptor.&lt;init&gt;(SimpleFSDirectory.java:76)
at org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput.&lt;init&gt;(SimpleFSDirectory.java:97)
at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.&lt;init&gt;(NIOFSDirectory.java:98)
at org.apache.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:78)
at org.elasticsearch.index.store.support.AbstractStore$StoreDirectory.openInput(AbstractStore.java:268)
at org.apache.lucene.index.FieldsReader.&lt;init&gt;(FieldsReader.java:108)
at org.apache.lucene.index.SegmentReader$CoreReaders.openDocStores(SegmentReader.java:292)
at org.apache.lucene.index.SegmentReader.openDocStores(SegmentReader.java:627)
at org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:630)
at org.apache.lucene.index.IndexWriter$ReaderPool.getReadOnlyClone(IndexWriter.java:580)
at org.apache.lucene.index.DirectoryReader.&lt;init&gt;(DirectoryReader.java:154)
at org.apache.lucene.index.ReadOnlyDirectoryReader.&lt;init&gt;(ReadOnlyDirectoryReader.java:36)
at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:411)
at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:376)
at org.apache.lucene.index.DirectoryReader.doReopenFromWriter(DirectoryReader.java:385)
at org.apache.lucene.index.DirectoryReader.doReopen(DirectoryReader.java:396)
at org.apache.lucene.index.DirectoryReader.reopen(DirectoryReader.java:363)
at org.elasticsearch.index.engine.robin.RobinEngine.refresh(RobinEngine.java:613)
... 4 more
[... the same accept-failure and refresh-failure traces repeat ...]

Is it possible that our Java code is not closing connections properly? Any thoughts would be greatly appreciated. Thanks!
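
One quick way to see whether sockets or Lucene index files dominate is to classify the process's fd table under /proc (using the current shell's pid here as a stand-in; substitute the Elasticsearch pid):

```shell
# Count and classify a process's open descriptors via /proc.
# ($$ is a stand-in so the snippet runs anywhere; use the real pid.)
pid=$$
echo "total open fds: $(ls /proc/$pid/fd | wc -l)"

# Sockets/pipes resolve to "socket:[inode]" etc.; files resolve to paths.
# Collapse paths to "file" and strip inode suffixes, then tally by type.
for fd in /proc/$pid/fd/*; do
  readlink "$fd"
done | sed -e 's|^/.*|file|' -e 's|:.*||' | sort | uniq -c | sort -rn
```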

Hunter

