Re: Too many open files with many concurrent index requests


(Andreas Bauer) #1

A quick update on what we tried.

First I played around with kernel settings in /proc/sys/net, using
http://www.speedguide.net/articles/linux-tweaking-121 as a guideline.

Setting /proc/sys/net/ipv4/tcp_tw_recycle to 1 helped a little, but in
the end just delayed the breakdown.
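For reference, the tweak boils down to the following (needs root; tcp_tw_recycle is a 2.6-era kernel knob that was removed from later kernels, so treat this as a historical sketch):

```shell
# Recycle TIME_WAIT sockets faster, as suggested by the speedguide article.
# Only delayed the breakdown for us -- it did not fix it.
sysctl -w net.ipv4.tcp_tw_recycle=1

# equivalently:
echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle
```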

I then flushed and optimized the index at regular intervals during the
reindex process; that delayed the breakdown further but still didn't
fix it completely.
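The flush/optimize step was essentially this, run between indexing batches (a sketch against a local 0.16-era node; the max_num_segments parameter is an assumption on my part):

```shell
# Flush the translog to disk, then force segments to merge down,
# so fewer index files stay open during the reindex.
curl -XPOST 'http://localhost:9200/_flush'
curl -XPOST 'http://localhost:9200/_optimize?max_num_segments=1'
```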

What ultimately 'solved' the problem was reducing the number of shards
of the index from the default 5 to 1. This drastically reduced the
number of open files during the reindex (as seen by lsof | grep
elasticsearch | wc -l). With 5 shards we saw starting values of ~400,
constantly growing, with the breakdown starting at ~950 open files.
With 1 shard this was reduced to ~30 open files at the start, growing
to ~150 during the reindex and staying there.
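For reference, a sketch of the recreate-with-one-shard step and the monitoring command (the index name is taken from the stacktrace further down; the replica count here is an assumed value):

```shell
# Recreate the index with a single shard instead of the default 5:
curl -XPUT 'http://localhost:9200/sheldon-index' -d '{
  "settings": { "index": { "number_of_shards": 1, "number_of_replicas": 1 } }
}'

# Watch the open file descriptor count while reindexing:
lsof | grep elasticsearch | wc -l
```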

We're currently running one index on one node. As we only started
using elasticsearch 2 weeks ago, I haven't looked into the whole
clustering part of it yet. What are the best practices there? Is it
common to have multiple shards of the same index on one node?

On Wed, Jul 20, 2011 at 4:10 PM, Till klimpong@gmail.com wrote:

I guess you're on AWS also, right? Are you guys on karmic still as
well, or are you using scalarium's lucid image?

No, we're running on custom self-managed servers, but we may expand to
AWS in the future.

Cheers,
Andreas

On Wed, Jul 20, 2011 at 6:58 PM, Shay Banon
shay.banon@elasticsearch.com wrote:

The number of open files depends on many factors: the size of the
cluster (sockets), the number of clients connected (sockets), and the
number of shards allocated on the node (index files), so it's hard to
tell where it's coming from... In 0.17, there is the maximum open
files limit in the nodes info, and the current open files in the nodes
stats; hopefully that will help us see what's going on...
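You can pull both with something like this (endpoint paths as I remember them for 0.17, so double-check against the docs):

```shell
# Nodes info -- includes the max_open_files limit per node:
curl 'http://localhost:9200/_cluster/nodes?pretty=true'

# Nodes stats -- includes the current open file descriptor count:
curl 'http://localhost:9200/_cluster/nodes/stats?pretty=true'
```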

On Wed, Jul 20, 2011 at 4:10 PM, Till klimpong@gmail.com wrote:

We keep having a very similar issue: whenever I pull up a new instance
on AWS, it bails with "too many open files" right away.

So far, I always checked with lsof and netstat etc., but saw nothing
out of the ordinary. After checking various things for 10-15 minutes,
I start elasticsearch again and then it continues to work. I'm not
sure if I just hit some sort of random capacity notch with AWS, or how
these things are related.

Anyway, one thing I noticed was that the suggested limit of 32000 in
the service-wrapper doesn't work for us at all, even though we're
currently writing to the index with only a single thread. The default
(for the root user) of 65xxx works much better for us.
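For anyone comparing: checking and pinning the limit looks roughly like this (the limits.conf lines are a sketch for Debian-family systems, and the user name is an assumption):

```shell
# Show the current per-process open-file limit for this shell/user:
ulimit -n

# To pin it for the user running elasticsearch, something like this in
# /etc/security/limits.conf (user name 'elasticsearch' assumed):
#   elasticsearch  soft  nofile  65536
#   elasticsearch  hard  nofile  65536
```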

I guess you're on AWS also, right? Are you guys on karmic still as
well, or are you using scalarium's lucid image?

Till

On Jul 19, 11:41 pm, Andreas Bauer andre...@moviepilot.com wrote:

Hey,

thanks for the quick answer.

The errors started with lsof | wc -l being around 4000 and lsof | grep
elasticsearch | wc -l around 900. I'm not sure about netstat -a | wc
-l, but I'm pretty sure it was below 1000.

I'll run more tests, watching netstat and trying to tweak settings in
/proc/sys/net.

On Tue, Jul 19, 2011 at 11:03 PM, Shay Banon

shay.ba...@elasticsearch.com wrote:

Maybe you are running out of sockets on the machine? lsof / netstat
should give you an idea of what's going on.

On Tue, Jul 19, 2011 at 10:20 PM, Andreas Bauer
andre...@moviepilot.com
wrote:

Hello.

We're using elasticsearch 0.16.3. While reindexing our data (currently
around 200k small documents), we're running into the 'Too many open
files' problem.

We're issuing a large number of single index requests in parallel
(around 150-200 per second) using resque. It works fine for a short
time, then we start seeing Connection resets and broken pipes, all
caused by Too many open files.

Ulimit for the user running elasticsearch is 64k:

~$ daemons/elasticsearch/bin/elasticsearch -f
-Des.max-open-files=true
[2011-07-19 21:10:15,264][INFO ][bootstrap ]
max_open_files [65514]

The problem doesn't occur when we reduce the number of indexing workers.

Is this a problem of elasticsearch or of netty? Would it be possible
to run elasticsearch in tomcat, and could that help with this issue?

Can we tweak elasticsearch options to increase stability? We tried
reducing the flush interval and decreasing the Lucene merge factor,
but this didn't help.

Please let us know if we can provide more information about our setup
that might be helpful in diagnosing the issue.

Thanks,
Andreas

Stacktrace example of a too many open files error:

[2011-07-19 12:45:12,408][WARN ][index.shard.service ] [Boneyard]
[nodes][2] Failed to perform scheduled engine refresh
org.elasticsearch.index.engine.RefreshFailedEngineException: [nodes][2] Refresh failed
    at org.elasticsearch.index.engine.robin.RobinEngine.refresh(RobinEngine.java:
    at org.elasticsearch.index.shard.service.InternalIndexShard.refresh(InternalIndexShard.java:403)
    at org.elasticsearch.index.shard.service.InternalIndexShard$EngineRefresher$1.run(InternalIndexShard.java:628)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:636)
Caused by: java.io.FileNotFoundException: /home/moviepilot/data/elasticsearch/sheldon-index/nodes/0/indices/nodes/2/index/_2dg.fdx (Too many open files)
    at java.io.RandomAccessFile.open(Native Method)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
    at org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput$Descriptor.<init>(SimpleFSDirectory.java:69)
    at org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput.<init>(SimpleFSDirectory.java:90)
    at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.<init>(NIOFSDirectory.java:91)
    at org.apache.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:
    at org.elasticsearch.index.store.support.AbstractStore$StoreDirectory.openInput(AbstractStore.java:344)
    at org.apache.lucene.index.FieldsReader.<init>(FieldsReader.java:129)
    at org.apache.lucene.index.SegmentReader$CoreReaders.openDocStores(SegmentReader.java:290)
    at org.apache.lucene.index.SegmentReader.openDocStores(SegmentReader.java:
    at org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:693)
    at org.apache.lucene.index.IndexWriter$ReaderPool.getReadOnlyClone(IndexWriter.java:642)
    at org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:
    at org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(ReadOnlyDirectoryReader.java:38)
    at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:455)
    at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:403)
    at org.apache.lucene.index.DirectoryReader.doReopenFromWriter(DirectoryReader.java:405)
    at org.apache.lucene.index.DirectoryReader.doReopen(DirectoryReader.java:
    at org.apache.lucene.index.DirectoryReader.reopen(DirectoryReader.java:
    at org.elasticsearch.index.engine.robin.RobinEngine.refresh(RobinEngine.java:

--
Andreas Bauer

Moviepilot GmbH | Mehringdamm 33 | 10961 Berlin | Germany | Tel: +49
30 616 512-0
Sitz der Gesellschaft: Berlin, Deutschland | Handelsregister:
Amtsgericht Berlin-Charlottenburg, HRB Nr. 107195 B | Geschäftsführer:
Tobias Bauckhage, Malte Cherdron


(Shay Banon) #2

Which version are you using for your tests, btw?

On Fri, Jul 29, 2011 at 4:31 PM, Andreas Bauer andreasb@moviepilot.com wrote:



(Andreas Bauer) #3

0.16.3


(Shay Banon) #4

ok, another question, were you passing any specific settings to the index
when performing the indexing request?

On Fri, Jul 29, 2011 at 5:51 PM, Andreas Bauer andreasb@moviepilot.com wrote:

0.16.3


(Andreas Bauer) #5

No, the index had the vanilla settings. We once tried changing
index.translog.flush_threshold_ops to 1000; that didn't have any
noticeable impact though.
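For the record, the change we tried was roughly this (0.16-era update-settings endpoint; the index name is taken from our logs):

```shell
# Flush the translog after every 1000 operations instead of the default:
curl -XPUT 'http://localhost:9200/sheldon-index/_settings' -d '{
  "index": { "translog.flush_threshold_ops": 1000 }
}'
```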

On Fri, Jul 29, 2011 at 5:07 PM, Shay Banon kimchy@gmail.com wrote:

ok, another question, were you passing any specific settings to the index
when performing the indexing request?
On Fri, Jul 29, 2011 at 5:51 PM, Andreas Bauer andreasb@moviepilot.com
wrote:

0.16.3

--
Andreas Bauer

Moviepilot GmbH | Mehringdamm 33 | 10961 Berlin | Germany | Tel: +49
30 616 512-0
Sitz der Gesellschaft: Berlin, Deutschland | Handelsregister:
Amtsgericht Berlin-Charlottenburg, HRB Nr. 107195 B | Geschäftsführer:
Tobias Bauckhage, Malte Cherdron


(system) #6