Migration from array documents to parent/child causes 'too many open files' error

I'm using 0.19.11 server-side and pyes 0.19.1 client-side (from https://github.com/aparo/pyes).

For the dev/debug/test cycle, I'm using an EC2 instance with 7GB memory and HEAP_SIZE=4GB. ulimit shows 350,000. 1 replica and 5 shards.

I have two application codebases - one is our current production version, where child documents are held in an array field (the nested type is not really needed). This production codebase has been running flawlessly for the past 4 months. I am investigating replacing the array field with parent/child because the array approach is proving untenable.

The data set used for this investigation is the same - 2 indices: indexA with 280 parents and 10,000 children, and indexB with 300 parents and 5,000 children.

The production codebase rebuilt the 2 indices without any issues as expected.

For the most part, the new parent/child codebase was ready, and testing started this week. For manual debugging, I started building small subsets of indexA and indexB - e.g. 5 parents/10 children each - and those built just fine. However, I was unable to build both test indices fully - see the error below. Our production corpus has millions of parents and children.

From reading parent/child threads here, I understand the need to load both parent and child IDs into memory, but how are parent/child documents mapped to file descriptors? Are they?
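In case it helps, here is how I've been checking how many files each shard's Lucene index currently keeps on disk (my understanding is that these segment files, not the parent/child documents themselves, are what consume descriptors - the `.tvx` file in the error below is a term-vector index file). The data path is taken from the stack trace; adjust it to your installation:

```shell
# Count the files in each shard's Lucene index directory. Each segment
# is made up of several files, and every open segment file holds a
# descriptor, so many small unmerged segments during heavy indexing
# can exhaust a low ulimit.
DATA=/opt/lfops/ops/elasticsearch/parts/elasticsearch/data/livefyre/nodes/0/indices
for shard in "$DATA"/*/*/index; do
  [ -d "$shard" ] || continue
  printf '%s: %s files\n' "$shard" "$(ls "$shard" | wc -l)"
done
```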

Has anyone else seen a similar issue? Does anyone have a deployment with millions of parents/children?

Thanks.

--- error msg ---

org.elasticsearch.index.engine.IndexFailedEngineException: [test_conversations_3][2] Index failed for [conversation#94]
at org.elasticsearch.index.engine.robin.RobinEngine.index(RobinEngine.java:499)
at org.elasticsearch.index.shard.service.InternalIndexShard.index(InternalIndexShard.java:320)
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:158)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:532)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:430)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.FileNotFoundException: /opt/lfops/ops/elasticsearch/parts/elasticsearch/data/livefyre/nodes/0/indices/test_conversations_3/2/index/_1h.tvx (Too many open files)

Can you check your process open file list with lsof? This would give some
insight.
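
For example, something like this (for a self-contained demo the PID below is the shell's own `$$`; substitute your Elasticsearch JVM's real PID):

```shell
# How many descriptors does a process currently hold?
# PID=$$ inspects this shell itself so the snippet runs standalone;
# point it at the Elasticsearch process in practice.
PID=$$

# Portable count straight from /proc (Linux):
ls "/proc/$PID/fd" | wc -l

# lsof shows the same descriptors with file paths, which makes it easy
# to spot whether Lucene segment files dominate:
#   lsof -p "$PID" | awk '{print $NF}' | sort | uniq -c | sort -rn | head
```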

Jörg

--

In another thread, Shay mentioned setting es.max-open-files=true to see the actual maximum. After wandering around trying to find the right file to set this in, I finally figured out it's bin/service/elasticsearch.conf, and I set it like so:

wrapper.java.additional.2=-Des.max-open-files=true

On ES restart, I see this new log msg:
[2012-11-19 05:16:06,675][INFO ][bootstrap ] max_open_files [1003]
[2012-11-19 05:16:10,416][INFO ][node ] [Antiphon the Overseer] {0.19.11}[1120]: initializing ...
[2012-11-19 05:16:10,423][INFO ][plugins ] [Antiphon the Overseer] loaded [], sites [bigdesk, head]
[2012-11-19 05:16:13,663][INFO ][node ] [Antiphon the Overseer] {0.19.11}[1120]: initialized

So, in some way, the mystery is explained though not solved - the ulimit -n setting was not sticking.

Next, I will need to figure out how to set ulimit properly. Any pointers would be appreciated.

When you set ulimit -n from your shell, I think that you have to reconnect to a new shell before launching Elasticsearch.
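
One way to double-check before launching, from that shell itself (this just raises the soft limit to whatever the hard limit already allows; the hard limit itself has to be set elsewhere, e.g. in /etc/security/limits.conf):

```shell
# Raise the soft file-descriptor limit up to the current hard limit in
# the shell that will launch Elasticsearch; child processes inherit it.
ulimit -S -n "$(ulimit -H -n)"
ulimit -n   # verify the new soft limit before running bin/elasticsearch
```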

HTH

David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

On 19 Nov 2012, at 06:25, es_learner dave@livefyre.com wrote:

--
View this message in context: http://elasticsearch-users.115913.n3.nabble.com/Migration-from-array-documents-to-parent-child-causes-too-many-open-files-error-tp4025654p4025683.html
Sent from the ElasticSearch Users mailing list archive at Nabble.com.

--

So, in some way, mystery explained though not solved - the ulimit -n was not sticking.

Next, I will need to figure out how to set ulimit properly. Any pointers would be appreciated.

Calling ulimit is not enough to raise the file limits; see the Elasticsearch documentation on file descriptors
for a guide and explanation.
See https://github.com/karmi/cookbook-elasticsearch/blob/master/recipes/default.rb#L67-L93
for a programmatic configuration with Opscode Chef.
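
The guide boils down to raising the limit system-wide rather than per-shell. A minimal sketch for a Linux box using PAM limits (the user name and values are assumptions; adjust for your distro and needs):

```conf
# /etc/security/limits.conf -- raise the per-user descriptor limit for
# the user that runs Elasticsearch (user name "elasticsearch" is an
# example, substitute yours):
elasticsearch soft nofile 64000
elasticsearch hard nofile 64000

# The relevant file in /etc/pam.d/ (e.g. common-session, or su if you
# start the node via su) must load pam_limits for this to take effect:
session required pam_limits.so
```

After that, log in again and confirm with `ulimit -n` before restarting the node; the max_open_files line in the bootstrap log should then show the new value.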

Karel

--