Out of Memory Error

We got an OOM error earlier this week, and it looks like some resource other than memory may be the problem. These boxes have 64 GB of RAM, with 30 GB allocated to the Java heap. At the time of the error they were all using around 30-35% of the JVM's heap.

All nodes reported a 65k file descriptor limit before and after the error and the cluster restart.
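(For reference, heap and file descriptor usage per node can be pulled from the node stats API, something along these lines:

curl -s 'http://localhost:9200/_nodes/stats/jvm,process?pretty'

jvm.mem.heap_used_percent and process.open_file_descriptors are the relevant fields.)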

We are using the JDBC river plugin (yes, for now anyway), and the river run seems to be what triggers it. ( jdbc-1.5.0.5-da4ba96 1.5.0.5 )

Any clues, hints, suggestions are appreciated.

Thanks
-Doug

[2015-10-19 05:10:07,722][WARN ][index.engine             ] [es4] [sales][1] failed engine [out of memory (source: [maybe_merge])]
java.lang.OutOfMemoryError: unable to create new native thread
	at java.lang.Thread.start0(Native Method)
	at java.lang.Thread.start(Thread.java:714)
	at org.apache.lucene.index.ConcurrentMergeScheduler.merge(ConcurrentMergeScheduler.java:391)
	at org.elasticsearch.index.merge.EnableMergeScheduler.merge(EnableMergeScheduler.java:50)
	at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:1985)
	at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:1979)
	at org.elasticsearch.index.engine.InternalEngine.maybeMerge(InternalEngine.java:778)
	at org.elasticsearch.index.shard.IndexShard$EngineMerger$1.run(IndexShard.java:1241)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

GET /_nodes/process:

{
   "cluster_name": "es_prod",
   "nodes": {
      "uvieBMr3SXKu7BvKuWEJmQ": {
         "name": "es4",
         "version": "1.7.0",
         "build": "929b973",
         "http_address": "inet[/12.130.11.49:9200]",
         "process": {
            "refresh_interval_in_millis": 1000,
            "id": 13702,
            "max_file_descriptors": 65535,
            "mlockall": true
         }
      },
      "xnOltXOsS5eY7VAotvYiSg": {
         "name": "es2",
         "version": "1.7.0",
         "build": "929b973",
         "http_address": "inet[/12.130.11.47:9200]",
         "process": {
            "refresh_interval_in_millis": 1000,
            "id": 43023,
            "max_file_descriptors": 65535,
            "mlockall": true
         }
      },
      "imfqR95jSKOhErEN8nQk3w": {
         "name": "es3",
         "version": "1.7.0",
         "build": "929b973",
         "http_address": "inet[/12.130.11.48:9200]",
         "process": {
            "refresh_interval_in_millis": 1000,
            "id": 26569,
            "max_file_descriptors": 65535,
            "mlockall": true
         }
      }
   }
}

The JVM reports out of memory ("unable to create new native thread") when it exceeds the number of processes allowed for that user. Processes and threads are counted together, per user. You need to increase the value before starting Elasticsearch.
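A quick way to see how close you are is to compare the thread count of the ES process with the per-user limit, e.g. for the es4 node (PID 13702 from your node listing; <es_user> is whatever account runs Elasticsearch):

grep Threads /proc/13702/status      # threads inside the ES process
ps -L -u <es_user> | wc -l           # rough count of all threads/processes for that user
ulimit -u                            # per-user limit in the current shell

When the second number gets near the third, thread creation fails with exactly this error.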

/Michael

Is the value you're referring to from limits.conf?
ulimit reports: max user processes (-u) 514271

That's correct. The value is set per user, though, so you have to check it for the user that runs ES.
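For example (assuming the service runs as an 'elasticsearch' user; adjust to your setup):

sudo -u elasticsearch sh -c 'ulimit -u'

ulimit in a root shell can report a completely different value than what the ES user actually gets.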

/Michael

You could also try launching a separate node (with node.data: false and node.master: false) for the JDBC river plugin, to move the JDBC processing workload off the nodes doing the indexing.
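A minimal elasticsearch.yml for such a node could look roughly like this (node name is just an example, cluster name taken from your output):

cluster.name: es_prod
node.name: es_river
node.master: false
node.data: false

The river then does its SQL fetching and document building on that node and only ships bulk requests to the data nodes.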

Also, you could tweak the JDBC bulk data import and slow it down a bit, so the cluster is not overwhelmed.
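From memory, the relevant river settings are the bulk size and concurrency; please check the exact parameter names and placement against the plugin README for your 1.5.0.5 build, but roughly:

{
  "type": "jdbc",
  "jdbc": {
    "url": "jdbc:...",
    "sql": "...",
    "max_bulk_actions": 1000,
    "max_concurrent_bulk_requests": 1
  }
}

Smaller bulks and a single concurrent bulk request create far fewer threads and much less merge pressure at the same time.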

Another method is to tune segment merging, but this is an advanced topic. Instead, you could add nodes until the segment merge errors are gone; that's the easy route.
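For reference, the merge-related settings in 1.7 are along these lines (the numbers are starting points to experiment with, not recommendations):

index.merge.scheduler.max_thread_count: 1
indices.store.throttle.type: merge
indices.store.throttle.max_bytes_per_sec: 20mb

Fewer merge threads means fewer native threads competing for the per-user limit, at the price of slower merging.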

Sorry, I should have added that part... I did check that. There are no user-specific settings for the process count in limits.conf, so the value I was getting is the default and should apply to all users, right?

It's hard to believe that ordinary users can spawn 514271 processes and threads. On the systems I use (RHEL & CentOS) the default is 1024.

/Michael

I agree the number is crazy large, but it's NOT defined in limits.conf, and that is what ulimit reports (CentOS 6.5), so it would seem that limit is not the underlying problem.

Jörg-

How would I go about slowing down the river? It seems like the low-hanging fruit among the choices...

Thanks
-Doug

Look in /etc/security/limits.d/90-nproc.conf:

*          soft    nproc     1024
root       soft    nproc     unlimited

This file is part of the pam RPM.
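To raise it for ES you can either edit that line or drop an override in the same directory; the file name, user name and value below are only an example:

# /etc/security/limits.d/91-elasticsearch.conf
elasticsearch    soft    nproc    4096
elasticsearch    hard    nproc    4096

Then restart Elasticsearch so the new limit is picked up.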

/Michael

Yep, there it is, thanks! I didn't realize that ulimits were being set by two different config files.

I was able to verify the process picked up the new limits. Appreciate the help!
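(For anyone else hitting this, the check that the running process actually has the new limit is to look at /proc directly, e.g.:

cat /proc/<es_pid>/limits | grep -i 'max processes'

where <es_pid> is the Elasticsearch process id; the 'Max processes' line shows the current soft/hard values.)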

Thanks
-Doug