Out of Memory Error


(Doug Swanson) #1

We got an OOM error earlier this week, and it looks like some resource other than memory may be the problem. These boxes have 64GB of RAM, with 30GB allocated to the Java heap. At the time of the error they were all using around 30%-35% of the JVM heap.

All nodes reported a 65k limit on file descriptors both before and after the error and the cluster restart.

We are using the JDBC river plugin (yes, for now anyway), and its run seems to be the event that triggers the error ( jdbc-1.5.0.5-da4ba96 1.5.0.5 ).

Any clues, hints, suggestions are appreciated.

Thanks
-Doug

[2015-10-19 05:10:07,722][WARN ][index.engine             ] [es4] [sales][1] failed engine [out of memory (source: [maybe_merge])]
java.lang.OutOfMemoryError: unable to create new native thread
	at java.lang.Thread.start0(Native Method)
	at java.lang.Thread.start(Thread.java:714)
	at org.apache.lucene.index.ConcurrentMergeScheduler.merge(ConcurrentMergeScheduler.java:391)
	at org.elasticsearch.index.merge.EnableMergeScheduler.merge(EnableMergeScheduler.java:50)
	at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:1985)
	at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:1979)
	at org.elasticsearch.index.engine.InternalEngine.maybeMerge(InternalEngine.java:778)
	at org.elasticsearch.index.shard.IndexShard$EngineMerger$1.run(IndexShard.java:1241)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

GET /_nodes/process:

{
   "cluster_name": "es_prod",
   "nodes": {
      "uvieBMr3SXKu7BvKuWEJmQ": {
         "name": "es4",
         "version": "1.7.0",
         "build": "929b973",
         "http_address": "inet[/12.130.11.49:9200]",
         "process": {
            "refresh_interval_in_millis": 1000,
            "id": 13702,
            "max_file_descriptors": 65535,
            "mlockall": true
         }
      },
      "xnOltXOsS5eY7VAotvYiSg": {
         "name": "es2",
         "version": "1.7.0",
         "build": "929b973",
         "http_address": "inet[/12.130.11.47:9200]",
         "process": {
            "refresh_interval_in_millis": 1000,
            "id": 43023,
            "max_file_descriptors": 65535,
            "mlockall": true
         }
      },
      "imfqR95jSKOhErEN8nQk3w": {
         "name": "es3",
         "version": "1.7.0",
         "build": "929b973",
         "http_address": "inet[/12.130.11.48:9200]",
         "process": {
            "refresh_interval_in_millis": 1000,
            "id": 26569,
            "max_file_descriptors": 65535,
            "mlockall": true
         }
      }
   }
}

(Michael Salmon) #2

The JVM reports this out-of-memory error when it cannot create another native thread, which happens when the user running it hits the allowed number of processes; processes and threads are counted together, per user. You need to increase that limit before starting Elasticsearch.
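
A quick way to see how close you are, using the es4 PID 13702 from the /_nodes/process output above (just a rough check):

ps -o nlwp= -p 13702                                     # thread count of the es4 JVM
ps -eLf | awk '{print $1}' | sort | uniq -c | sort -rn   # thread count per user on the box

If the Elasticsearch user's thread count is close to its nproc limit, that is the resource being exhausted.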

/Michael


(Doug Swanson) #3

Is the value you're referring to from limits.conf?
ulimit reports: max user processes (-u) 514271


(Michael Salmon) #4

That's correct, but the value is set per user, so you have to check it for the user that runs ES.
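
For example, assuming the service runs as a user named elasticsearch (adjust to whatever account you actually use):

sudo -u elasticsearch bash -c 'ulimit -u'

Running ulimit from your own login or from root can report a very different number than what the Elasticsearch process actually gets.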

/Michael


(Jörg Prante) #5

You could also try launching a separate node (with node.data: false and node.master: false) for the JDBC river plugin, to move the JDBC processing workload away from the nodes doing the ES indexing.
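
A minimal sketch for that node's elasticsearch.yml (the node name is just an example):

node.name: river1
node.master: false
node.data: false

Such a node holds no shards and is never elected master, so the JDBC work does not compete with indexing and merging on the data nodes.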

Also, you could tweak the JDBC bulk data import and slow it down a bit, so the cluster is not overwhelmed.
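
As a rough sketch of a throttled river definition (the parameter names are from memory, so please check them against the README of your 1.5.0.5 build; the river name, URL and SQL are placeholders):

PUT /_river/my_jdbc_river/_meta
{
   "type": "jdbc",
   "jdbc": {
      "url": "jdbc:...",
      "user": "...",
      "sql": "...",
      "max_bulk_actions": 5000,
      "max_concurrent_bulk_requests": 1
   }
}

Smaller bulks and fewer concurrent bulk requests spread the indexing and merge pressure over more time.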

Another method is to streamline segment merging, but this is an advanced topic. Instead, you could simply add nodes until the segment merge errors are gone; that is easy.
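
If you do want to look at the advanced route, the main knob is the merge scheduler thread count in elasticsearch.yml (a sketch; 1 is the usual suggestion for spinning disks):

index.merge.scheduler.max_thread_count: 1

Fewer concurrent merge threads also means fewer native threads the JVM has to create, which is exactly the resource that ran out in your stack trace.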


(Doug Swanson) #6

Sorry, I should have added that part... I did check that. There are no user-specific settings for the process count in limits.conf, so the value I was getting is the default and should apply to all users, right?


(Michael Salmon) #7

It's hard to believe that ordinary users can spawn 514271 processes and threads. On the systems I use (RHEL & CentOS) the default is 1024.

/Michael


(Doug Swanson) #8

I agree the number is crazy large, but it's NOT defined in limits.conf, and that is what ulimit reports (CentOS 6.5), so it would seem that limit is not the underlying problem.


(Doug Swanson) #9

Jörg-

How would I go about slowing down the river? Seems like the low-hanging fruit among the choices...

Thanks
-Doug


(Michael Salmon) #10

Look in /etc/security/limits.d/90-nproc.conf:

*          soft    nproc     1024
root       soft    nproc     unlimited

This file is part of the pam RPM.
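
If you want to keep the distribution default and only raise the limit for Elasticsearch, a user-specific entry (which normally overrides the * default) is enough. A sketch, assuming the service user is called elasticsearch and 4096 suits your workload:

elasticsearch    soft    nproc    4096
elasticsearch    hard    nproc    4096

Putting it in its own file such as /etc/security/limits.d/91-elasticsearch.conf keeps it separate from the packaged defaults; Elasticsearch then has to be restarted from a fresh session to pick it up.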

/Michael


(Doug Swanson) #11

Yep, there it is, thanks! I didn't realize that ulimits were being set by two different config files.

I was able to verify the process picked up the new limits. Appreciate the help!
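
In case it helps anyone later: one way to confirm that a running JVM actually got the new limits is to read them straight from /proc, using the PID from the /_nodes/process output above:

cat /proc/13702/limits | grep 'Max processes'

That prints the soft and hard limits the process is currently running with.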

Thanks
-Doug

