Data-only node keeps crashing with OOM error


Hello, we have our cluster currently set up across 2 machines: one machine runs a data+master node, and the other runs a data-only node plus 2 master-only nodes. The data-only node and the data+master node are each configured with a reserved Java heap size of 32 GB; the 2 master nodes each have a 16 GB reserved heap. Pretty frequently, maybe every hour or so, the data-only node fails. When it fails, I see the error:

[2015-08-12 14:33:50,813][WARN ][] An exception was thrown by ChannelFutureListener.
java.lang.OutOfMemoryError: unable to create new native thread

I'm not sure why this happens: fielddata is using about 6 GB, and heap usage is at about 50%. I see a lot of other error messages, but they all boil down to the same OutOfMemoryError. I can't figure out why this node is crashing, especially since its configuration is essentially the same as the data+master node's, and if anything that node should be using more memory than the data-only node! Can anybody help?

(Mike Simos) #2

Try setting indices.fielddata.cache.size to 20% and see if that stabilizes things.
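For reference, a minimal sketch of applying that suggestion (the config path and the restart step are assumptions about a default package install; the setting is static, so the node must be restarted for it to take effect):

```shell
# Sketch: cap the fielddata cache at 20% of heap, per the suggestion
# above. Path below is the default package location (an assumption
# about this install); a node restart is required afterwards.
echo 'indices.fielddata.cache.size: 20%' | sudo tee -a /etc/elasticsearch/elasticsearch.yml
sudo service elasticsearch restart
```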

You may want to lower the heap below 32 GB, as around that point the JVM loses the ability to use compressed object pointers (compressed oops), roughly doubling the size of every object pointer. There's a writeup explaining why.

Try using 30.5GB and see if you see any effect as well.

(Magnus Bäck) #3

> Try using 30.5GB and see if you see any effect as well.

Is there a reference for the 30.5 GB number? I've always believed that as long as the heap is below 32 GB, compressed pointers will be used, and I've been using 31 GB to be on the safe side.


Thanks, I'll try setting the fielddata cache size. I did enable doc_values, though, and that seems to have stabilized the data+master node, but the data-only node is still unstable. I misspoke earlier: in the sysconfig/elasticsearch file I set ES_HEAP_SIZE to 31g, but when running top it shows the reserved size as 32 GB. As long as it's below 32 GB, shouldn't it still use the smaller pointers? And since the config for the data+master and the data-only node is the same, why would one be unstable but not the other? Both machines are identical, and in fact there is more load on the data+master machine, since all our logstash instances and redis run on it as well.

(Nils Helge Garli Hegvik) #5

I don't think this is due to lack of memory. The error message indicates that your application is starting too many threads, or that you have hit the process limit for the user running the process. Check with ulimit -a and look at the settings for max user processes and open file descriptors.


Thanks for the tip. When I run ulimit -a as root, max user processes is 1031433. That seems pretty high, so I don't imagine it's the issue? Elasticsearch runs as the elasticsearch user; does the max user processes value differ per user? And how do I check the value for Elasticsearch, since it's not a valid login?
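Since limits are applied per process, one way to inspect the running node directly without needing a login shell is to read them from /proc (the pgrep pattern is an assumption about how the service was launched; the snippet falls back to the current shell if no match is found):

```shell
# Read the effective limits of the running Elasticsearch process.
# If no matching process is found, fall back to the current shell ($$)
# so the command still prints something inspectable.
ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch | head -n1)
grep -E 'Max processes|Max open files' "/proc/${ES_PID:-$$}/limits"
```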

edit: the very first error message I see is

[logstash-firewall-2015.08.13][0] failed engine [out of memory]
java.lang.OutOfMemoryError: unable to create new native thread

would this point to the process limit?

(Mike Simos) #7

Also check ulimit -n (open files). If it's set too low, it may prevent creating another thread.
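If the elasticsearch user's limits do turn out to be low, a common place to raise them is /etc/security/limits.conf (a sketch with illustrative values, not tuned recommendations; this applies when the service starts through PAM-aware init, and may be ignored under systemd, where the unit file's limits take precedence):

```shell
# Illustrative values only: raise open-file and process limits for the
# elasticsearch user, then restart the service so they take effect.
cat <<'EOF' | sudo tee -a /etc/security/limits.conf
elasticsearch  soft  nofile  65536
elasticsearch  hard  nofile  65536
elasticsearch  soft  nproc   4096
elasticsearch  hard  nproc   4096
EOF
```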


Look at max_file_descriptors. I don't think Elasticsearch reports the max processes; you'll need to change the shell for the elasticsearch user (chsh) and then run su -l elasticsearch -c "ulimit -a"

(system) #8