Data-only node keeps crashing with OOM error

Hello, we have our cluster set up across 2 machines: one machine runs a single data+master node, and the other runs a data-only node plus 2 master-only nodes. The data-only node and the data+master node are each configured with a reserved Java heap size of 32 GB, and the 2 master nodes each have a 16 GB reserved heap. The data-only node fails pretty regularly, maybe every hour or so. When it fails, I see the error:

[2015-08-12 14:33:50,813][WARN ][netty.channel.DefaultChannelFuture] An exception was thrown by ChannelFutureListener.
java.lang.OutOfMemoryError: unable to create new native thread

I'm not sure why this happens: I can see that about 6 GB is used for fielddata and that the used heap is at roughly 50%. I see a lot of other error messages, but they all boil down to the same OutOfMemoryError. I can't figure out why this node is crashing, especially since its configuration is essentially the same as the data+master node's, and if anything that node should be using more memory than the data-only node! Can anybody help?

Try setting indices.fielddata.cache.size to 20% and see if that stabilizes things.

https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-fielddata.html#fielddata-monitoring
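For reference, that's a static node-level setting, so it goes in elasticsearch.yml (the stock location is usually /etc/elasticsearch/elasticsearch.yml, adjust for your install) and the node needs a restart to pick it up:

indices.fielddata.cache.size: 20%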

You may want to lower the heap to below 32 GB; right around that point the JVM can no longer use compressed object pointers (compressed oops), so every pointer doubles in size. You can read about why here:

https://www.elastic.co/guide/en/elasticsearch/guide/current/heap-sizing.html#compressed_oops

Try using 30.5 GB and see if that has any effect as well.

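If you want to double-check where that cutoff is on your own JVM, a quick sketch using standard HotSpot flags (run it with the same java binary Elasticsearch uses):

java -Xmx30500m -XX:+PrintFlagsFinal -version | grep UseCompressedOops
java -Xmx32g -XX:+PrintFlagsFinal -version | grep UseCompressedOops

The first should print UseCompressedOops = true and the second false; the largest -Xmx that still prints true is the practical ceiling on that machine.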

Is there a reference for the 30.5 GB number? I've always believed that as long as the heap is under 32 GB, compressed pointers would be used, and I've been using 31 GB to be on the safe side.

Thanks, I'll try setting the fielddata cache size. I did enable doc_values, though, and that seems to have stabilized the data+master node, but the data-only node is still unstable. I misspoke earlier: in the sysconfig/elasticsearch file I set ES_HEAP_SIZE to 31G, but top shows the reserved size as 32G. As long as it's below 32 GB, shouldn't it still use the smaller pointers? And since the config for the data+master and the data-only node is the same, why would one be unstable but not the other? Both machines are identical, and in fact there is more load on the data+master machine, since all the logstash instances as well as redis run on it.
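Is there a way to confirm what the JVM actually got? I assume the number top shows covers more than just the heap (thread stacks, direct buffers, and so on), so maybe something like this is a better check (just a sketch, assuming the node is reachable on localhost:9200):

curl 'http://localhost:9200/_nodes/jvm?pretty&human'

and then looking at jvm.mem.heap_max for each node.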

I don't think this is due to lack of memory. The error message indicates that your application is starting too many threads, or that you have reached the process limit for the user running the process. Check with ulimit -a and look at the settings for max user processes and open files (file descriptors).
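If it's easier, you can also read the limits that are actually applied to the running process straight out of /proc. A rough sketch; the pgrep pattern is just an assumption about how the java process shows up on your box:

ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)
grep -E 'processes|open files' /proc/$ES_PID/limits

That shows the "Max processes" and "Max open files" values the node is really running with, whatever root's ulimit happens to say.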

Thanks for the tip. Taking a look, when I run ulimit -a as root I see that max user processes is 1031433. That seems pretty high, so I don't imagine it's the issue? Elasticsearch runs as the elasticsearch user, though; does that max user processes value differ per user? And how do I check the value for the elasticsearch user, since it's not a valid login?

edit: the very first error message I see is

[logstash-firewall-2015.08.13][0] failed engine [out of memory]
java.lang.OutOfMemoryError: unable to create new native thread

Would this point to the process limit?
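I suppose I could also count how many threads the node has when it starts failing and compare that to the limit, with something like this (the pgrep pattern is my guess at how the process shows up):

ps -o nlwp= -p $(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)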

Also check ulimit -n (open files). If it's set too low, it may prevent another thread from being created.

http://localhost:9200/_nodes/process?pretty&human

Look at max_file_descriptors. I don't think Elasticsearch reports the max processes; to check that you'll need to change the shell for the elasticsearch user (chsh) and then run su -l elasticsearch -c "ulimit -a".
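If you'd rather not change the login shell, you can point su at a shell explicitly instead; a sketch (run as root):

su -s /bin/bash -l elasticsearch -c 'ulimit -a'

That runs ulimit -a as the elasticsearch user even though its login shell isn't valid, so you can see the limits that actually apply to that account.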