Elasticsearch suddenly stops working

Hi,

I've recently upgraded my 3-node cluster from Elasticsearch 1.5 to 2.4.2.
Since the upgrade (I think), I experience random shutdowns of the Elasticsearch service: about once a day (at no specific hour), one of the nodes stops working, each time a different node.
I turned on DEBUG logging for the root logger, but there's nothing interesting there; the log just stops with:

recalculating shard indexing buffer, total is [815.8mb] with [2] active shards, each shard set to indexing=[407.9mb], translog=[64kb]

I looked at the syslog, and there was nothing interesting there either.

I'm running on m4.xlarge EC2 machines with 16GB RAM.
Settings in elasticsearch.yml that are not the default:
bootstrap.memory_lock: true
indices.fielddata.cache.size: 75%
indices.breaker.fielddata.limit: 85%
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 5

and in /etc/default/elasticsearch:
ES_HEAP_SIZE=8g
MAX_LOCKED_MEMORY=unlimited
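A quick way to confirm that the memory lock actually takes effect is to check the node process info (this assumes the default HTTP port 9200 on localhost; look for "mlockall" : true in the output):

# show process-level node info, including whether mlockall succeeded
curl -s 'localhost:9200/_nodes/process?pretty'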

Any ideas?

There's nothing in the logs other than that? What about on the master node?

That is not a good idea.

The master node log just shows that it has lost one of the nodes, nothing else.

Why is setting the field data cache size and the field data breaker to 75% and 85% respectively not a good idea? Can this be the problem?
This setting was recommended here: but that advice might have been good for ES 1.x and not for 2.x, which uses doc values by default?

What OS are you on? What JVM?

Basically you are allocating 75% of your heap to cache; that's seriously inefficient.

Plus it's 2 years old....

OS: Ubuntu 14.04
JVM: we used to run OpenJDK 7, which is where we saw most of the issues, but I recently upgraded to 8.
As I understood it, the fielddata cache size is unbounded by default, which may cause OOM.
Capping it at 75% means that ES will evict data before reaching OOM; it doesn't actually mean we're allocating 75% of the heap for the cache.
Am I getting this wrong?
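As a side note, a quick way to see how much fielddata is actually loaded per node (this assumes the default HTTP port 9200 on localhost):

# list fielddata memory usage per node and per field via the cat API
curl -s 'localhost:9200/_cat/fielddata?v'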

In any case, which value do you recommend for a write-heavy cluster?

Thanks.

If you are write-heavy then caching field data isn't that much of a concern.

Anyway, there's nothing in the logs on the nodes, either ES or the OS? That seems super unusual. Is the node itself being restarted, or does the service stop?
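One thing worth ruling out (an assumption on my part, not something your logs confirm): if the kernel OOM killer terminated the JVM, Elasticsearch would die without writing anything to its own log, but the kernel would record it in syslog, e.g.:

# look for kernel OOM-killer activity around the time a node died
grep -iE 'out of memory|killed process' /var/log/syslog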

The service simply stops. I eventually wrote a cron job that checks its status every minute and starts the service again if it's stopped (roughly the idea sketched below). I know this is not a good solution; it was more of a quick patch. I also added an extra node, and we haven't had the service stop in the past 3 days.
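For reference, a minimal sketch of the kind of watchdog entry I mean (the actual job isn't shown here, and the paths assume the stock Ubuntu init script):

# hypothetical root crontab entry: if the status check fails, start the service again
# (full paths because cron's default PATH usually excludes /usr/sbin)
* * * * * /usr/sbin/service elasticsearch status >/dev/null 2>&1 || /usr/sbin/service elasticsearch start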

I looked at /var/log/elasticsearch/<CLUSTER_NAME>.log and in /var/log/syslog and saw nothing special.

Regarding the field data cache size: while this is a write-heavy cluster, we do have a few reads as well, and I don't want a node to stop due to an OOM from one big memory-hog query, so keeping a field data cache limit seems appropriate. What value do you suggest setting for this?

Thank you,
Eldad

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.