Elasticsearch suddenly stops working

Hi,

I've recently upgraded my 3-node cluster from Elasticsearch 1.5 to 2.4.2.
Since the upgrade (I think), I experience random shutdowns of the Elasticsearch service: about once a day (at no specific hour), one of the nodes stops working, each time a different node.
I turned on DEBUG logging for the root logger, but there's nothing interesting there; the log just stops with:

recalculating shard indexing buffer, total is [815.8mb] with [2] active shards, each shard set to indexing=[407.9mb], translog=[64kb]

I looked at the syslog, and there was nothing interesting there either.

I'm running on m4.xlarge EC2 machines with 16GB RAM.
Settings in elasticsearch.yml that are not the default:
bootstrap.memory_lock: true
indices.fielddata.cache.size: 75%
indices.breaker.fielddata.limit: 85%
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 5

and in /etc/default/elasticsearch:
ES_HEAP_SIZE=8g
MAX_LOCKED_MEMORY=unlimited
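A quick way to confirm that the memory lock actually takes effect is to check the node process info (this assumes the default HTTP port 9200 on localhost; look for "mlockall" : true in the output):

# show process-level node info, including whether mlockall succeeded
curl -s 'localhost:9200/_nodes/process?pretty'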

Any ideas?

There's nothing in the logs other than that? What about on the master node?

That is not a good idea.

The master node log just shows that it has lost one of the nodes, nothing else.

Why is setting the field data cache size and the field data breaker to 75% and 85% respectively not a good idea? Can this be the problem?
This setting was recommended here: but that advice might have been good for ES 1.x and not for 2.x, which uses doc values by default?

What OS are you on? What JVM?

Basically you are allocating 75% of your heap to cache; that's seriously inefficient.

Plus it's 2 years old....

OS: Ubuntu 14.04
JVM: we used to run OpenJDK 7, which is where we saw most of the issues, but I recently upgraded to 8.
As I understood it, the fielddata cache size is unbounded by default, which may cause OOM.
Capping it at 75% means that ES will evict data before reaching OOM; it doesn't actually mean we're allocating 75% of the heap for the cache.
Am I getting this wrong?
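As a side note, a quick way to see how much fielddata is actually loaded per node (this assumes the default HTTP port 9200 on localhost):

# list fielddata memory usage per node and per field via the cat API
curl -s 'localhost:9200/_cat/fielddata?v'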

In any case, which value do you recommend for a write-heavy cluster?

Thanks.

If you are write-heavy then caching field data isn't that much of a concern.

Anyway, there's nothing in the logs on the nodes, either ES or the OS? That seems super unusual. Is the node itself being restarted, or does the service stop?
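One thing worth ruling out (an assumption on my part, not something your logs confirm): if the kernel OOM killer terminated the JVM, Elasticsearch would die without writing anything to its own log, but the kernel would record it in syslog, e.g.:

# look for kernel OOM-killer activity around the time a node died
grep -iE 'out of memory|killed process' /var/log/syslog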

The service simply stops. I eventually wrote a cron job that checks its status every minute and starts the service again if it's stopped (roughly the idea sketched below). I know this is not a good solution; it was more of a quick patch. I also added an extra node, and we haven't had the service stop in the past 3 days.
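For reference, a minimal sketch of the kind of watchdog entry I mean (the actual job isn't shown here, and the paths assume the stock Ubuntu init script):

# hypothetical root crontab entry: if the status check fails, start the service again
# (full paths because cron's default PATH usually excludes /usr/sbin)
* * * * * /usr/sbin/service elasticsearch status >/dev/null 2>&1 || /usr/sbin/service elasticsearch start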

I looked at /var/log/elasticsearch/<CLUSTER_NAME>.log and in /var/log/syslog and saw nothing special.

Regarding the field data cache size: while this is a write-heavy cluster, we do have a few reads as well, and I don't want a node to stop due to an OOM from one big memory-hog query, so keeping a field data cache limit seems appropriate. What value do you suggest setting for this?

Thank you,
Eldad

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.