ES becomes unresponsive!

Hi,

We're using the latest ES version, 2.3.4, on a 2-node cluster. Every 7-8 hours Elasticsearch hangs: telnetting to the ES port gets stuck at "Trying ...", and to make it stable again we have to kill the Java process and start Elasticsearch manually, because 'service elasticsearch restart' doesn't bring it back and the node stays unresponsive. These are the log entries we usually see before the hang occurs:

http://pastebin.com/Nx4C6ebJ
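
If it helps, next time it hangs we can capture a thread dump, which often shows where the JVM is stuck even when nothing reaches the logs. A minimal sketch of what we'd run, assuming the 2.x bootstrap class name and that ES runs as the elasticsearch user:

# Find the ES PID (assumes the 2.x main class appears on the command line)
ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)
# Take a thread dump as the same user the JVM runs as
sudo -u elasticsearch jstack "$ES_PID" > /tmp/es-threads.txt
# If the JVM no longer responds, force the dump (slower, last resort)
sudo -u elasticsearch jstack -F "$ES_PID" > /tmp/es-threads-forced.txt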

We have one master node and one data node. Both servers have 64 GB of memory, 30 GB of which is allocated to the Java heap. Please let me know if you need any more info.
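
As an aside, since the heap is close to 32 GB: above roughly 32 GB HotSpot disables compressed object pointers, so heaps just under that line are worth a sanity check. A quick check with the same JVM that runs ES:

# Prints whether compressed oops are still in effect at this heap size
java -Xmx30g -XX:+PrintFlagsFinal -version | grep -i UseCompressedOops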

There's not enough in your logs to help; we'd need to see more.

Thanks for responding. What we've found is that while ES is hung, nothing is appended to the logs; the whole service seems to become unresponsive and unable to write any log entries.

How are we supposed to troubleshoot if no logs are produced during the issue? :( No swap is used, and we can't find anything else suspicious either.

Does it look more like an OS issue?

It's hard to say what it looks like, as there is very little info to go on.

What does your config look like? What do the logs, whatever you have, look like?
What OS?

Both nodes are now data+master nodes. Here are the configs we added:

bootstrap.mlockall: true
indices.fielddata.cache.size: 60%
indices.breaker.fielddata.limit: 70%
index.max_result_window: 50000
Java heap size is set to 31G out of 64G.
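
One way to confirm bootstrap.mlockall actually took effect is the nodes info API:

# "mlockall" : true in the process section means the heap is locked in RAM
curl -s 'localhost:9200/_nodes/process?pretty' | grep mlockall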

# The following queue sizes are configured across the cluster:

"threadpool.search.queue_size" : 20000
"threadpool.index.queue_size" : 10000

#/etc/security/limits.conf:

*               soft    nofile  700000
*               hard    nofile  900000

elasticsearch soft memlock unlimited
elasticsearch hard memlock unlimited
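
A quick way to verify these limits actually reach the running process (the pgrep pattern assumes the 2.x main class):

# Limits as seen by the live ES process
cat /proc/$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)/limits
# Open files and max locked memory as the elasticsearch user sees them
sudo -u elasticsearch bash -c 'ulimit -n -l'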

We have an auto-restart service check that triggers every 10 minutes to restart ES if it goes down. According to server time, ES was restarted again around 10:50, and you can see there are no logs just before 10:50; the last entry was logged around 8:00.

I can attach the recent logfile if you want.

We're really being affected by this issue and are in need of help :frowning:

All of these settings are a Very Bad Idea and are likely putting pressure on things.
If you think you needed to make these changes because of these problems, you probably just need more nodes or less data. But again, it's hard to say.

This is a community-based forum; people will offer assistance as best they can.

Thanks for the response. We've now removed those values from the ES nodes and enabled GC logging, and we're seeing lots of GC allocation failures here:
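
For reference, by GC logging we mean the standard HotSpot (Java 8) flags, passed here via ES_JAVA_OPTS; the log path is just an example:

# Standard HotSpot GC logging flags; adjust the path as needed
export ES_JAVA_OPTS="-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -Xloggc:/var/log/elasticsearch/gc.log"

Note that "Allocation Failure" as the stated GC cause is the normal trigger for a minor collection, not an error by itself; the signals worth worrying about are long pauses and frequent full GCs.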

One more question: will removing the fielddata and breaker values from elasticsearch.yml revert them to the defaults, or should I set the default values explicitly as well?

We've found another thing: sometimes a large .hprof file is created under the /usr/share/elasticsearch directory when ES becomes unresponsive. From googling, this file is created when the JVM crashes. We've now updated ES to the latest 2.3.5 version, and the current Java version is:

openjdk version "1.8.0_101"
OpenJDK Runtime Environment (build 1.8.0_101-b13)
OpenJDK 64-Bit Server VM (build 25.101-b13, mixed mode)
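
Those .hprof files are JVM heap dumps; ES's startup script typically sets -XX:+HeapDumpOnOutOfMemoryError, so a dump appearing when the node dies points at the heap filling up rather than a plain crash. A minimal sketch for inspecting one with the jhat tool that ships with JDK 8 (the dump filename is hypothetical):

# Serves a browsable view of the heap dump on port 7000
jhat -J-Xmx8g /usr/share/elasticsearch/java_pid12345.hprof

Eclipse MAT handles large dumps better, but jhat works for a first look.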