First steps troubleshooting ES cluster crashes?

Hi All,

We've been running a three-node ELK cluster on Elasticsearch 5.5.2 since last fall with no trouble until last week. Then on Friday evening two nodes dumped their heaps, which filled the root volumes and locked both nodes up completely. I resized the VMs' root volumes and started everything back up, and it's been bumpy ever since.

All three nodes periodically bog down with high load averages in short bursts, sometimes as high as 350+, but never for more than five minutes. I haven't had much luck catching the problem while it's happening, but today the whole cluster locked up even though the loads looked reasonable.

If I run "top -p <PID>" with the main Elasticsearch process PID and then hit "H", I can see a slew of child Java threads, with one camped at the top of the list with a long run time.
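For reference, what I'm running is roughly the following (the pgrep pattern is just how I find the ES process on these boxes):

  # find the main Elasticsearch java process and watch its threads
  PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch | head -1)
  top -H -p "$PID"    # -H gives the same per-thread view as hitting "H" interactively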

Trying to stop ES via systemctl times out, and I eventually had to kill -9 it. I also tried kill -9 on the stuck child thread on one node, but since those are threads of the same process rather than separate processes, that just killed the main process.

Restarted ES on all three nodes and it's just about done recovering the unassigned shards and getting back to green. Hopefully Redis didn't miss too much during the interruption, but I'm not clear on what happens to events that Redis thinks it has already handed off to ES if they never make it in.

I'm aware this is a pitiful dearth of usable information. The logs are dense with errors, mostly along the lines of "Node not connected", and I don't even have the dump files from last week. Can anyone point me toward first troubleshooting steps for when this happens again?

Hope to hear from you.

Still happening! I now have three Java heap dump files. Any guidance on what to look for?
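For what it's worth, my plan so far is just to browse them with the stock JDK tooling; the file name below is a placeholder, and this assumes Java 8's jhat is still available on the box:

  jhat -J-Xmx8g java_pid12345.hprof    # then browse http://localhost:7000 for the heap histogram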

Do you have X-Pack Monitoring enabled, or any other type of monitoring?

So far just the standard Linux performance troubleshooting tools: sysstat, iotop, htop. I'm looking at installing X-Pack with the basic license for the monitoring; do I install the ES/Kibana plugins on all nodes?
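The install steps I'm looking at are roughly these, run on every node (paths assume the standard deb/rpm layout):

  # on each Elasticsearch node
  sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install x-pack
  # on each node running Kibana
  sudo /usr/share/kibana/bin/kibana-plugin install x-pack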

It looks like installing X-Pack for ES and Kibana will turn on authentication for our ES instances, which isn't something I want to do quickly on a prod cluster. I'm assuming I'd need to change our shipper configs, and we also have other processes talking to ES directly. What's needed to turn off authentication when the X-Pack plugins are installed?

Yep.

If you apply a basic license then it won't :slight_smile:

Are you sure? I tried it on one node, following the steps at https://www.elastic.co/downloads/x-pack but skipping the parts about generating passwords and putting them in kibana.yml. After restarting ES, "curl localhost:9200/_cluster/health?pretty" went from returning a health summary to returning:
{
  "error" : {
    "root_cause" : [
      {
        "type" : "security_exception",
        "reason" : "missing authentication token for REST request [/_cluster/health?pretty]",
        "header" : {
          "WWW-Authenticate" : "Basic realm=\"security\" charset=\"UTF-8\""
        }
      }
    ],
    "type" : "security_exception",
    "reason" : "missing authentication token for REST request [/_cluster/health?pretty]",
    "header" : {
      "WWW-Authenticate" : "Basic realm=\"security\" charset=\"UTF-8\""
    }
  },
  "status" : 401
}

I uninstalled X-Pack from ES and Kibana on that node, restarted both plus nginx, and it came back.
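For the record, the removal was just the plugin tooling's remove command, roughly (standard package paths assumed):

  sudo /usr/share/elasticsearch/bin/elasticsearch-plugin remove x-pack
  sudo /usr/share/kibana/bin/kibana-plugin remove x-pack
  sudo systemctl restart elasticsearch kibana nginx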

Does the basic license include a full-featured eval period? Might that have turned on authentication?

You need to install X-Pack and then apply the basic license to disable the auth. By default the included trial license is a platinum level one.
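Once X-Pack is installed, applying the license is a PUT of the downloaded file against the license API, something like the following (the file name is just whatever you saved the basic license as; add -u with the elastic user's credentials if the trial's security is still enforcing auth). If you don't want to depend on the license level at all, xpack.security.enabled: false in elasticsearch.yml and kibana.yml turns security off explicitly.

  curl -XPUT 'localhost:9200/_xpack/license?acknowledge=true' \
       -H 'Content-Type: application/json' \
       -d @basic-license.json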

makes sense, will try it.

  • our cluster is primarily a log store for an application farm; each document is a flat text entry of up to a dozen lines or so
  • each index holds one day's entries, going back to 2013
  • each index contains anywhere from ~20K to 1M entries, with most on the lower end (the high days were usually due to some incident)
  • we don't query the data much, so we're running single shards
  • we keep two replicas; is that overkill? Would one make more sense? (settings sketch just below)
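For context, this is roughly how the per-index sizes and shard settings can be checked, and what dropping to a single replica would look like (the index name in the second command is just an example):

  # per-index primary/replica counts, document counts, and on-disk size
  curl 'localhost:9200/_cat/indices?v&h=index,pri,rep,docs.count,store.size&s=index'

  # drop one daily index from two replicas to one (example index name)
  curl -XPUT 'localhost:9200/logs-2017.12.22/_settings' \
       -H 'Content-Type: application/json' \
       -d '{"index": {"number_of_replicas": 1}}'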

Been digging further all day and have found some interesting bits but no clear problem or solution:

  • Back on 12/22 we started getting a boatload more events from our app farm, and the rate of new events is climbing steadily, currently ~20-30x the previous rate. Waiting to hear from our devs whether they know why. (A quick way to pull the per-day counts is sketched after this list.)
  • The three VMs running our nodes seem to be bottlenecked on read IO against their data disks (virtual disks). I moved the data disks on two nodes to local host storage and the IO waits went away, but the nodes still periodically hit 100% CPU with load averages in the hundreds.
  • Wondering if Kibana queries were hammering ES with reads, I tried stopping Kibana on all three nodes; no change.
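A quick sketch of pulling per-day document counts to see the jump around 12/22 (the logstash-* pattern is a placeholder for whatever the daily indices are actually named):

  curl 'localhost:9200/_cat/indices/logstash-*?v&h=index,docs.count,store.size&s=index'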

Thanks very much for your help so far, will keep you posted!

You may want to look at your indexing/sharding strategy.

1 million events isn't likely to be massive (in GB terms), and you may find that fewer, larger shards help.
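For example, the daily indices could be consolidated into monthly ones with the reindex API. A rough sketch with illustrative index names (extend the source list for the rest of the month, and verify document counts before deleting any dailies):

  curl -XPOST 'localhost:9200/_reindex' \
       -H 'Content-Type: application/json' \
       -d '{
             "source": { "index": ["logs-2017.12.01", "logs-2017.12.02"] },
             "dest":   { "index": "logs-2017.12" }
           }'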
