How do I find the reason for a failed data node? (Elasticsearch 6.5)

We have been running elasticsearch for a few years now. We ran a 2 node simple cluster (version 1.7). This cluster supported some internal utilities so it was relatively low use. In the last 4 years that cluster has never crashed, been restarted or even hiccuped.

We decided to set up a more production-focused cluster. I did a lot of research, and this is what I came up with for the new cluster:

2 Client Nodes (a.k.a. Coordinating nodes) [4 core, 8GB memory, 300GB HD, Virtual]
3 Master Nodes [4 core, 8GB memory, 300GB HD, Virtual]
3 Data Nodes [48 core, 64GB memory, 3TB HD (RAID 0), Physical]

This cluster is running ES 6.5.4 on CentOS 7 (I am planning to upgrade to 7.1 soon). All of the nodes are operating on essentially vanilla configurations. We only have about 5 million documents and less than 60GB of data total for the cluster. The configuration looks something like this:

# Example Master Config MYCLUSTER MASTER01
node.master: true
node.data: false
node.ingest: false
cluster.remote.connect: false
path.repo: /repo/nfs/path

# Example Data Config MYCLUSTER DATA01
node.master: false
node.data: true
node.ingest: false
cluster.remote.connect: false
path.repo: /repo/nfs/path

# Example Client Config MYCLUSTER CLIENT01
node.master: false
node.data: false
node.ingest: false
cluster.remote.connect: false

# All have
discovery.zen.minimum_master_nodes: 2

In jvm.options the heap space is at the default 1G for all nodes, except the data nodes, which are at 26G.

The problem is that my data nodes keep crashing. Three times in the last 3 days, one of my 3 data nodes has crashed (2 of the 3 have failed at this point). Bringing a node back online and ridding myself of corrupted pieces of indexes has been many hours of work and learning, but I can't figure out what is making them crash. I see errors in the log that refer to the "Failed Node" and a "CorruptIndexException", but I have no idea what caused the actual node to fail. I have examined the log files on all of the servers, and while they all show errors, none seem to have anything that helps me pinpoint the cause of the failure.
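For reference, these are roughly the places I've been looking so far (paths are the CentOS RPM defaults, and the cluster name in the log filename is just an example):

```shell
#!/bin/sh
# Places to look for the cause of an Elasticsearch node crash.
# Paths are CentOS/RPM defaults -- adjust if your install differs.
ES_LOG_DIR=/var/log/elasticsearch
CLUSTER=mycluster   # example name; substitute your cluster.name

# 1. Elasticsearch's own log: fatal errors, CorruptIndexException, etc.
grep -iE "fatal|corrupt|out of memory" "$ES_LOG_DIR/$CLUSTER.log" 2>/dev/null || true

# 2. JVM crash dumps (hs_err_pid*.log) land in the working dir or log dir.
find "$ES_LOG_DIR" /tmp -maxdepth 1 -name 'hs_err_pid*.log' 2>/dev/null || true

# 3. The kernel OOM killer logs to the journal, not to Elasticsearch.
journalctl -k 2>/dev/null | grep -i "out of memory" || true
```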

The interwebs seem to suggest that the most common reason for this is hardware. Unfortunately, I don't see any evidence that the hardware is the problem.

Can anyone offer any advice on how I can figure out why the data nodes are crashing?

Welcome! :slight_smile:

Just set those to be your master nodes; it's much easier to maintain.

I would suggest you increase that heap. On the master and client nodes you can go to 3GB easily enough. For the data nodes you want to be just under 32GB.
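For reference, heap is set with matching Xms/Xmx flags in jvm.options; staying just under 32GB keeps compressed ordinary object pointers enabled (the exact cutoff varies by JVM, so the 30g below is just a safe example value):

```
# /etc/elasticsearch/jvm.options (data node) -- example values
-Xms30g
-Xmx30g
```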

Posting the logs, or using gist/pastebin/etc and linking, would be really helpful.
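Also, since you mention RAID 0 on the data nodes, it's worth ruling out disk trouble. A rough sketch of what I'd check on each data node (the device name is an example, and smartctl needs the smartmontools package installed):

```shell
#!/bin/sh
# Quick hardware sanity checks on a data node (run as root).
DEV=/dev/sda   # example device -- substitute each disk in the RAID 0 set

# SMART health summary for the disk.
smartctl -H "$DEV" 2>/dev/null || echo "smartctl not available here"

# Kernel ring buffer: I/O errors and OOM-killer activity both show up here.
dmesg -T 2>/dev/null | grep -iE "i/o error|out of memory" || true

# Software RAID status, if you're using md rather than a hardware controller.
cat /proc/mdstat 2>/dev/null || true
```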

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.