We have been running Elasticsearch for a few years now. We ran a simple 2-node cluster (version 1.7). This cluster supported some internal utilities, so it was relatively low use. In the last 4 years that cluster has never crashed, been restarted, or even hiccuped.
We decided to set up a more production-focused cluster. I did a lot of research and this is what I came up with for the new cluster:
2 Client Nodes (a.k.a. Coordinating nodes) [4 core, 8GB memory, 300GB HD, Virtual]
3 Master Nodes [4 core, 8GB memory, 300GB HD, Virtual]
3 Data Nodes [48 core, 64GB memory, 3TB HD (RAID 0), Physical]
This cluster is running ES 6.5.4 on CentOS 7 (I am planning to upgrade to 7.1 soon). All of the nodes are operating on essentially vanilla configurations. We only have about 5 million documents and less than 60GB of data total for the cluster. The configuration looks something like this:
# Example Master Config
cluster.name: MYCLUSTER
node.name: MASTER01
node.master: true
node.data: false
node.ingest: false
cluster.remote.connect: false
path.repo: /repo/nfs/path
# Example Data Config
cluster.name: MYCLUSTER
node.name: DATA01
node.master: false
node.data: true
node.ingest: false
cluster.remote.connect: false
path.repo: /repo/nfs/path
# Example Client Config
cluster.name: MYCLUSTER
node.name: CLIENT01
node.master: false
node.data: false
node.ingest: false
cluster.remote.connect: false
# All have
http.port: MY_ES_PORT
discovery.zen.ping.unicast.hosts: MY_LIST_OF_SERVERS(8)
discovery.zen.minimum_master_nodes: 2
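For reference, the minimum_master_nodes value above follows the usual quorum rule for zen discovery, (master-eligible nodes / 2) + 1 with integer division. The small Python sketch below just shows that arithmetic; the function name is my own and is not part of any Elasticsearch API.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Quorum for discovery.zen.minimum_master_nodes: floor(n / 2) + 1."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


# Three dedicated master nodes, as in the layout described in this post.
print(minimum_master_nodes(3))  # prints 2
```

With 3 master-eligible nodes this gives 2, which matches the setting, so I don't believe split brain is the issue here.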
In jvm.options the heap size is at the default 1G for all nodes except the data nodes, which are set to 26G.
The problem is that my data nodes keep crashing. One of my 3 data nodes has crashed 3 times in the last 3 days, and 2 of the 3 data nodes have failed at least once. Bringing a node back online and getting rid of the corrupted pieces of indexes has taken many hours of work and learning. I can't figure out what is making them crash. I see errors in the logs that refer to the "Failed Node" and "CorruptIndexException", but I have no idea what caused the actual node to fail. I have examined the log files of all of the servers, and while they all show errors, none seem to have anything that helps me pinpoint the cause of the failure.
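To make the log review more repeatable, something like the rough Python sketch below could pull out the suspicious lines together with a few lines of preceding context. This is only a sketch under assumptions: the function name and the marker list are hypothetical, "OutOfMemoryError" is a marker I am guessing might be relevant (I have not confirmed it appears in my logs), and the log file path has to be passed on the command line.

```python
import re
import sys
from typing import Iterable, List, Tuple

# Markers quoted in this post, plus one assumed marker (OutOfMemoryError).
PATTERNS = ("CorruptIndexException", "failed node", "OutOfMemoryError")


def find_suspect_lines(
    lines: Iterable[str],
    patterns: Tuple[str, ...] = PATTERNS,
    context: int = 2,
) -> List[Tuple[int, List[str]]]:
    """Return (1-based line number, preceding context + matching line) per hit."""
    regex = re.compile("|".join(re.escape(p) for p in patterns), re.IGNORECASE)
    buffered = list(lines)
    hits = []
    for i, line in enumerate(buffered):
        if regex.search(line):
            start = max(0, i - context)
            block = [l.rstrip("\n") for l in buffered[start:i + 1]]
            hits.append((i + 1, block))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # The log file path is supplied by the caller; nothing is hard-coded.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for lineno, block in find_suspect_lines(fh):
            print(f"--- match at line {lineno} ---")
            print("\n".join(block))
```

Running it against each data node's log around the time of a crash would at least show what was logged immediately before the first corruption error.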
The interwebs seem to suggest that the most common reason for this is hardware. Unfortunately, I don't see any evidence that the hardware is the problem.
Can anyone offer any advice on how I can figure out why the data nodes are crashing?