Node seems to lock up randomly

I have two different five-node ES clusters, both running on the same VMware cluster. Every now and again the cluster health will go to yellow or even red. When I look at Marvel, I'll see that one of the nodes is no longer part of the cluster.

I'll log into the node and can move around various directories just fine until I try to go into the directory ES is installed in. For example, if I just run an ls within /opt (which is where ES is installed), the shell hangs and I cannot do anything within /opt.

I can still go into, say, /var/log and look at various log files, but I don't see anything explaining why that part of the filesystem is inaccessible.
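A hang like that (ls blocking on one directory while the rest of the box works) usually means a process is stuck in uninterruptible sleep on I/O, and that shows up in the kernel log rather than in anything under /var/log. A sketch of what to check from another shell the next time it happens, using standard procps/kernel tooling (nothing here is ES-specific):

```shell
# Processes in uninterruptible sleep ("D" state) are blocked in the
# kernel, typically waiting on disk I/O that never completes.
ps -eo pid,stat,wchan:32,comm | awk 'NR == 1 || $2 ~ /^D/'

# Hung-task and block-layer errors go to the kernel ring buffer,
# not to application log files.
dmesg | grep -iE 'hung_task|blocked for more than|i/o error' | tail -n 20
```

If the D-state list includes the stuck ls (or an ES thread), the wchan column hints at which kernel function it is blocked in.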

Once I reboot, the system comes back up just fine: ES, Kibana, and Marvel all start and the node rejoins the cluster. It then works for a varying amount of time before it happens again; it might be a day or a week, and it's even gone a couple of weeks without a problem.

- OEL 7, kernel 3.10.0-123.el7
- OpenJDK 25.101-b13
- JRE 1.8.0_101-b13
- ES 2.3.4
- Kibana 4.5.4

Each VM has 32 GB of memory and 4 CPUs.

This sounds like a filesystem problem. Are you using NFS mounted inside the VM, or underlying block storage that the hypervisor pulls in?
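For what it's worth, that question can be answered from inside the guest with standard util-linux tools (a sketch; /opt is just the path mentioned above):

```shell
# Any NFS mounts visible inside the VM? No output means the guest
# itself isn't mounting NFS.
findmnt -t nfs,nfs4 || echo "no NFS mounts in the guest"

# What actually backs a given path (source device and fs type).
findmnt -T /opt
```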

The underlying storage is iSCSI; all the Elasticsearch data sits on a virtual hard drive presented to the OS through VMware.
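Given that chain (guest → VMware virtual disk → iSCSI), a stall anywhere along it surfaces inside the guest as block I/O that never completes. A hedged sketch of what to capture inside the VM while the hang is in progress (column positions are from the documented Linux /proc/diskstats format):

```shell
# Column 12 of /proc/diskstats is "I/Os currently in progress"; on a
# stuck device it sits at a non-zero value and never drains.
awk '{ print $3, $12 }' /proc/diskstats

# SCSI resets/aborts/timeouts from the virtual disk layer land in the
# kernel log (reading it may require root).
dmesg | grep -iE 'scsi|abort|reset|timeout' | tail -n 20
```

Sampling the diskstats line twice a few seconds apart also tells you whether the device is merely slow (counters advancing) or wedged (counters frozen with I/Os in flight).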

I had a discussion with some of the folks on my team and we're debating XFS vs. ext4. What is the recommended filesystem? We have another cluster in production that uses ext4; ours, which is having the problems, uses XFS.
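Before comparing the two clusters, it's worth confirming what each node is actually running on. A quick check (the path is an example; substitute your own data directory):

```shell
# Filesystem type backing a given path
df -T /opt

# Filesystem type per block device, cluster-wide sanity check
lsblk -f
```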

We don't really make recommendations there (other than to stay away from NFS as a data mount the application uses directly).

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.