Sorry for the lack of information here, but we have been experiencing massive cluster instability for the last week or so. We upgraded to 1.5.2 last week and were stable for a full week before the issues started. It looks like the master node loses connectivity with a data node; we see TimeOut exceptions in our logs. Restarting the affected node doesn't fix it, and even when we powered the node off entirely, the cluster still thought it was joined. The only fix was to completely stop every node and restart the whole cluster. The cluster would then rebuild fine, and the same thing would happen again. Whenever it did, EVERY node stopped responding to REST API calls. Any call other than 'localhost:9200' or 'localhost:9200/_cluster/health' would just hang and never return (which also means no searching, plugins don't work, etc.).
Size: 2 TB
Master: 4 nodes
Client: 4 nodes
Data: 3 nodes
Things I've tried:
- Moving all nodes to the same network
- Reducing the number of shards in the cluster
- Downgrading to 1.5.0 (when I tried starting a 1.5.0 node, I got a Java stack trace: CorruptStateException: State version mismatch expected: )
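For reference, this is roughly what the checks look like when the cluster wedges (a minimal sketch; the `-m 5` timeout is my own addition so curl gives up after 5 seconds instead of blocking forever, and `_cat/nodes` / `_cluster/pending_tasks` are just standard examples of calls beyond the two that still answer):

```shell
# The only two endpoints that still respond in the broken state:
curl -m 5 'localhost:9200'
curl -m 5 'localhost:9200/_cluster/health?pretty'

# Examples of calls that hang and never return (anything beyond the two above),
# which is why searching and plugins stop working:
curl -m 5 'localhost:9200/_cat/nodes?v'
curl -m 5 'localhost:9200/_cluster/pending_tasks?pretty'
```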
This is extremely frustrating and has been impacting production for days now!