We have an old production cluster running version 1.6.2 (yes, we should upgrade).
We have 3 master nodes (one is also a data node) and 9 data nodes.
At the moment we are facing a problem: after restarting the master nodes (one by one), the cluster state is red.
Some info:
active_primary_shards : 1138
active_shards : 2276
No pending tasks
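To find which shards are behind the red state, I checked with something like this (a sketch; host and port are assumed local defaults, adjust as needed):

```shell
# Overall health summary, including the number of unassigned shards (works on 1.x)
curl -s 'http://localhost:9200/_cluster/health?pretty'

# List only the shards that are not allocated
curl -s 'http://localhost:9200/_cat/shards' | grep UNASSIGNED
```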
We are using some plugins.
On the master nodes, we see some logs:
[2019-05-08 10:31:25,396][DEBUG][http.netty ] [lg2] Caught exception while handling client http traffic, closing connection
[2019-05-08 10:31:15,287][DEBUG][action.admin.cluster.node.stats] [lg2] failed to execute on node [xFAXo7BAT6O4lxW7hvWVQA] org.elasticsearch.transport.ReceiveTimeoutTransportException: [lg7][inet[/X.X.X.X:9300]][cluster:monitor/nodes/stats[n]] request_id  timed out after [15000ms
[2019-05-08 10:34:03,595][WARN ][repositories ] [lg10] failed to create repository [swift][swift_backup]
org.elasticsearch.common.settings.NoClassSettingsException: failed to load class with value [swift]; tried [swift, org.elasticsearch.repositories.SwiftRepositoryModule, org.elasticsearch.repositories.swift.SwiftRepositoryModule, org.elasticsearch.repositories.swift.SwiftRepositoryModule]
I guess the first log is just a client closing its HTTP connection and is not really important.
The second log is more important: node lg7, a data node, is timing out on node stats requests.
In Kopf, I see heavy load on it; the other nodes have low load.
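Before restarting anything, I wanted to see what lg7 is actually busy with. The hot threads API (available in 1.x) can show that; this is a sketch assuming the node name lg7 and the default HTTP port:

```shell
# Dump the busiest threads on node lg7 to see what is eating CPU
curl -s 'http://localhost:9200/_nodes/lg7/hot_threads?threads=5'
```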
The third log concerns the Swift object storage where we put snapshots. I removed the plugin (elasticsearch/bin/plugin -r swift-repository-plugin) to be sure it is not the problem, but one master keeps emitting those logs.
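My understanding is that the repository definition lives in the cluster state and survives plugin removal, so deregistering it might stop the log spam. A sketch, assuming the repository is registered under the name swift_backup as the log suggests:

```shell
# Deregister the repository; this removes only the registration,
# the snapshot data in the Swift storage itself is not deleted
curl -XDELETE 'http://localhost:9200/_snapshot/swift_backup'
```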
I don't know where the problem is. Should I restart the data node with the high load?
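If I do restart it, I understand the usual practice is to disable shard allocation first so shards are not shuffled around while the node is down, then re-enable it afterwards (a sketch, using the 1.x setting name):

```shell
# Disable shard allocation before restarting the node
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "none" }
}'

# ... restart the node and wait for it to rejoin the cluster ...

# Re-enable allocation so shards can recover
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "all" }
}'
```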
When we open the Kibana interface, sometimes it works and sometimes we get "could not contact elasticsearch".
Any idea?