I have a 5-node cluster with two data-only nodes, two master-eligible nodes, and one client node. The client and master-eligible nodes are also my Logstash servers, as they are gargantuan machines that are underutilized when running just a single instance of ES.
I'm testing this right now, but it appears that whenever ES is running on my client node, my primary (filebeat) index gets randomly deleted between 5 and 120 minutes after its creation, and the master node logs:
[2016-09-27 08:48:43,542][WARN ][transport ] [phys-node2] Transport response handler not found of id [38656492]
[2016-09-27 08:48:43,543][WARN ][transport ] [phys-node2] Transport response handler not found of id [38656480]
[2016-09-27 08:48:43,543][WARN ][transport ] [phys-node2] Transport response handler not found of id [38656489]
[2016-09-27 08:48:43,688][WARN ][transport ] [phys-node2] Transport response handler not found of id [38656510]
[2016-09-27 08:48:43,698][WARN ][action.bulk ] [phys-node2] unexpected error during the primary phase for action [indices:data/write/bulk[s]], request [BulkShardRequest to [filebeat-2016.09.27] containing [1] requests]
[filebeat-2016.09.27] IndexNotFoundException[no such index]
at org.elasticsearch.cluster.routing.RoutingTable.shardRoutingTable(RoutingTable.java:108)
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.doRun(TransportReplicationAction.java:461)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at org.elasticsearch.action.support.replication.TransportReplicationAction.doExecute(TransportReplicationAction.java:131)
at org.elasticsearch.action.support.replication.TransportReplicationAction.doExecute(TransportReplicationAction.java:82)
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:137)
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:85)
at org.elasticsearch.action.bulk.TransportBulkAction.executeBulk(TransportBulkAction.java:309)
And the data nodes log a similar warning:
[2016-09-27 08:48:38,820][WARN ][action.bulk ] [hyd-mon-storage01] unexpected error during the primary phase for action [indices:data/write/bulk[s]], request [BulkShardRequest to [filebeat-2016.09.27] containing [6] requests]
[filebeat-2016.09.27] IndexNotFoundException[no such index]
The client node itself, meanwhile, did not produce any relevant logs in /var/log/elasticsearch/*.log.
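For reference, I'm confirming the deletions by polling the cat indices API and noting when the filebeat index vanishes; the host and interval here are just placeholders for my setup:
# poll every 30s and note the time at which filebeat-2016.09.27 disappears
watch -n 30 "curl -s 'localhost:9200/_cat/indices/filebeat-*?v'"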
My ES configs:
Master-eligible nodes:
node.name: phys-node2
node.master: true
node.data: true
Data-only nodes:
node.name: hyd-mon-storage01
node.master: false
node.data: true
Client node:
node.name: load-balance-node
node.master: false
node.data: false
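Beyond the role settings above, each node's elasticsearch.yml only has the usual cluster/discovery settings, roughly like this (the cluster name, bind address, and discovery hosts below are placeholders, not my real values):
cluster.name: my-cluster
network.host: 0.0.0.0
# unicast discovery pointed only at the two master-eligible nodes (hostnames are placeholders)
discovery.zen.ping.unicast.hosts: ["master-host-1", "master-host-2"]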
While reviewing the elasticsearch.yml for the client node, I did notice I had the setting:
node.max_local_storage_nodes: 1
I changed node.max_local_storage_nodes to zero on the client node, since it shouldn't be storing anything locally. I fired the problematic client node back up and will report back on whether max_local_storage_nodes was the culprit.
My Versions:
ES: 2.4.0
LS: 2.2.4
K: 4.6.1
F: 1.3.0
EDIT: After modifying node.max_local_storage_nodes I'm still seeing my filebeat index randomly get deleted. Does anyone have any idea how to troubleshoot this? I'm not positive that my client node is the issue, but I believe it is, so I'm turning it off for the time being to narrow things down further.
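In the meantime, I'm planning to add a guard so nothing can delete indices via a wildcard or _all, and to turn up logging on the masters so the deletion shows up as a cluster state update task. This is just a sketch of what I have in mind; localhost is a placeholder and I'd run it against any node in the cluster:
# require explicit index names for destructive operations, and log cluster state update tasks on the master
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": {
    "action.destructive_requires_name": true,
    "logger.cluster.service": "DEBUG"
  }
}'
The first setting won't stop an explicit DELETE of the full index name, but it should rule out something deleting via a wildcard; the DEBUG logger should make the master log a delete-index task when the index actually gets removed.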