I have a 3-node cluster in which every node is both master-eligible and a data node. When all nodes are online, indexing works fine.
If I turn off one node, the cluster still operates normally. But when 2 nodes go offline, indexing requests still return a successful response, yet the indexed documents are not searchable and the document count does not update (e.g. /test_index/_count still returns the old value).
My goal is a 3-node cluster where, if two of the nodes go offline, I can still index and query from the third node.
Any ideas why this could happen and how to achieve this goal?
Here are the technical details:
Number of shards: 15
Number of replicas: 2
node-1 config:
cluster.name: "cluster_name"
node.name: "node-1"
node.master: true
node.data: true
network.host: [_local_, "10.0.2.170"]
discovery.seed_hosts: ["10.0.2.170", "10.0.2.171", "10.0.2.172"]
action.auto_create_index: "*"
cluster.initial_master_nodes: ["node-1", "node-2", "node-3"]
node-2 config:
cluster.name: "cluster_name"
node.name: "node-2"
node.master: true
node.data: true
network.host: [_local_, "10.0.2.171"]
discovery.seed_hosts: ["10.0.2.170", "10.0.2.171", "10.0.2.172"]
action.auto_create_index: "*"
cluster.initial_master_nodes: ["node-1", "node-2", "node-3"]
node-3 config:
cluster.name: "cluster_name"
node.name: "node-3"
node.master: true
node.data: true
network.host: [_local_, "10.0.2.172"]
discovery.seed_hosts: ["10.0.2.170", "10.0.2.171", "10.0.2.172"]
action.auto_create_index: "*"
cluster.initial_master_nodes: ["node-1", "node-2", "node-3"]
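The three configs above are identical except for node.name and the node's own address in network.host. A minimal Python sketch that renders them from one template (the helper names render_configs and quoted_list are hypothetical, and copying the files to each server is left out):

```python
# Hypothetical helper: render the three elasticsearch.yml files shown above
# from one template, since they differ only in node.name and the node's own IP.
SEED_HOSTS = ["10.0.2.170", "10.0.2.171", "10.0.2.172"]
NODE_NAMES = ["node-1", "node-2", "node-3"]

TEMPLATE = """\
cluster.name: "cluster_name"
node.name: "{name}"
node.master: true
node.data: true
network.host: [_local_, "{ip}"]
discovery.seed_hosts: [{seeds}]
action.auto_create_index: "*"
cluster.initial_master_nodes: [{masters}]
"""


def quoted_list(items):
    """Format a Python list as the YAML flow sequence used in the configs."""
    return ", ".join(f'"{item}"' for item in items)


def render_configs():
    """Return {node_name: config_text}, pairing each node name with its IP."""
    configs = {}
    for name, ip in zip(NODE_NAMES, SEED_HOSTS):
        configs[name] = TEMPLATE.format(
            name=name,
            ip=ip,
            seeds=quoted_list(SEED_HOSTS),
            masters=quoted_list(NODE_NAMES),
        )
    return configs


if __name__ == "__main__":
    for name, text in render_configs().items():
        print(f"--- {name} ---")
        print(text)
```

Keeping a single template makes it harder for the seed host list or the initial master list to drift between nodes.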
All nodes run on AWS t2.micro instances.
When 2 nodes are offline and I request cluster health from the third node (_cluster/health), I get:
{
  "error" : {
    "root_cause" : [
      {
        "type" : "master_not_discovered_exception",
        "reason" : null
      }
    ],
    "type" : "master_not_discovered_exception",
    "reason" : null
  },
  "status" : 503
}
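The master_not_discovered_exception indicates that the remaining node cannot find or elect a master. Assuming Elasticsearch 7.x or later (the discovery.seed_hosts and cluster.initial_master_nodes settings point to 7.x), electing a master requires a majority (quorum) of the master-eligible nodes in the voting configuration. A small Python sketch of that arithmetic for 3 voting nodes (function names are hypothetical):

```python
def majority(voting_nodes):
    """Smallest number of voting master-eligible nodes that forms a quorum."""
    return voting_nodes // 2 + 1


def can_elect_master(voting_nodes, nodes_online):
    """True if the online voters reach a majority (assumed election rule)."""
    return nodes_online >= majority(voting_nodes)


if __name__ == "__main__":
    for online in (3, 2, 1):
        print(f"3 voters, {online} online -> master possible: "
              f"{can_elect_master(3, online)}")
```

Under that assumption, two online nodes out of three still reach the quorum of 2, which matches the cluster working with one node down, while a single node does not.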
Using _cluster/stats, I'm getting:
{
  "_nodes" : {
    "total" : 3,
    "successful" : 1,
    "failed" : 2,
    "failures" : [
      {
        "type" : "failed_node_exception",
        "reason" : "Failed node [cQ8Z2v3TSFeF8eXs-OfIyw]",
        "node_id" : "cQ8Z2v3TSFeF8eXs-OfIyw",
        "caused_by" : {
          "type" : "node_not_connected_exception",
          "reason" : "[node-2][10.0.2.171:9300] Node not connected"
        }
      },
      {
        "type" : "failed_node_exception",
        "reason" : "Failed node [boNPPrk-SaWSOtAx_ZfwMA]",
        "node_id" : "boNPPrk-SaWSOtAx_ZfwMA",
        "caused_by" : {
          "type" : "node_not_connected_exception",
          "reason" : "[node-3][10.0.2.172:9300] Node not connected"
        }
      }
    ]
  }
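To see from a script which nodes the surviving node cannot reach, the _nodes.failures array of the _cluster/stats response can be parsed. A minimal Python sketch working on an abbreviated copy of the response above (fetching it over HTTP is left out, and the function name unreachable_node_names is hypothetical):

```python
import json
import re

# Abbreviated copy of the _cluster/stats response shown above
# (the "reason": "Failed node [...]" fields are omitted).
STATS_RESPONSE = """
{
  "_nodes": {
    "total": 3,
    "successful": 1,
    "failed": 2,
    "failures": [
      {"type": "failed_node_exception",
       "node_id": "cQ8Z2v3TSFeF8eXs-OfIyw",
       "caused_by": {"type": "node_not_connected_exception",
                     "reason": "[node-2][10.0.2.171:9300] Node not connected"}},
      {"type": "failed_node_exception",
       "node_id": "boNPPrk-SaWSOtAx_ZfwMA",
       "caused_by": {"type": "node_not_connected_exception",
                     "reason": "[node-3][10.0.2.172:9300] Node not connected"}}
    ]
  }
}
"""


def unreachable_node_names(stats):
    """Return node names from the _nodes.failures entries of a _cluster/stats response.

    The name is read from a reason starting with "[name]"; otherwise the node_id is used.
    """
    names = []
    for failure in stats.get("_nodes", {}).get("failures", []):
        reason = failure.get("caused_by", {}).get("reason", "")
        match = re.match(r"\[([^\]]+)\]", reason)
        names.append(match.group(1) if match else failure.get("node_id", "?"))
    return names


if __name__ == "__main__":
    stats = json.loads(STATS_RESPONSE)
    nodes = stats["_nodes"]
    print(f"{nodes['successful']}/{nodes['total']} nodes responded;"
          f" unreachable: {unreachable_node_names(stats)}")
```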
If I turn one of the offline nodes back on (so 2 nodes are available), the same queries return the following:
_cluster/health:
{
  "cluster_name" : "cluster_name",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 20,
  "active_shards" : 40,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 15,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 72.72727272727273
}
_cluster/stats:
{
  "_nodes" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  }
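The 15 unassigned shards in the two-node output are consistent with the index settings. Assuming the 15 shards / 2 replicas listed above belong to test_index, each shard has 3 copies, and since Elasticsearch never allocates two copies of the same shard to one node, only 2 of those 3 copies can be assigned while 2 nodes are online. A small Python sketch of that arithmetic, plus the active_shards_percent figure (40 of 55 copies active, assuming no initializing shards); the function names are hypothetical:

```python
def unassigned_copies(primaries, replicas, data_nodes):
    """Shard copies of one index that cannot be assigned to any node.

    Two copies of the same shard never share a node, so each shard can
    have at most `data_nodes` assigned copies.
    """
    copies_per_shard = 1 + replicas
    return primaries * (copies_per_shard - min(copies_per_shard, data_nodes))


def active_shards_percent(active, unassigned):
    """Percentage of shard copies that are active, assuming none are initializing."""
    return 100.0 * active / (active + unassigned)


if __name__ == "__main__":
    # Assumed: test_index has the 15 primaries and 2 replicas listed above.
    for nodes in (3, 2, 1):
        print(f"{nodes} data node(s): {unassigned_copies(15, 2, nodes)} unassigned copies")
    # active_shards and unassigned_shards taken from the two-node health output.
    print(f"active_shards_percent: {active_shards_percent(40, 15):.2f}")
```

With the three-node figures (55 active, 0 unassigned), the same percentage function returns 100.0, matching the green health output.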
When the third node comes back as well (all 3 nodes online):
_cluster/health:
{
  "cluster_name" : "cluster_name",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 20,
  "active_shards" : 55,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
_cluster/stats:
{
  "_nodes" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  }