New documents are not indexed/searchable when 2 of 3 nodes in the cluster are offline

I have 3 nodes in the cluster. All of them are master and data nodes. When all nodes are online, indexing works fine.
If I turn off only one node, the cluster still operates normally, but when 2 nodes go offline, indexing appears to work (the request returns a successful result), yet the indexed documents are not searchable. The document count is also not updated, e.g. running /test_index/_count returns the old value.

My goal is to have a 3-node cluster where, if two nodes go offline, I can still index and query from the 3rd node.

Any ideas why this could happen and how to achieve this goal?
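To make the check concrete, here is roughly what I run (the document body is just an example):

```
POST /test_index/_doc
{ "message": "example document" }

GET /test_index/_count
```

The POST succeeds even with 2 nodes down, but the count returned by the GET does not change.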

Here are the technical details:

Number of shards: 15
Number of replicas: 2
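The index was created with these settings (a sketch of the equivalent request, using the same `test_index` name as above):

```
PUT /test_index
{
  "settings": {
    "number_of_shards": 15,
    "number_of_replicas": 2
  }
}
```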

node-1 config:

cluster.name: "cluster_name"
node.name: "node-1"
node.master: true
node.data: true
network.host: [_local_, "10.0.2.170"]
discovery.seed_hosts: ["10.0.2.170", "10.0.2.171", "10.0.2.172"]
action.auto_create_index: "*"
cluster.initial_master_nodes: ["node-1", "node-2", "node-3"]

node-2 config:

cluster.name: "cluster_name"
node.name: "node-2"
node.master: true
node.data: true
network.host: [_local_, "10.0.2.171"]
discovery.seed_hosts: ["10.0.2.170", "10.0.2.171", "10.0.2.172"]
action.auto_create_index: "*"
cluster.initial_master_nodes: ["node-1", "node-2", "node-3"]

node-3 config:

cluster.name: "cluster_name"
node.name: "node-3"
node.master: true
node.data: true
network.host: [_local_, "10.0.2.172"]
discovery.seed_hosts: ["10.0.2.170", "10.0.2.171", "10.0.2.172"]
action.auto_create_index: "*"
cluster.initial_master_nodes: ["node-1", "node-2", "node-3"]

All of them are running on AWS t2.micro instances.

When 2 nodes are offline and I try to get cluster health from the 3rd node (using _cluster/health), I get:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "master_not_discovered_exception",
        "reason" : null
      }
    ],
    "type" : "master_not_discovered_exception",
    "reason" : null
  },
  "status" : 503
}

Using _cluster/stats, I get:

{
  "_nodes" : {
    "total" : 3,
    "successful" : 1,
    "failed" : 2,
    "failures" : [
      {
        "type" : "failed_node_exception",
        "reason" : "Failed node [cQ8Z2v3TSFeF8eXs-OfIyw]",
        "node_id" : "cQ8Z2v3TSFeF8eXs-OfIyw",
        "caused_by" : {
          "type" : "node_not_connected_exception",
          "reason" : "[node-2][10.0.2.171:9300] Node not connected"
        }
      },
      {
        "type" : "failed_node_exception",
        "reason" : "Failed node [boNPPrk-SaWSOtAx_ZfwMA]",
        "node_id" : "boNPPrk-SaWSOtAx_ZfwMA",
        "caused_by" : {
          "type" : "node_not_connected_exception",
          "reason" : "[node-3][10.0.2.172:9300] Node not connected"
        }
      }
    ]
  }
}

If one of the offline nodes is turned back on, the same queries return the following results (now we have 2 available nodes):

_cluster/health:

{
  "cluster_name" : "cluster_name",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 20,
  "active_shards" : 40,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 15,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 72.72727272727273
}

_cluster/stats:

{
  "_nodes" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  }
}

If the 3rd node also becomes available (now all 3 nodes are online):

_cluster/health

{
  "cluster_name" : "cluster_name",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 20,
  "active_shards" : 55,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

_cluster/stats

{
  "_nodes" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  }
}

A majority of the master-eligible nodes must always be available in order to have a fully functioning cluster, so with three master-eligible nodes you can only tolerate one node being offline. What you describe is therefore not possible.
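The quorum arithmetic behind this can be sketched as follows (a minimal illustration, not Elasticsearch code):

```python
def majority(master_eligible_nodes: int) -> int:
    # Smallest strict majority of the voting (master-eligible) nodes:
    # the number of votes a master election needs to succeed.
    return master_eligible_nodes // 2 + 1

def tolerable_failures(master_eligible_nodes: int) -> int:
    # Nodes that can be lost while the remainder still forms a majority.
    return master_eligible_nodes - majority(master_eligible_nodes)

# With 3 master-eligible nodes, an election needs 2 votes,
# so only 1 node may be offline at a time.
print(majority(3), tolerable_failures(3))  # -> 2 1
```

Note that adding a 4th master-eligible node would not help: the majority rises to 3, and you still tolerate only one failure.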

Thanks for the quick reply.
So it means, I can't afford to lose two nodes, but this guide says I can: https://www.elastic.co/guide/en/elasticsearch/guide/current/replica-shards.html#_balancing_load_with_replicas

> As a bonus, we have also increased our availability. We can now afford to lose two nodes and still have a copy of all our data.

We are not losing old data, but if I can't index new documents while 2 nodes are offline, that effectively means I lose the new ones.

You can lose two nodes without losing data, but you will not have a fully functioning cluster that can index or update data. Indexing requires an active master node in order to prevent data loss, and electing a master requires a majority of the master-eligible nodes to be available. The guide you linked is only talking about not losing existing data.

