Restarted one node - Kibana reports all data nodes as down

Hi there
I am running a recreational cluster at home for firewall logs and some nginx traffic logs. Recently I added two nodes to my setup, so there are now two data nodes and one dedicated master (non-data) node.

Today I tried to restart the Elasticsearch service on one of the data nodes, and suddenly Kibana reported both data nodes as down. Only the master was reported as up.

/_cat/shards shows a bunch of unassigned shards after the reallocation stops, and restarting the node again resulted in even more. From the looks of it, all primary shards have been allocated, but the replicas have not.

Checking one of the indices with unassigned shards shows that replication is enabled:

{
  "fortigate-2019.11.19" : {
    "settings" : {
      "index" : {
        "creation_date" : "1574121601277",
        "number_of_shards" : "5",
        "number_of_replicas" : "1",
        "uuid" : "_lUL_6mjQYKGtP_LJ5K3ww",
        "version" : {
          "created" : "6080299",
          "upgraded" : "7050099"
        },
        "provided_name" : "fortigate-2019.11.19"
      }
    }
  }
}
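
For reference, the settings above can be pulled with something like this (same host and index as in my case; adjust as needed):

curl -s http://192.168.70.150:9200/fortigate-2019.11.19/_settings?pretty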

All nodes seem to be detected from each of the members:

{
  "cluster_name" : "siem",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 2182,
  "active_shards" : 3398,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 965,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 77.88219115287646
}
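
That output is from the cluster health API, which can be run against any of the nodes, e.g.:

curl -s http://192.168.70.150:9200/_cluster/health?pretty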

Cluster config (grep -e '^[^#]' /etc/elasticsearch/elasticsearch.yml)

cluster.name: siem
node.name: siem-1
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
network.host: 192.168.70.150
discovery.seed_hosts:
    - 192.168.70.150
    - 192.168.70.161
cluster.max_shards_per_node: 4000

cluster.name: siem
node.name: siem-2
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
network.host: 192.168.70.161
discovery.seed_hosts:
    - 192.168.70.150
    - 192.168.70.161
cluster.max_shards_per_node: 4000

cluster.name: siem
node.name: siem-master
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
network.host: 192.168.70.162
discovery.seed_hosts:
    - 192.168.70.150
    - 192.168.70.161
    - 192.168.70.162
node.master: true
node.voting_only: false
node.data: false
node.ingest: false
node.ml: false
xpack.ml.enabled: true
cluster.remote.connect: false

The cluster log shows a bunch of these messages:
Caused by: org.elasticsearch.action.UnavailableShardsException: [.monitoring-es-7-2019.12.20][0] primary shard is not active Timeout: [1m]

I can probably fix this by running one of the allocation scripts, but I'd rather understand why this happened, if anyone would be up for explaining.
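
For context, the kind of thing I had in mind was asking the cluster to retry the failed allocations, roughly like this (a sketch, not something I have run yet):

curl -s -X POST http://192.168.70.150:9200/_cluster/reroute?retry_failed=true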

Kind regards,
Patrik

Ok, I feel silly. I have the Elastic apt repo configured on all nodes, and an upgrade took place on one of them, so the data nodes ended up on different versions. This is why the replication did not work.
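
To keep apt from upgrading a single node on its own again, holding the package should help (I have not rolled this out on every node yet, so treat it as a sketch):

sudo apt-mark hold elasticsearch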

If anyone else bumps into this, here is how I came to that conclusion and "solved" it:

Finding the reason for the unassigned shards:

curl -s http://192.168.70.150:9200/_cluster/allocation/explain?pretty

This said that the cluster could not allocate the replica shards to the peer node due to a difference in Elasticsearch versions between the nodes.
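
The version mismatch can also be confirmed directly with the cat nodes API, e.g.:

curl -s 'http://192.168.70.150:9200/_cat/nodes?v&h=name,version'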

This command also gave some information:
curl -s http://192.168.70.161:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason | grep UNASSIGNED

In the end I stopped the Elasticsearch process on each of the nodes, then started them one by one. Looks like it is recovering now. Fingers crossed. :slight_smile:
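
For anyone doing this more carefully than I did: the usual advice is to disable replica allocation before restarting a node and re-enable it afterwards, roughly like this (from memory, so double-check against the rolling restart docs):

curl -s -X PUT http://192.168.70.150:9200/_cluster/settings -H 'Content-Type: application/json' -d '{"persistent": {"cluster.routing.allocation.enable": "primaries"}}'
sudo systemctl restart elasticsearch
curl -s -X PUT http://192.168.70.150:9200/_cluster/settings -H 'Content-Type: application/json' -d '{"persistent": {"cluster.routing.allocation.enable": null}}'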

Kind regards,
Patrik
