Clustering failed

Dear experts
I've had a single Elasticsearch server at home for a few months now and it has been working well. However, after a power outage I lost a few indices, and from what I've read it seems like having a cluster would have avoided at least part of the damage.

So I created two more Elasticsearch servers: one master holding no data, and another data node. However, I have not been able to get the cluster working, so I am hoping to get some pointers here.

It looks like data has been replicated between the data nodes, but only one node shows up when I issue the cluster health command. The data nodes have the same cluster_uuid, but the master node does not. Kibana also shows only one node even though I configured it to use all three.
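
(For reference, this is roughly how I checked the cluster UUID and health, run against each node's own IP; the root endpoint returns the cluster_uuid and _cluster/health returns the status shown further down:)

curl http://192.168.70.150:9200/
curl http://192.168.70.150:9200/_cluster/health?pretty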

Any ideas?

Master node (no data)

cluster.name: siem
node.name: siem-master
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
network.host: 192.168.70.162
http.port: 9200
discovery.seed_hosts:
    - 192.168.70.150
    - 192.168.70.161
    - 192.168.70.162
cluster.initial_master_nodes:
    - siem-master
    - siem-1
    - siem-2
node.data: false

Data node 1

cluster.name: siem
node.name: siem-1
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
network.host: 192.168.70.150
http.port: 9200
discovery.seed_hosts:
    - 192.168.70.150
    - 192.168.70.161
    - 192.168.70.162
cluster.initial_master_nodes:
    - siem-master
    - siem-1
    - siem-2
cluster.max_shards_per_node: 2000

Data node 2

cluster.name: siem
node.name: siem-2
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
network.host: 192.168.70.161
http.port: 9200
discovery.seed_hosts:
    - 192.168.70.150
    - 192.168.70.161
    - 192.168.70.162
cluster.initial_master_nodes:
    - siem-master
    - siem-1
    - siem-2
cluster.max_shards_per_node: 2000

Cluster health output (_cluster/health?pretty)

{
    "cluster_name" : "siem",
    "status" : "red",
    "timed_out" : false,
    "number_of_nodes" : 1,
    "number_of_data_nodes" : 0,
    "active_primary_shards" : 0,
    "active_shards" : 0,
    "relocating_shards" : 0,
    "initializing_shards" : 0,
    "unassigned_shards" : 2,
    "delayed_unassigned_shards" : 0,
    "number_of_pending_tasks" : 0,
    "number_of_in_flight_fetch" : 0,
    "task_max_waiting_in_queue_millis" : 0,
    "active_shards_percent_as_number" : 0.0
}

Master Node: siem-master/_cluster/health?pretty

{
    "cluster_name" : "siem",
    "status" : "yellow",
    "timed_out" : false,
    "number_of_nodes" : 1,
    "number_of_data_nodes" : 1,
    "active_primary_shards" : 1961,
    "active_shards" : 1961,
    "relocating_shards" : 0,
    "initializing_shards" : 0,
    "unassigned_shards" : 14,
    "delayed_unassigned_shards" : 0,
    "number_of_pending_tasks" : 0,
    "number_of_in_flight_fetch" : 0,
    "task_max_waiting_in_queue_millis" : 0,
    "active_shards_percent_as_number" : 99.29113924050633
}

Data Node 2: siem-2/_cluster/health?pretty

{
    "cluster_name" : "siem",
    "status" : "yellow",
    "timed_out" : false,
    "number_of_nodes" : 1,
    "number_of_data_nodes" : 1,
    "active_primary_shards" : 1957,
    "active_shards" : 1957,
    "relocating_shards" : 0,
    "initializing_shards" : 0,
    "unassigned_shards" : 11,
    "delayed_unassigned_shards" : 0,
    "number_of_pending_tasks" : 0,
    "number_of_in_flight_fetch" : 0,
    "task_max_waiting_in_queue_millis" : 0,
    "active_shards_percent_as_number" : 99.4410569105691
}

When adding nodes to an existing cluster you should not set cluster.initial_master_nodes, just discovery.seed_hosts, and the new nodes should have an empty data path when started for the first time.

Your config looks ok now (except all three nodes are data nodes?) but I think this wasn't the config used the first time these nodes started up. Best to start again with empty data paths for the two new nodes.
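
For example, the elasticsearch.yml for a brand-new data node joining this cluster would need little more than the sketch below (the node name siem-3 and address 192.168.70.163 are just placeholders; the seed hosts are the ones from your configs). Note that there is no cluster.initial_master_nodes line, and the node must start with an empty /var/lib/elasticsearch:

cluster.name: siem
node.name: siem-3
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
network.host: 192.168.70.163
discovery.seed_hosts:
    - 192.168.70.150
    - 192.168.70.161
    - 192.168.70.162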


Also if you suffered data loss due to a power outage then this indicates something wrong with your disks, often a volatile write cache. The proper fix is to address this misconfiguration, rather than just adding nodes. I'm guessing that you don't have redundant power feeds at your home, so a power failure will likely affect all nodes and therefore has a good chance of corrupting all copies of each active shard if your storage isn't set up for power safety.
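
As a rough illustration only (assuming Linux with plain SATA drives, and that the device really is /dev/sda), hdparm can show and, if needed, disable the drive-level volatile write cache:

sudo hdparm -W /dev/sda      # show whether the drive's write cache is enabled (device name is an assumption)
sudo hdparm -W 0 /dev/sda    # disable the volatile write cache

Whether that is the right fix depends on your storage stack; the point is just that writes Elasticsearch has fsync'd need to actually reach stable storage.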


Thank you, David.
I did a complete reinstall of Ubuntu and then installed Elasticsearch again.

However, the node would not start, and it looks like the culprit was removing the log and data paths (they seem to default to /usr/share/elasticsearch/(data|logs), where the service does not have write permissions).

Re-adding the path.data and path.logs settings made it start, and I now have three nodes in the cluster; indices are slowly turning from yellow to green.

You mentioned that all three nodes are data nodes, but the master node has node.data: false as the last line of its config. Is that not sufficient?

Regarding the outage, I was hoping that some of the data on one of the nodes would be intact rather than corrupted, and that this could be used for "self-healing" (a bit like RAID 5). Perhaps a silly assumption (as most of mine are).

Correct, no redundant power feeds in my home. I'd suggest it to my wife, but I have a feeling the result would be me and my VMware environment being left in the street with no power at all. :)

Thanks for all the help and merry X-mas!

Data node 1

cluster.name: siem
node.name: siem-1
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
network.host: 192.168.70.150
discovery.seed_hosts:
    - 192.168.70.150
    - 192.168.70.161
cluster.max_shards_per_node: 2000

Data node 2

cluster.name: siem
node.name: siem-2
path.logs: /var/log/elasticsearch
network.host: 192.168.70.161
discovery.seed_hosts:
    - 192.168.70.150
    - 192.168.70.161
cluster.max_shards_per_node: 2000

Master node

cluster.name: siem
node.name: siem-master
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
network.host: 192.168.70.162
discovery.seed_hosts:
    - 192.168.70.150
    - 192.168.70.161
    - 192.168.70.162
node.master: true
node.voting_only: false
node.data: false
node.ingest: false
node.ml: false
xpack.ml.enabled: true
cluster.remote.connect: false

My takeaway from this is that the cluster.initial_master_nodes option is used for bootstrapping (setting up a totally new cluster from scratch?), while I was trying to add nodes to a single-node setup.

So removing that option seems to have been the magic that made it work?

Ah apologies, I managed to miss that vital line. Yes, that's sufficient.

This is still relying on luck, and the probability of losing data is still pretty high. Best to fix your storage.

Really it was deleting the contents of the data path that did it. All your nodes had already bootstrapped their own one-node clusters, and you can't merge clusters together once they've formed. But once you've got a working cluster you don't want to accidentally bootstrap another one, which is why we recommend not setting cluster.initial_master_nodes once the cluster has formed.
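
If you want to double-check, something along these lines (using your IPs) should now list all three nodes from any host, and the root endpoint on each host should report the same cluster_uuid:

curl http://192.168.70.162:9200/_cat/nodes?v
curl http://192.168.70.150:9200/
curl http://192.168.70.161:9200/
curl http://192.168.70.162:9200/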


Thank you David for the help. Much appreciated!

