ES goes red when nodes restart

Hi

I am using ES version 6.4. I have 3 machines, each running one ES master node and one ES data node as separate processes: 6 nodes in total, 3 masters and 3 data nodes. The replication factor is set to 1.

When we restart all nodes together, ES goes into a red state. We have set the minimum master nodes and minimum data nodes required to 2. What else could lead to this issue?

metrics-2019.08.01   2 p UNASSIGNED CLUSTER_RECOVERED
metrics-2019.08.01   2 r UNASSIGNED CLUSTER_RECOVERED
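
For reference, shard listings like the above typically come from the cat shards API; a sketch limited to this index (the column selection is illustrative):

```
GET _cat/shards/metrics-2019.08.01?v&h=index,shard,prirep,state,unassigned.reason
```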


{
  "index": "metrics-2019.08.01",
  "shard": 2,
  "primary": true,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "CLUSTER_RECOVERED",
    "at": "2019-08-01T19:22:50.394Z",
    "last_allocation_status": "no_valid_shard_copy"
  },
  "can_allocate": "no_valid_shard_copy",
  "allocate_explanation": "cannot allocate because a previous copy of the primary shard existed but can no longer be found on the nodes in the cluster",
  "node_allocation_decisions": [
    {
      "node_id": "I8hrSGVsQcO0c7DQTdmdgA",
      "node_name": "metrics-datastore-1",
      "transport_address": "192.168.25.79:9300",
      "node_attributes": {
        "ml.machine_memory": "33566429184",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "ml.enabled": "true"
      },
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "L-TlEqTJRjuQKJBMFsnSgw",
      "node_name": "metrics-datastore-0",
      "transport_address": "192.168.25.18:9300",
      "node_attributes": {
        "ml.machine_memory": "33566429184",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "ml.enabled": "true"
      },
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "zTKAccDPSZezu7iyYbVVww",
      "node_name": "metrics-datastore-2",
      "transport_address": "192.168.25.53:9300",
      "node_attributes": {
        "ml.machine_memory": "33566429184",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "ml.enabled": "true"
      },
      "node_decision": "no",
      "store": {
        "found": false
      }
    }
  ]
}
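
For reference, output like the above comes from the cluster allocation explain API; a sketch for this shard (index, shard, and primary values taken from the post):

```
GET _cluster/allocation/explain
{
  "index": "metrics-2019.08.01",
  "shard": 2,
  "primary": true
}
```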

This response is indicating that the data for this shard is completely gone. Is every shard unassigned or is it just some of them? Can you share the output of GET _cluster/health? Also could you share your elasticsearch.yml files (use https://gist.github.com if they are too large to fit here).

Hi, only one index among the 4 or 5 indices goes red.

{
  "cluster_name" : "metrics-datastore",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 6,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 14,
  "active_shards" : 28,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 2,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 93.33333333333333
}

ES config file

cluster.name: metrics-datastore
node.name: metrics-datastore-0
node.master: false
node.data: true
node.max_local_storage_nodes: 1
path.data: /data/elasticsearch/data,logs/elasticsearch/data 
path.logs: /logs/elasticsearch
path.repo: /cfs/data/harmony_backup/esbackup
bootstrap.memory_lock: true
http.port: 9200
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.unicast.hosts: metrics-master
discovery.zen.fd.ping_interval: 10s
discovery.zen.fd.ping_retries: 5
discovery.zen.fd.ping_timeout: 120s
gateway.recover_after_master_nodes: 2
gateway.recover_after_time: 5m
http.cors.enabled: false
indices.fielddata.cache.size: 10%
indices.memory.index_buffer_size: 30%
thread_pool.write.queue_size: 2000
network.bind_host: <IP>, _local_
network.publish_host: <IP>
logger.gateway: TRACE
gateway.expected_nodes: 6
gateway.expected_master_nodes: 3
gateway.expected_data_nodes: 3
gateway.recover_after_data_nodes: 2
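
For reference, the discovery.zen.minimum_master_nodes: 2 value above matches the usual quorum formula for 3 master-eligible nodes:

```
minimum_master_nodes = floor(master_eligible_nodes / 2) + 1
                     = floor(3 / 2) + 1
                     = 2
```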

The index settings are as below:

{
  "index_patterns": ["*"],
  "order": 1,
  "settings": {
    "index.number_of_replicas": 1,
    "index.number_of_shards": 3,
    "index.merge.scheduler.max_thread_count": 1,
    "index.refresh_interval": "30s",
    "index.translog.durability": "async",
    "index.translog.flush_threshold_size": "1g",
    "index.translog.sync_interval": "10s",
    "index.unassigned.node_left.delayed_timeout": "10m",
    "index.mapping.total_fields.limit": 3000
  }
}
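
A template body like the above is applied with the index templates API; a sketch assuming the template is named metrics_template (the actual name is not given in the post):

```
PUT _template/metrics_template
{
  "index_patterns": ["*"],
  "order": 1,
  "settings": {
    "index.number_of_replicas": 1,
    "index.number_of_shards": 3
  }
}
```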

Regarding node.max_local_storage_nodes: 1, this is the default for this setting. Are you setting it to something other than 1 on any of your nodes?

Regarding path.data, this tells Elasticsearch to split its data between /data/elasticsearch/data and $ES_HOME/logs/elasticsearch/data. This is a fairly unusual configuration. Why are you set up like this? It's normally better to just use a single data path. Is it possible that $ES_HOME/logs is being cleared out on a restart?

No, we don't set max_local_storage_nodes to any other value.

And about the paths: we have two disks, one mounted at /data and another at <es_home>/logs, and they don't get cleared.

I don't have any further ideas. The shard data is no longer anywhere that Elasticsearch can find it, so either it's looking in the wrong places or else something else has deleted it.

Have you specified different data paths for the master and data nodes running on the same host? If not, I assume the nodes may flip directories depending on which order they come up in.

That was my guess too, but that doesn't happen if node.max_local_storage_nodes: 1 is set on every node.


Hi

Thanks for helping me out. We found the issue and fixed it. It was related to path.data, where we had given two paths and one of them was getting corrupted because of a disk issue. We can close this topic as resolved.
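
For reference, once the corrupted disk is dealt with, a single data path per node avoids this class of failure, as suggested earlier in the thread; a minimal sketch (the exact directories are illustrative):

```
# one data directory per node, on a single healthy disk
path.data: /data/elasticsearch/data
path.logs: /logs/elasticsearch
```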
