Shards just stay in UNASSIGNED state after upgrade to 6.6

I am preparing to upgrade from Elasticsearch 5.6 to 6.6, and before touching production I did a test upgrade on a small three-node environment with 48 shards and no replicas. The upgrade seemed to go well, with no errors.
But the cluster status is just RED and all shards are UNASSIGNED:

# curl 'http://elast:9200/_cluster/health?pretty'
{
  "cluster_name" : "graylog-st",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 48,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 0.0
}

and when looking into the shards to see why they don't start, it returns the same info for every shard:
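(The routing information below presumably comes from the cluster state API; a call along these lines, reusing the host from the health check above, would return it:)

curl 'http://elast:9200/_cluster/state/routing_table?pretty'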

 "cluster_name" : "graylog-st",
 "compressed_size_in_bytes" : 30633,
 "cluster_uuid" : "bcUJt_Q",
 "routing_table" : {
   "indices" : {
     "graylog_5135" : {
       "shards" : {
         "1" : [
           {
             "state" : "UNASSIGNED",
             "primary" : true,
             "node" : null,
             "relocating_node" : null,
             "shard" : 1,
             "index" : "graylog_5135",
             "recovery_source" : {
               "type" : "EXISTING_STORE",
               "bootstrap_new_history_uuid" : false
             },
             "unassigned_info" : {
               "reason" : "CLUSTER_RECOVERED",
               "at" : "2019-04-08T12:05:15.462Z",
               "delayed" : false,
               "allocation_status" : "no_valid_shard_copy"
             }
           },
           {
             "state" : "UNASSIGNED",
             "primary" : false,
             "node" : null,
             "relocating_node" : null,
             "shard" : 1,
             "index" : "graylog_5135",
             "recovery_source" : {
               "type" : "PEER"
             },
             "unassigned_info" : {
               "reason" : "CLUSTER_RECOVERED",
               "at" : "2019-04-08T12:05:15.462Z",
                  "delayed" : false,
                  "allocation_status" : "no_attempt"
               }
           }
         ],

Here's an article about the first things to try when your cluster health is red:

Probably a good idea to start here, but do come back to ask further questions if you need.

Thanks. Allocation explain returns:

curl .. '/_cluster/allocation/explain?pretty'
{
  "index" : "tre_0",
  "shard" : 3,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "CLUSTER_RECOVERED",
    "at" : "2019-04-08T12:05:15.476Z",
    "last_allocation_status" : "no_valid_shard_copy"
  },
  "can_allocate" : "no_valid_shard_copy",
  "allocate_explanation" : "cannot allocate because a previous copy of the prima         ry shard existed but can no longer be found on the nodes in the cluster",
  "node_allocation_decisions" : [
    {
      "node_id" : "LVYVce9wQ_-R5lAjyWFlAA",
      "node_name" : "LVYVce9",
      "transport_address" : "10.13.12.5:9300",
      "node_attributes" : {
        "ml.machine_memory" : "4026585088",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "store" : {
        "found" : false
      }
    },
    {
      "node_id" : "gBR9uD7yThu3itlz-argDw",
      "node_name" : "gBR9uD7",
      "transport_address" : "10.13.12.6:9300",
      "node_attributes" : {
        "ml.machine_memory" : "6263009280",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "store" : {
        "found" : false
      }
    },
    {
      "node_id" : "iRycOvCwQ2q1G0Qoz5jYLg",
      "node_name" : "iRycOvC",
      "transport_address" : "10.13.12.7:9300",
      "node_attributes" : {
        "ml.machine_memory" : "4026585088",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "store" : {
        "found" : false
      }
    }
  ]
}

Ok, it's saying that it can't find any copies of this shard anywhere in the cluster. Are you sure you configured the upgrade cluster right? Have you changed the data path, for instance? Are you seeing anything in the logs indicating why it might not be able to find this data?
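One way to check where each node's data actually lives is the fs section of the node stats API, which reports the data paths in use; something like this, with the host name carried over from your earlier examples (filter_path just trims the response down to the paths):

curl 'http://elast:9200/_nodes/stats/fs?pretty&filter_path=nodes.*.fs.data.path'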

David, thanks for the answer.
I went "by the book", following the rolling upgrade instructions:

https://www.elastic.co/guide/en/elasticsearch/reference/5.6/rolling-upgrades.html

But now I have looked into the logs (the upgrade was done a month ago), and there was a complaint about permissions after the upgrade, when trying to start Elasticsearch:

[2019-02-22T16:29:11,408][ERROR][o.e.b.Bootstrap ] [unknown] Exception
java.lang.IllegalStateException: Unable to access 'path.data' (/usr/share/elasticsearch/data)
at org.elasticsearch.bootstrap.FilePermissionUtils.addDirectoryPath(FilePermissionUtils.java:70) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.bootstrap.Security.addFilePermissions(Security.java:299) ~[elasticsearch-6.6.1.jar:6.6.1]

At that time I saw that those directories were owned by root, and I had to change their ownership to elasticsearch before it would start. After that there were no other errors and everything has been running fine, but the indices are just sitting there unassigned.
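(For the record, the ownership fix was along these lines; the path is the one from the log message above, and elasticsearch:elasticsearch as the user and group is an assumption based on the package-install defaults:)

chown -R elasticsearch:elasticsearch /usr/share/elasticsearch/data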

Ok, do you know at what point during the upgrade the cluster health became red? It isn't possible to complete a rolling upgrade with a red cluster because of step 7, "Wait for the node to recover".

Can you find out the index UUID from GET /_cat/indices/tre_0?v and then look for a directory called nodes/0/indices/$INDEX_UUID/3/ inside your data paths? Either it exists, in which case we need to work out why Elasticsearch can't read it, or else it doesn't exist, in which case there's nothing for Elasticsearch to read.
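Concretely, something like this (the h=index,uuid parameter is just one way to pull out the UUID column, and the data path is the one from your log message above):

curl 'http://elast:9200/_cat/indices/tre_0?v&h=index,uuid'
# then, with the uuid it printed, on each node:
ls -l /usr/share/elasticsearch/data/nodes/0/indices/$INDEX_UUID/3/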


Thanks again, David. Yes, I can locate that data path. On node 1 all the directories are there, by which I mean 0, 1, 2, 3 and _state (all with correct permissions, owned by elasticsearch). On node 2 there are just 0, 2 and _state, and on node 3 just 1, 3 and _state.

So the data seems to be there, and ownership along the whole path seems to be correct.

Ok. Do these directories also contain subdirectories and files? Can you list them here?
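For instance, something like this on each node, again assuming the data path from your logs:

find /usr/share/elasticsearch/data/nodes/0/indices/$INDEX_UUID -ls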

Could you restart node 3 and share the logs from when it starts up until it's finished rejoining the cluster?

Also could you create a new test index:

PUT /testindex
{"settings":{"number_of_shards":3,"number_of_replicas":0}}

Then find the UUID of this index with GET /_cat/indices/testindex?v and locate its data folders too. They should be alongside the ones you found above, but I am wondering if perhaps they are not. Can you share what you're seeing (ideally with full paths) in case it's something subtle that another pair of eyes might spot?
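In curl terms, with the same assumed host, that would be something like:

curl -XPUT 'http://elast:9200/testindex' -H 'Content-Type: application/json' -d '{"settings":{"number_of_shards":3,"number_of_replicas":0}}'
curl 'http://elast:9200/_cat/indices/testindex?v&h=index,uuid'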


On reflection I think it'd be useful if you also did a POST /_cluster/reroute?retry_failed=true after restarting this node, to get a full picture in the logs.
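As a curl command, again with the assumed host:

curl -XPOST 'http://elast:9200/_cluster/reroute?retry_failed=true&pretty'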

I created the new index and it was created, but I could not find its data directory where I expected it. Then I ran a find over the whole system, and it looks like the old indices are stored in two locations:
/var/lib/elasticsearch/data
and
/usr/share/elasticsearch/data

but the newly created index testindex has data only under /usr/share..

But I should mention that my path.data setting is empty.

So I set path.data to /usr/share/elasticsearch/data on all 3 nodes and restarted.
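(That is, something like this on each node; the config file location is an assumption based on the package-install default:)

# /etc/elasticsearch/elasticsearch.yml
path.data: /usr/share/elasticsearch/data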

The old shards are still unassigned, but the new one, testindex, is now active.

So yes, this is a test environment, and it had many flaws before (although the Elasticsearch status was green), and I guess those could be the reason for the upgrade problem. I may set up a cleaner 5.6 environment and try another test upgrade.

David, thank you; your troubleshooting questions will help me in future cases when I need to troubleshoot Elasticsearch. Very much appreciated!

P.S. I noticed that X-Pack modules are mentioned in the 6.6 Elasticsearch logs. I don't use X-Pack in production, so I hope it won't cause problems.


X-Pack has been included in the default distributions of Elasticsearch since 6.3; many of its features are free to use, and the ones that require a paid licence are disabled. If you do not want to use even the free features then you can disable them, e.g. by setting xpack.monitoring.enabled: false to completely disable monitoring.
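For example, an elasticsearch.yml fragment along these lines; which of these features you disable is up to you:

xpack.monitoring.enabled: false
xpack.ml.enabled: false
xpack.watcher.enabled: false
xpack.security.enabled: false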
