Problem with shards when upgrading to 2.3.3

Hi,

I have upgraded a 3-node cluster from 2.1 to 2.3.3, and now I have a problem with shard replicas not being allocated on all of my nodes.

I upgraded each node this way:
1. Turned off Elasticsearch
2. Upgraded Elasticsearch from the yum repositories
3. Updated the node with yum
4. Checked configurations and permissions
5. Started Elasticsearch
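For what it's worth, the 2.x rolling-upgrade documentation also recommends disabling shard allocation before stopping each node and re-enabling it after the node rejoins; skipping this step can leave the cluster shuffling shards mid-upgrade. A sketch of that documented step (not something done in this thread):

```shell
# Before stopping a node: stop the master from reallocating its shards elsewhere
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "none" }
}'

# ... stop the node, upgrade the package, start it and wait for it to rejoin ...

# After the node rejoins: re-enable allocation so replicas can recover
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "all" }
}'
```

Note this is the transient cluster-level setting; it is independent of any per-index allocation settings.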

When the first node restarted, it joined the cluster and everything became green.
After the second node was upgraded, replicas would not allocate on it. I thought the problem could be the last non-upgraded node, which had become master, so after a while I upgraded it too.
While it was upgrading, replicas started allocating on the second node (the one on which they would not allocate before); then, after the upgrade of the third node finished, I had the same problem there: replicas would not allocate.
I decided to wait over the weekend because I read that on 2.3.3 indexing can be very slow. Now the state is what you can see in the image: new indices allocate on all nodes, old indices don't.

The nodes are CentOS 7 with Logstash 2.3.3 and JDK 1.8.

Does anyone have an idea of what causes this problem?

Thank you,
Miso

Please try a reroute command with the explain parameter turned on:

(see https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-reroute.html )

curl -XPOST 'localhost:9200/_cluster/reroute?explain' -d '{
    "commands" : [
        {
          "allocate" : {
              "index" : "winlogbeat-2016.07.01", "shard" : 0, "node" : "ith-grs-sec-centos03"
          }
        }
    ]
}'

Hi,

I launched the command; here's the output:

{
  "acknowledged": true,
  "explanations": [
    {
      "command": "allocate",
      "decisions": [
        { "decider": "filter", "decision": "YES", "explanation": "node passes include/exclude/require filters" },
        { "decider": "enable", "decision": "YES", "explanation": "allocation disabling is ignored" },
        { "decider": "shards_limit", "decision": "YES", "explanation": "total shard limit disabled: [index: -1, cluster: -1] <= 0" },
        { "decider": "same_shard", "decision": "YES", "explanation": "shard is not allocated to same node or host" },
        { "decider": "awareness", "decision": "YES", "explanation": "no allocation awareness enabled" },
        { "decider": "disable", "decision": "YES", "explanation": "allocation disabling is ignored" },
        { "decider": "node_version", "decision": "YES", "explanation": "target node version [2.3.3] is same or newer than source node version [2.3.3]" },
        { "decider": "disk_threshold", "decision": "YES", "explanation": "enough disk for shard on node, free: [22.4gb]" },
        { "decider": "throttling", "decision": "YES", "explanation": "below shard recovery limit of [2]" },
        { "decider": "replica_after_primary_active", "decision": "YES", "explanation": "primary is already active" },
        { "decider": "snapshot_in_progress", "decision": "YES", "explanation": "shard not primary or relocation disabled" }
      ],
      "parameters": {
        "allow_primary": false,
        "index": "winlogbeat-2016.07.01",
        "node": "ith-grs-sec-centos03",
        "shard": 1
      }
    }
  ],
  "state": {
    "blocks": {},
    "master_node": "KMqX1LEhR1ubRj5ixdiFLw",
    "nodes": {
      "KMqX1LEhR1ubRj5ixdiFLw": {
        "attributes": { "master": "true", "rack": "ith-grs-sec-centos08" },
        "name": "ith-grs-sec-centos08",
        "transport_address": "10.200.144.25:9300"
      },
      "W4MbeuAOTC-UbLovyqi4QA": {
        "attributes": { "master": "true", "rack": "ith-grs-sec-centos03" },
        "name": "ith-grs-sec-centos03",
        "transport_address": "10.200.144.23:9300"
      },
      "j017l3miS6iLYlD4fLyHng": {
        "attributes": { "master": "true", "rack": "ith-grs-sec-centos04" },
        "name": "ith-grs-sec-centos04",
        "transport_address": "10.200.144.27:9300"
      }
    },
    "routing_nodes": {

Then there's info on all of my nodes and shards. I removed that part because it is repetitive and the output was very big; the entry for every unassigned replica looks like this:

            {
                "index": "winlogbeat-2016.07.01",
                "node": null,
                "primary": false,
                "relocating_node": null,
                "shard": 2,
                "state": "UNASSIGNED",
                "unassigned_info": {
                    "at": "2016-07-01T15:03:18.945Z",
                    "reason": "REPLICA_ADDED"
                },
                "version": 14
            },

After the command, the shard was allocated.

If you need more info or the complete output, let me know.

Are all the shards now allocated or just this one? Are shards allocated if you issue an empty reroute command?

curl -XPOST 'localhost:9200/_cluster/reroute'

Only one shard was allocated.

No, shards are not allocated. I launched the command with explain but got no explanations:

{
  "acknowledged": true,
  "explanations": [],
  "state": {
    "blocks": {},
    "master_node": "KMqX1LEhR1ubRj5ixdiFLw",
    "nodes": {
      "KMqX1LEhR1ubRj5ixdiFLw": {
        "attributes": { "master": "true", "rack": "ith-grs-sec-centos08" },
        "name": "ith-grs-sec-centos08",
        "transport_address": "10.200.144.25:9300"
      },
      "W4MbeuAOTC-UbLovyqi4QA": {
        "attributes": { "master": "true", "rack": "ith-grs-sec-centos03" },
        "name": "ith-grs-sec-centos03",
        "transport_address": "10.200.144.23:9300"
      },
      "j017l3miS6iLYlD4fLyHng": {
        "attributes": { "master": "true", "rack": "ith-grs-sec-centos04" },
        "name": "ith-grs-sec-centos04",
        "transport_address": "10.200.144.27:9300"
      }
    },
    "routing_nodes": {

Maybe allocation is just disabled?

Can you show me the cluster settings?

curl -XGET 'http://localhost:9200/_cluster/settings?pretty'

I had already tried that and got:

{
"persistent" : { },
"transient" : { }
}

I tried to enable with:

curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient" : {
    "cluster.routing.allocation.enable" : "all"
  }
}'

and now I get:

{
  "persistent" : { },
  "transient" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "enable" : "all"
        }
      }
    }
  }
}

but after waiting a while, the shards are still not allocated :confused:

I'm running out of ideas here. Can you check that all three nodes are indeed on v2.3.3?

Also output of

curl localhost:9200/_cluster/health?pretty

and

curl localhost:9200/_cat/shards

I checked with _nodes?pretty and all versions are 2.3.3:

"KMqX1LEhR1ubRj5ixdiFLw" : {
"name" : "ith-grs-sec-centos08",
"transport_address" : "10.200.144.25:9300",
"host" : "10.200.144.25",
"ip" : "10.200.144.25",
"version" : "2.3.3",
"build" : "218bdf1",
"http_address" : "10.200.144.25:9200",
"attributes" : {
"rack" : "ith-grs-sec-centos08",
"master" : "true"
},
...
"W4MbeuAOTC-UbLovyqi4QA" : {
"name" : "ith-grs-sec-centos03",
"transport_address" : "10.200.144.23:9300",
"host" : "10.200.144.23",
"ip" : "10.200.144.23",
"version" : "2.3.3",
"build" : "218bdf1",
"http_address" : "10.200.144.23:9200",
"attributes" : {
"rack" : "ith-grs-sec-centos03",
"master" : "true"
},
...
"j017l3miS6iLYlD4fLyHng" : {
"name" : "ith-grs-sec-centos04",
"transport_address" : "10.200.144.27:9300",
"host" : "10.200.144.27",
"ip" : "10.200.144.27",
"version" : "2.3.3",
"build" : "218bdf1",
"http_address" : "10.200.144.27:9200",
"attributes" : {
"rack" : "ith-grs-sec-centos04",
"master" : "true"
},

If you need the entire output, just let me know.

Here it is:

{
  "cluster_name" : "security-test-kibana",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 495,
  "active_shards" : 595,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 395,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 60.1010101010101
}

and finally:

The output is too big for this form, so I'm copying some sample lines here:

winlogbeat-2016.06.30 3 p STARTED 6531 4.4mb 10.200.144.25 ith-grs-sec-centos08
winlogbeat-2016.06.30 3 r UNASSIGNED
winlogbeat-2016.06.30 2 p STARTED 6372 4.4mb 10.200.144.25 ith-grs-sec-centos08
winlogbeat-2016.06.30 2 r UNASSIGNED
winlogbeat-2016.06.30 1 p STARTED 6397 4.3mb 10.200.144.23 ith-grs-sec-centos03
winlogbeat-2016.06.30 1 r UNASSIGNED
winlogbeat-2016.06.30 0 p STARTED 6235 4.3mb 10.200.144.25 ith-grs-sec-centos08
winlogbeat-2016.06.30 0 r UNASSIGNED
winlogbeat-2016.07.01 3 p STARTED 5802 4.1mb 10.200.144.23 ith-grs-sec-centos03
winlogbeat-2016.07.01 3 r UNASSIGNED
winlogbeat-2016.07.01 2 p STARTED 4394 2.9mb 10.200.144.25 ith-grs-sec-centos08
winlogbeat-2016.07.01 2 r STARTED 4394 2.9mb 10.200.144.27 ith-grs-sec-centos04
winlogbeat-2016.07.01 1 r STARTED 4517 3.1mb 10.200.144.23 ith-grs-sec-centos03
winlogbeat-2016.07.01 1 p STARTED 4517 3.1mb 10.200.144.25 ith-grs-sec-centos08
winlogbeat-2016.07.01 0 r STARTED 4461 3mb 10.200.144.23 ith-grs-sec-centos03
winlogbeat-2016.07.01 0 p STARTED 4461 3mb 10.200.144.25 ith-grs-sec-centos08
winlogbeat-2016.07.02 3 r STARTED 3085 2mb 10.200.144.25 ith-grs-sec-centos08
winlogbeat-2016.07.02 3 p STARTED 3085 2mb 10.200.144.27 ith-grs-sec-centos04
winlogbeat-2016.07.02 2 r STARTED 3108 2.1mb 10.200.144.23 ith-grs-sec-centos03
winlogbeat-2016.07.02 2 p STARTED 3108 2.1mb 10.200.144.27 ith-grs-sec-centos04
winlogbeat-2016.07.02 1 r STARTED 3165 2.1mb 10.200.144.23 ith-grs-sec-centos03
winlogbeat-2016.07.02 1 p STARTED 3165 2.1mb 10.200.144.27 ith-grs-sec-centos04
winlogbeat-2016.07.02 0 r STARTED 3194 2.1mb 10.200.144.23 ith-grs-sec-centos03
winlogbeat-2016.07.02 0 p STARTED 3194 2.1mb 10.200.144.27 ith-grs-sec-centos04

The shards from 07.02 onwards are those allocated after the upgrade.
The started shards from 07.01 are those which I allocated with the command @ywelsch suggested.
The shards from 07.01 and before are all like these: primary allocated, replicas unassigned.
If you need the complete output, let me know.
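As a side note, a listing this big can be summarised instead of pasted: `_cat/shards` output pipes nicely through `awk` to count unassigned shards per index. A sketch (the here-doc holds sample lines mimicking the listing above; against a live cluster you would pipe `curl -s localhost:9200/_cat/shards` instead):

```shell
# Count UNASSIGNED shards per index from `_cat/shards` output.
# Column 1 is the index name, column 4 is the shard state.
awk '$4 == "UNASSIGNED" { count[$1]++ }
     END { for (i in count) print i, count[i] }' <<'EOF'
winlogbeat-2016.06.30 3 p STARTED 6531 4.4mb 10.200.144.25 ith-grs-sec-centos08
winlogbeat-2016.06.30 3 r UNASSIGNED
winlogbeat-2016.06.30 2 p STARTED 6372 4.4mb 10.200.144.25 ith-grs-sec-centos08
winlogbeat-2016.06.30 2 r UNASSIGNED
winlogbeat-2016.07.01 3 p STARTED 5802 4.1mb 10.200.144.23 ith-grs-sec-centos03
winlogbeat-2016.07.01 3 r UNASSIGNED
EOF
```

For the sample above this prints one line per index with its unassigned count (2 for 06.30, 1 for 07.01); the order of the lines is whatever `awk`'s array iteration yields.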

Maybe it's easier to provide me with the full output of curl 'http://localhost:9200/_cluster/state?pretty' (upload on http://pastebin.com for example). If it contains sensitive information, send me the link via private message here.

OK, I uploaded the result to my Google Drive because it was too big even for pastebin.

https://drive.google.com/open?id=0B5Vf-vwufarxdU5jQXI0YXJGT3c

Ok, this shows the issue clearly. For many of the indices you have the index setting index.routing.allocation.disable_allocation set to true, disabling allocation for shards of those indices. Please set this to false and shards should be allocating again.

curl -XPUT http://localhost:9200/*/_settings -d '{"index.routing.allocation.disable_allocation": false}'
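After running this, a quick way to confirm the fix took effect is to re-check the settings of one affected index and count the remaining unassigned shards (a sketch; the index name is taken from the listings above):

```shell
# Confirm disable_allocation no longer appears for an affected index
curl -s 'localhost:9200/winlogbeat-2016.07.01/_settings?pretty'

# Count shards still unassigned; this should drop to 0 as replicas recover
curl -s 'localhost:9200/_cat/shards' | grep -c UNASSIGNED
```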

OK, now it is slowly allocating all the shards. It hasn't finished yet, but the problem seems resolved.
I can't understand why the setting was like this, since I told the cluster to enable allocation.

Anyway, thank you very much!