Allocation awareness - something not right

I have an 11-node cluster: 3 master nodes and 8 data nodes.
I have 4 physical machines with 2 VMs on each; the physical machines are split between two different rooms.

Data nodes 1, 3, 5 & 7 are in room 1; data nodes 2, 4, 6 & 8 are in the other room.

The goal is to have the primary shard in one room and the replica in the other.

On the odd-numbered nodes I set this in elasticsearch.yml:
cluster.routing.allocation.awareness.attributes: roomid
cluster.routing.allocation.awareness.force.roomid.values: 1,2
node.attr.roomid: 1

On the even-numbered nodes I set this in elasticsearch.yml:
cluster.routing.allocation.awareness.attributes: roomid
cluster.routing.allocation.awareness.force.roomid.values: 1,2
node.attr.roomid: 2

When I check the index, I can see that both copies of shard 3 are in the same room:
index-20190531 1 r STARTED 9704310 11.6gb x.x.x.x node1
index-20190531 1 p STARTED 9704310 11.6gb x.x.x.x node8
index-20190531 2 r STARTED 9706200 11.7gb x.x.x.x node2
index-20190531 2 p STARTED 9706200 11.7gb x.x.x.x node3
index-20190531 3 p STARTED 9704938 11.7gb x.x.x.x node2
index-20190531 3 r STARTED 9704938 11.6gb x.x.x.x node6
index-20190531 5 r STARTED 9705567 11.6gb x.x.x.x node7
index-20190531 5 p STARTED 9705567 11.6gb x.x.x.x node4
index-20190531 4 r STARTED 9707267 11.6gb x.x.x.x node7
index-20190531 4 p STARTED 9707267 11.6gb x.x.x.x node6
index-20190531 0 p STARTED 9703313 11.6gb x.x.x.x node5
index-20190531 0 r STARTED 9703313 11.6gb x.x.x.x node8
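
For reference, a listing like this typically comes from the cat shards API; assuming the index name from this post, a request along these lines would produce it:

GET _cat/shards/index-20190531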

Is there something I am doing wrong or misunderstanding?

Hi @john275 and welcome! Yes, this does look wrong to me:

I'd like to double-check that these settings are applied on every node as you describe. Can you share the full output of the following commands?

GET /_nodes/settings?filter_path=nodes.*.settings.node.attr.roomid,nodes.*.name,nodes.*.settings.cluster.routing.allocation.awareness
GET /_cluster/settings?filter_path=*.cluster.routing.allocation.awareness

So the node settings request yields:
{"nodes":{"sZsWrCkwTbmus19eqs1otA":{"name":"node4","settings":{"cluster":{"routing":{"allocation":{"awareness":{"attributes":"roomid","force":{"roomid":{"values":"1,2"}}}}}},"node":{"attr":{"roomid":"2"}}}},"lWI-nJwuRlaa-uQ4HSSZ8A":{"name":"node6","settings":{"cluster":{"routing":{"allocation":{"awareness":{"attributes":"roomid","force":{"roomid":{"values":"1,2"}}}}}},"node":{"attr":{"roomid":"2"}}}},"a4okbUUvSBSzEXj_2QD3BA":{"name":"node2","settings":{"cluster":{"routing":{"allocation":{"awareness":{"attributes":"roomid","force":{"roomid":{"values":"1,2"}}}}}},"node":{"attr":{"roomid":"2"}}}},"liJIj97NQZuT95-Frnrorg":{"name":"node8","settings":{"cluster":{"routing":{"allocation":{"awareness":{"attributes":"roomid","force":{"roomid":{"values":"1,2"}}}}}},"node":{"attr":{"roomid":"2"}}}},"Q9NLbgGXR5m0UqxL_dUsIw":{"name":"master-a1"},"-Fy_OEmLRVOlZVnRs1i-9g":{"name":"node1","settings":{"cluster":{"routing":{"allocation":{"awareness":{"attributes":"roomid","force":{"roomid":{"values":"1,2"}}}}}},"node":{"attr":{"roomid":"1"}}}},"gGnNGn_RS0Kv4StnAjZoPQ":{"name":"master-a3"},"tce9HEsSSR6Rv4aM3cNz1g":{"name":"node5","settings":{"cluster":{"routing":{"allocation":{"awareness":{"attributes":"roomid","force":{"roomid":{"values":"1,2"}}}}}},"node":{"attr":{"roomid":"1"}}}},"MVmuBaQ3QWalwvHxsz0_VA":{"name":"node7","settings":{"cluster":{"routing":{"allocation":{"awareness":{"attributes":"roomid","force":{"roomid":{"values":"1,2"}}}}}},"node":{"attr":{"roomid":"1"}}}},"xHLpuyCWSk-l9JAM3umK8g":{"name":"master-a2"},"3d-6McBhQGeYUKT8LR7LWw":{"name":"node3","settings":{"cluster":{"routing":{"allocation":{"awareness":{"attributes":"roomid","force":{"roomid":{"values":"1,2"}}}}}},"node":{"attr":{"roomid":"1"}}}}}}

and the cluster settings request yields:
{}

Hmm, OK, that all looks correct to me, thanks. Can you use the allocation explain API to ask about the allocation of the problematic shard? You'll need to specify the shard explicitly, since it is assigned, just not in the right place.

Also, what version are you using?

Thanks for your replies....

I'll have to research using the allocation explain API and reply back here.

Version is:
"version" : {
"number" : "6.6.0",
"build_flavor" : "default",
"build_type" : "deb",
"build_hash" : "a9861f4",
"build_date" : "2019-01-24T11:27:09.439740Z",
"build_snapshot" : false,
"lucene_version" : "7.6.0",
"minimum_wire_compatibility_version" : "5.6.0",
"minimum_index_compatibility_version" : "5.0.0"
},

PS: I don't think this is limited to just one particular shard; it seems to be a cluster-wide issue.

Another example index from today:
index-20190606 1 r STARTED 6838929 8.9gb x.x.x.x node1
index-20190606 1 p STARTED 6838933 8.7gb x.x.x.x node3
index-20190606 2 r STARTED 6838836 8.9gb x.x.x.x node7
index-20190606 2 p STARTED 6838767 8.9gb x.x.x.x node6
index-20190606 3 p STARTED 6841220 9.7gb x.x.x.x node1
index-20190606 3 r STARTED 6841357 9.4gb x.x.x.x node5
index-20190606 5 r STARTED 6838348 9.9gb x.x.x.x node2
index-20190606 5 p STARTED 6838423 9gb x.x.x.x node8
index-20190606 4 p STARTED 6839791 8.8gb x.x.x.x node4
index-20190606 4 r STARTED 6839791 9.2gb x.x.x.x node8
index-20190606 0 p STARTED 6840745 9.2gb x.x.x.x node7
index-20190606 0 r STARTED 6840678 8.8gb x.x.x.x node3

Sorry I was on mobile earlier and couldn't check the right syntax. The allocation explain command is this:

GET /_cluster/allocation/explain
{
  "index": "index-20190531",
  "shard": 3,
  "primary": false
}

Just to prevent confusion for people reading this thread now or later:
what you reported as your config is not actually how your cluster is configured.
Your ids are not room1 and room2 but 1 and 2.

I'm not saying that has anything to do with your issue, since the output you showed coming from your cluster via GET looks self-consistent. It's just not consistent with what you had posted before, which is confusing, that's all.

You probably just changed the settings in between the posts to use numbers instead of strings.
Continue with David's latest question; I don't want to derail your post or "squirrel!" any of you.


@martinr_ubi makes a good point. Maybe we're not seeing the true information because you consider it to be sensitive? It's ok if you want to redact some things, but please make it clear what, if anything, you've altered. It's all too easy to accidentally obscure the very thing that needs to be adjusted.

Yes indeed, I have made changes to the IPs, hostnames, index name and room id. Sorry, I was not careful enough to change everything consistently.

Here is the output from:
GET /_cluster/allocation/explain
{
  "index": "index-20190531",
  "shard": 3,
  "primary": false
}

{"index":"index-20190531","shard":3,"primary":false,"current_state":"started","current_node":{"id":"lWI-nJwuRlaa-uQ4HSSZ8A","name":"node6","transport_address":"x.x.x.x:9300","attributes":{"ml.machine_memory":"202844217344","ml.max_open_jobs":"20","xpack.installed":"true","ml.enabled":"true","roomid":"2"},"weight_ranking":1},"can_remain_on_current_node":"yes","can_rebalance_cluster":"yes","can_rebalance_to_other_node":"no","rebalance_explanation":"cannot rebalance as no target node exists that can both allocate this shard and improve the cluster balance","node_allocation_decisions":[{"node_id":"a4okbUUvSBSzEXj_2QD3BA","node_name":"node2","transport_address":"x.x.x.x:9300","node_attributes":{"ml.machine_memory":"202844217344","ml.max_open_jobs":"20","xpack.installed":"true","ml.enabled":"true","roomid":"2"},"node_decision":"no","weight_ranking":1,"deciders":[{"decider":"same_shard","decision":"NO","explanation":"the shard cannot be allocated to the same node on which a copy of the shard already exists [[index-20190531][3], node[a4okbUUvSBSzEXj_2QD3BA], [P], s[STARTED], a[id=dmv0skOtReCQ1GZesV6h9w]]"}]},{"node_id":"-Fy_OEmLRVOlZVnRs1i-9g","node_name":"node1","transport_address":"x.x.x.x:9300","node_attributes":{"ml.machine_memory":"202843451392","ml.max_open_jobs":"20","xpack.installed":"true","ml.enabled":"true","roomid":"1"},"node_decision":"worse_balance","weight_ranking":1},{"node_id":"3d-6McBhQGeYUKT8LR7LWw","node_name":"node3","transport_address":"x.x.x.x:9300","node_attributes":{"ml.machine_memory":"202843451392","ml.max_open_jobs":"20","xpack.installed":"true","ml.enabled":"true","roomid":"1"},"node_decision":"worse_balance","weight_ranking":1},{"node_id":"MVmuBaQ3QWalwvHxsz0_VA","node_name":"node7","transport_address":"x.x.x.x:9300","node_attributes":{"ml.machine_memory":"202843451392","ml.max_open_jobs":"20","xpack.installed":"true","ml.enabled":"true","roomid":"1"},"node_decision":"worse_balance","weight_ranking":1},{"node_id":"liJIj97NQZuT95-Frnrorg","node_name":"node8","transport_address":"x.x.x.x:9300","node_attributes":{"ml.machine_memory":"202844217344","ml.max_open_jobs":"20","xpack.installed":"true","ml.enabled":"true","roomid":"2"},"node_decision":"worse_balance","weight_ranking":1},{"node_id":"sZsWrCkwTbmus19eqs1otA","node_name":"node4","transport_address":"x.x.x.x:9300","node_attributes":{"ml.machine_memory":"202844217344","ml.max_open_jobs":"20","xpack.installed":"true","ml.enabled":"true","roomid":"2"},"node_decision":"worse_balance","weight_ranking":1},{"node_id":"tce9HEsSSR6Rv4aM3cNz1g","node_name":"node5","transport_address":"x.x.x.x:9300","node_attributes":{"ml.machine_memory":"202843451392","ml.max_open_jobs":"20","xpack.installed":"true","ml.enabled":"true","roomid":"1"},"node_decision":"worse_balance","weight_ranking":1}]}

I'll check the other index from 06/06 and report back here, as the cluster has been stable since that index was created.

The other index from 06/06 yields the same narrative for all of the shards:

"rebalance_explanation": "cannot rebalance as no target node exists that can both allocate this shard and improve the cluster balance",

      "explanation": "the shard cannot be allocated to the same node on which a copy of the shard already exists [[index-20190606][5], node[liJIj97NQZuT95-Frnrorg], [P], s[STARTED], a[id=EvLUyDqLSE-P8_PdSDhHZQ]]"

The telling line is this one: "can_remain_on_current_node": "yes", which tells us that the awareness allocator (and all the other allocation deciders) are happy. I'm mystified. I've traced through the code from 6.6.0 and can't see how this could be happening, if your config is exactly as you describe. I am concerned that we're losing something vital where you're obscuring the room attributes.

Hmm, now that I read from the top, wasn't it under our noses the whole time:

You're not putting the settings on the master nodes as well?

cluster.routing.allocation.awareness.attributes: roomid
cluster.routing.allocation.awareness.force.roomid.values: 1,2

Which I believe is critical.
I guess you should do that.


I'll add the original config here, but will remove it once you have validated that I have not mangled anything.

Ok I've taken a copy.

Actually, you might have it, Martin. I recall pondering that at the time I was setting it up, and then it dropped from my mind.

So I should add this on the master nodes:
cluster.routing.allocation.awareness.attributes: roomid
cluster.routing.allocation.awareness.force.roomid.values: 1,2 # but with my true roomid values

I'll give that a try next week, unless Dave finds anything more relevant.

Now, we need to set up shard allocation awareness by telling Elasticsearch which attributes to use. This can be configured in the elasticsearch.yml file on all master-eligible nodes, or it can be set (and changed) with the cluster-update-settings API.
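
As a sketch of that second option (not necessarily what was done here), the awareness settings from this thread could also be applied through the cluster settings API rather than elasticsearch.yml:

PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "roomid",
    "cluster.routing.allocation.awareness.force.roomid.values": "1,2"
  }
}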

Oh of course. It's the master node that makes the decision about where to allocate things, so it needs to know that allocation awareness is enabled for this attribute. When I tried to reproduce this, I had 8 nodes that were all both master-eligible and data nodes, but of course that's not what's going on here. In fact there's no need to add the cluster.routing.allocation.awareness.* settings on the data nodes at all; they just need the node.attr.roomid setting.
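
To summarise the split as a minimal sketch, using the roomid attribute and the example values from this thread: the awareness settings belong in elasticsearch.yml on the master-eligible nodes, and the data nodes only carry the attribute itself.

# elasticsearch.yml on the master-eligible nodes
cluster.routing.allocation.awareness.attributes: roomid
cluster.routing.allocation.awareness.force.roomid.values: 1,2

# elasticsearch.yml on a data node in room 1 (room 2 nodes use roomid: 2)
node.attr.roomid: 1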

I need to look into making an MR against the docs :slight_smile: That'd be cool.

The docs could use more clarity here. I just read it all again and it doesn't specifically say that the node attributes go on the data nodes while the awareness settings go on the master-eligible nodes.
(It does say it for awareness.attributes, but not in contrast with node.attr.)

And more importantly, it doesn't repeat any of that lower down, in the "Forced Awareness" section of the page. So in the end the force.$attrib.values setting is never referenced in terms of where it goes.

What I'm saying is that where each setting goes should be stated, or repeated, outside the normal narrative, to make it stand out as important.

For someone new to Elasticsearch who is just learning about node types, or who reads only the Forced Awareness section at a later time because they read the first section in the past... that leads to the current forum thread :face_with_monocle:

Let's be clear, the doc is not that bad, but I'm pretty sure @john275 had to read it at some point, or else he wouldn't even know about awareness. So somehow he missed it, maybe exactly because he only read the Forced Awareness section? Just thinking aloud.
