cluster.routing.allocation.same_shard.host doesn't seem to work properly

Elasticsearch version (bin/elasticsearch --version): 6.3.0

Plugins installed: [ingest-geoip]

JVM version (java -version): openjdk version "1.8.0_171"

OS version (uname -a if on a Unix-like system): Linux elk-ela-1f 4.15.0-22-generic #24-Ubuntu SMP Wed May 16 12:15:17 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:
I have set up 4 nodes on one machine, elk-ela-1f, using elastic/ansible-elasticsearch.
Each node has cluster.routing.allocation.same_shard.host: true set to prevent copies of the same shard from being allocated on the same host. But when I check the shard allocation with head, I can see that copies of the same shard have been allocated on the same host, just on different nodes.

data-1/elasticsearch.yml:cluster.routing.allocation.same_shard.host: true
data-2/elasticsearch.yml:cluster.routing.allocation.same_shard.host: true
data-3/elasticsearch.yml:cluster.routing.allocation.same_shard.host: true
data-4/elasticsearch.yml:cluster.routing.allocation.same_shard.host: true

Below is the response for GET _nodes/elk-ela-1f-data-1,elk-ela-1f-data-2,elk-ela-1f-data-3,elk-ela-1f-data-4/stats, to confirm that the nodes share the same host:
https://pastebin.com/TtTYBE3D

And here is the shard allocation in written form:

ngx-2018.06.20 4 p STARTED 29733 16.9mb 10.10.10.45 elk-ela-1f-data-2
ngx-2018.06.20 4 r STARTED 29740 16.9mb 10.10.10.43 elk-ela-1d-data-2
ngx-2018.06.20 7 p STARTED 29633 16.8mb 10.10.10.44 elk-ela-1e-data-1
ngx-2018.06.20 7 r STARTED 29615 16.8mb 10.10.10.46 elk-ela-1g-data-2
ngx-2018.06.20 5 p STARTED 29749   17mb 10.10.10.47 elk-ela-1h-data-1
ngx-2018.06.20 5 r STARTED 29762 16.9mb 10.10.10.44 elk-ela-1e-data-4
ngx-2018.06.20 3 p STARTED 29641   17mb 10.10.10.47 elk-ela-1h-data-3
ngx-2018.06.20 3 r STARTED 29638 16.8mb 10.10.10.45 elk-ela-1f-data-1
ngx-2018.06.20 9 p STARTED 29728 16.9mb 10.10.10.43 elk-ela-1d-data-4
ngx-2018.06.20 9 r STARTED 29726   17mb 10.10.10.44 elk-ela-1e-data-3
ngx-2018.06.20 6 p STARTED 29655 16.9mb 10.10.10.47 elk-ela-1h-data-2
ngx-2018.06.20 6 r STARTED 29650   17mb 10.10.10.43 elk-ela-1d-data-1
ngx-2018.06.20 2 p STARTED 29596   17mb 10.10.10.43 elk-ela-1d-data-3
ngx-2018.06.20 2 r STARTED 29575   17mb 10.10.10.44 elk-ela-1e-data-2
ngx-2018.06.20 8 p STARTED 29748   17mb 10.10.10.46 elk-ela-1g-data-2
ngx-2018.06.20 8 r STARTED 29748 33.9mb 10.10.10.46 elk-ela-1g-data-3
ngx-2018.06.20 1 p STARTED 29612   17mb 10.10.10.45 elk-ela-1f-data-4
ngx-2018.06.20 1 r STARTED 29609   17mb 10.10.10.45 elk-ela-1f-data-3
ngx-2018.06.20 0 p STARTED 29312 16.9mb 10.10.10.46 elk-ela-1g-data-1
ngx-2018.06.20 0 r STARTED 29313 16.7mb 10.10.10.44 elk-ela-1e-data-3

As you can see, the following shards are allocated on the same host:

ngx-2018.06.20 1 p STARTED 29612   17mb 10.10.10.45 elk-ela-1f-data-4
ngx-2018.06.20 1 r STARTED 29609   17mb 10.10.10.45 elk-ela-1f-data-3
ngx-2018.06.20 8 p STARTED 29748   17mb 10.10.10.46 elk-ela-1g-data-2
ngx-2018.06.20 8 r STARTED 29748 33.9mb 10.10.10.46 elk-ela-1g-data-3
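
The same check can be done mechanically instead of by eye. Below is a minimal sketch, assuming the listing has the eight whitespace-separated columns shown above (index, shard, prirep, state, docs, store, ip, node); the function name find_same_host_copies is made up for illustration and is not part of any Elasticsearch client.

```python
from collections import defaultdict


def find_same_host_copies(cat_shards_text):
    """Return {(index, shard): {ip: [node, ...]}} for shard copies sharing an IP."""
    by_shard = defaultdict(lambda: defaultdict(list))
    for line in cat_shards_text.splitlines():
        parts = line.split()
        if len(parts) < 8:
            continue  # skip blank lines and rows without ip/node columns
        index, shard, _prirep, _state, _docs, _store, ip, node = parts[:8]
        by_shard[(index, shard)][ip].append(node)

    conflicts = {}
    for key, hosts in by_shard.items():
        # A conflict is any IP that holds more than one copy of the same shard.
        shared = {ip: nodes for ip, nodes in hosts.items() if len(nodes) > 1}
        if shared:
            conflicts[key] = shared
    return conflicts


# Four rows copied from the listing above: shard 1 shares a host, shard 0 does not.
sample = """\
ngx-2018.06.20 1 p STARTED 29612   17mb 10.10.10.45 elk-ela-1f-data-4
ngx-2018.06.20 1 r STARTED 29609   17mb 10.10.10.45 elk-ela-1f-data-3
ngx-2018.06.20 0 p STARTED 29312 16.9mb 10.10.10.46 elk-ela-1g-data-1
ngx-2018.06.20 0 r STARTED 29313 16.7mb 10.10.10.44 elk-ela-1e-data-3
"""
print(find_same_host_copies(sample))
```

Grouping by IP rather than by node name is deliberate here, since the node names differ even when the host is the same.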

bump

cluster.routing.allocation.same_shard.host: true needs to be set up on the master node, not the data nodes.
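
Besides putting the setting in the master's elasticsearch.yml, it may also be possible to apply it cluster-wide through the cluster settings API. The sketch below only builds the request body for PUT /_cluster/settings; it assumes the setting is dynamically updatable in the version in use, and the helper name same_shard_host_update is hypothetical.

```python
import json


def same_shard_host_update(enabled=True, persistent=True):
    """Build a body for PUT /_cluster/settings toggling the same-host check.

    Assumption: the setting can be updated dynamically on this version.
    """
    scope = "persistent" if persistent else "transient"
    return {scope: {"cluster.routing.allocation.same_shard.host": enabled}}


# Print the JSON that would be sent as the request body.
print(json.dumps(same_shard_host_update(), indent=2))
```

The body can be sent with any HTTP client; a persistent update survives a full cluster restart, while a transient one does not.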

Looking at GET _nodes - it looks like all 4 nodes are data and ingest. How is that possible? I thought you had to have at least one master node - or am I misunderstanding/reading something incorrectly, or is there additional information missing?

I have set it up on all nodes. I have only two master nodes.
Here’s the structure of the infrastructure:

Machine 1

1 x master node

3 x data node

Machine 2

1 x master node

3 x data node

Machine 3

4 x data node

Machine 4

4 x data node

Machine 5

4 x data node

Machine N

4 x data node

Shouldn't it still work?

Why are the IPs shown in the 2 listings different?

ngx-2018.06.20 1 p STARTED 29612   17mb 10.11.6.45 elk-ela-1f-data-4

_nodes:

"name": "elk-ela-1f-data-4",
      "transport_address": "10.10.10.45:9303",
      "host": "10.10.10.45",

Sorry, that's a typo in my paste. I wanted to obfuscate the IPs but didn't do it in the last paste.

Can you get the cluster allocation explain API output for shard 1 of index ngx-2018.06.20 ?
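
For reference, the allocation explain API (GET /_cluster/allocation/explain) takes a small JSON body naming the index, the shard number, and whether the primary or a replica is meant. A minimal sketch of the two bodies needed here follows; allocation_explain_body is an illustrative helper, not a client library function.

```python
import json


def allocation_explain_body(index, shard, primary):
    """Build the JSON body for the cluster allocation explain API."""
    return {"index": index, "shard": shard, "primary": primary}


# One request for the primary copy of shard 1, one for its replica.
for primary in (True, False):
    print(json.dumps(allocation_explain_body("ngx-2018.06.20", 1, primary)))
```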

Hi! Here's the output for primary shard explain:
https://pastebin.com/uJMhmAyb

Output of the allocation explain API for the replica shard:
https://pastebin.com/3KVLaC15

It looks like the cluster.routing.allocation.same_shard.host setting is not active.

Can you provide the output of the following request when sent to the current master node: /_cluster/settings?include_defaults=true?
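
Once that response is in hand, the value in effect can be read out of it. The sketch below assumes the usual three top-level sections (transient, persistent, defaults) and handles both the nested form and the flat_settings=true form; the sample response is a trimmed, hypothetical one and the helper names are made up for illustration.

```python
def lookup(settings, dotted_key):
    """Find dotted_key in one settings section, flat or nested; None if absent."""
    if dotted_key in settings:  # flat_settings=true form
        return settings[dotted_key]
    node = settings
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def effective_setting(response, dotted_key):
    """Return (section, value) from the first section that defines the key."""
    # Transient settings take precedence over persistent ones, then defaults.
    for section in ("transient", "persistent", "defaults"):
        value = lookup(response.get(section, {}), dotted_key)
        if value is not None:
            return section, value
    return None, None


# Hypothetical, trimmed response used only to exercise the helpers.
sample_response = {
    "persistent": {},
    "transient": {},
    "defaults": {
        "cluster": {"routing": {"allocation": {"same_shard": {"host": "false"}}}}
    },
}
print(effective_setting(sample_response, "cluster.routing.allocation.same_shard.host"))
```

If the key only shows up under defaults with a value of false, the setting is not active on the node that answered the request.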

Well, that's very embarrassing. It turns out that in my playbook all the data nodes had the necessary configuration for same_shard allocation, but the master didn't. I shouldn't have trusted my memory and should have double-checked the master configs. Very sorry for such a dumb debugging experience, and a big thank you for pointing that out :slight_smile: Case closed :slight_smile:
