Node does not match global include filters

I have a cluster running an admittedly older Elasticsearch, version 2.4.6. We're upgrading soon, but for now we need to keep this version running.

We have 3 master nodes, 1 search node, 6 data nodes. There are a couple of billion documents amounting to only a few TB of data.

In the last few weeks, one of the nodes has started accumulating all of the data and, as a result, is running out of space.

I manually tried moving shards to another node but no matter what I do I end up with the same error:

    {"error":{"root_cause":[{"type":"remote_transport_exception","reason":"[master1][192.168.20.197:9300][cluster:admin/reroute]"}],"type":"illegal_argument_exception","reason":"[move_allocation] can't move [data-geo2][18], from {data4new}{jCBcHbjpTWyiUkX5lWdT0Q}{192.168.20.111}{192.168.20:9300}{ingest=true, master=false}, to {data1}{ZLm7mF--Sna7LNQ4XtGETw}{192.168.20.254}{192.168.20.254:9300}{ingest=true, master=false}, since its not allowed, reason: [NO(node does not match global include filters [_name:\"search\",_ip:\"192.168.20.194\",_id:\"5ZafZ5YARMynloDTevhQJw\"])][YES(shard is primary)][YES(target node version [2.4.6] is same or newer than source node version [2.4.6])][YES(shard is not allocated to same node or host)][YES(allocation disabling is ignored)][YES(allocation disabling is ignored)][YES(no snapshots are currently running)][YES(enough disk for shard on node, free: [368.1gb])][YES(below shard recovery limit of [2])][YES(no allocation awareness enabled)][YES(total shard limit disabled: [index: -1, cluster: -1] <= 0)]"},"status":400}

I've run:

    curl -XPUT localhost:9200/_cluster/settings -H 'Content-Type: application/json' -d '{
      "transient" : {
        "cluster.routing.allocation.include._ip" : "192.168.20.254",
        "cluster.routing.allocation.include._id" : "ZLm7mF--Sna7LNQ4XtGETw",
        "cluster.routing.allocation.include._name" : "data1"
      }
    }';echo

for every node in the cluster and I get back:

    {"acknowledged":true,"persistent":{},"transient":{"cluster":{"routing":{"allocation":{"include":{"_name":"data1","_id":"ZLm7mF--Sna7LNQ4XtGETw","_ip":"192.168.20.254"}}}}}}

and I've set cluster.routing.allocation.enable to "all", yet I still get the same error that the "node does not match global include filters".
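
For reference, setting that looks roughly like this (a sketch; it is shown here as a transient setting, which is an assumption about how it was applied):

    curl -XPUT localhost:9200/_cluster/settings -H 'Content-Type: application/json' -d '{
      "transient" : {
        "cluster.routing.allocation.enable" : "all"
      }
    }';echo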

Any ideas why?

Here's my node details if it helps:

{
  "cluster_name" : "ourcluster",
  "nodes" : {
    "5ZafZ5YARMynloDTevhQJw" : {
      "name" : "search",
      "transport_address" : "192.168.20.194:9300",
      "host" : "192.168.20.194",
      "ip" : "192.168.20.194",
      "version" : "2.4.6",
      "build" : "5376dca",
      "http_address" : "192.168.20.194:9200",
      "attributes" : {
        "data" : "false",
        "ingest" : "false",
        "master" : "false"
      },
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 13383,
        "mlockall" : false
      }
    },
    "jCBcHbjpTWyiUkX5lWdT0Q" : {
      "name" : "data4new",
      "transport_address" : "192.168.20.111:9300",
      "host" : "192.168.20.111",
      "ip" : "192.168.20.111",
      "version" : "2.4.6",
      "build" : "5376dca",
      "http_address" : "192.168.20.111:9200",
      "attributes" : {
        "ingest" : "true",
        "master" : "false"
      },
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 25935,
        "mlockall" : false
      }
    },
    "_GH0uZhrTCW1y9G617WZ9Q" : {
      "name" : "data3new",
      "transport_address" : "192.168.20.62:9300",
      "host" : "192.168.20.62",
      "ip" : "192.168.20.62",
      "version" : "2.4.6",
      "build" : "5376dca",
      "http_address" : "192.168.20.62:9200",
      "attributes" : {
        "ingest" : "true",
        "master" : "true"
      },
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 467,
        "mlockall" : false
      }
    },
    "WbVQxtWfRhSjuolVLM4OCQ" : {
      "name" : "data2new",
      "transport_address" : "192.168.20.186:9300",
      "host" : "192.168.20.186",
      "ip" : "192.168.20.186",
      "version" : "2.4.6",
      "build" : "5376dca",
      "http_address" : "192.168.20.186:9200",
      "attributes" : {
        "ingest" : "true",
        "master" : "false"
      },
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 29041,
        "mlockall" : false
      }
    },
    "Dji8AgGFT0-5lCQc_DDGNA" : {
      "name" : "data2",
      "transport_address" : "192.168.20.118:9300",
      "host" : "192.168.20.118",
      "ip" : "192.168.20.118",
      "version" : "2.4.6",
      "build" : "5376dca",
      "http_address" : "192.168.20.118:9200",
      "attributes" : {
        "ingest" : "true",
        "master" : "false"
      },
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 23215,
        "mlockall" : false
      }
    },
    "ZLm7mF--Sna7LNQ4XtGETw" : {
      "name" : "data1",
      "transport_address" : "192.168.20.254:9300",
      "host" : "192.168.20.254",
      "ip" : "192.168.20.254",
      "version" : "2.4.6",
      "build" : "5376dca",
      "http_address" : "192.168.20.254:9200",
      "attributes" : {
        "ingest" : "true",
        "master" : "false"
      },
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 18914,
        "mlockall" : false
      }
    },
    "S_L14pV2R4uWw_DOaVJ01g" : {
      "name" : "master1new",
      "transport_address" : "192.168.20.248:9300",
      "host" : "192.168.20.248",
      "ip" : "192.168.20.248",
      "version" : "2.4.6",
      "build" : "5376dca",
      "http_address" : "192.168.20.248:9200",
      "attributes" : {
        "ingest" : "true",
        "master" : "true"
      },
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 6314,
        "mlockall" : false
      }
    },
    "BNSCgGJvS-av_7BaYJEhaA" : {
      "name" : "data1new",
      "transport_address" : "192.168.20.152:9300",
      "host" : "192.168.20.152",
      "ip" : "192.168.20.152",
      "version" : "2.4.6",
      "build" : "5376dca",
      "http_address" : "192.168.20.152:9200",
      "attributes" : {
        "ingest" : "true",
        "master" : "false"
      },
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 31108,
        "mlockall" : false
      }
    },
    "C7_VVeliSu6BnQuN5fsGFg" : {
      "name" : "master1",
      "transport_address" : "192.168.20.197:9300",
      "host" : "192.168.20.197",
      "ip" : "192.168.20.197",
      "version" : "2.4.6",
      "build" : "5376dca",
      "http_address" : "192.168.20.197:9200",
      "attributes" : {
        "data" : "false",
        "ingest" : "false",
        "master" : "true"
      },
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 13773,
        "mlockall" : false
      }
    }
  }
}

@secumind
From the error message it looks like you are trying to move a shard from 192.168.20.111 to 192.168.20.254, and that move is failing.

  • Your cluster settings command will try to move all shards to 192.168.20.254. Is that what you are expecting, or do you want to move shards only for the index data-geo2? In the latter case, use index-level routing configuration (see the sketch after this list).
  • There is no need to use _ip, _id, and _name together. You can use any one of them, preferably _name.
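
A minimal sketch of an index-level include filter, assuming you only wanted to pin data-geo2 to data1 (the node name here is purely for illustration, not a recommendation):

    curl -XPUT localhost:9200/data-geo2/_settings -H 'Content-Type: application/json' -d '{
      "index.routing.allocation.include._name" : "data1"
    }';echo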

Can you post the data-geo2 index settings, particularly index.routing.allocation?
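
Something like this should return them (pretty just formats the JSON for readability):

    curl -XGET 'localhost:9200/data-geo2/_settings?pretty'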

Thank you for the reply. I was using _ip, _id, and _name as last-ditch options. I don't understand why the cluster as a whole is no longer self-balancing. In any case, here are the settings for that particular index.

{
  "data-geo2" : {
    "settings" : {
      "index" : {
        "refresh_interval" : "180s",
        "number_of_shards" : "24",
        "merge" : {
          "policy" : {
            "max_merged_segment" : "10gb"
          }
        },
        "creation_date" : "1588300312718",
        "number_of_replicas" : "0",
        "uuid" : "prBJtnsYRiO94jC9TO67AA",
        "version" : {
          "created" : "2040699"
        }
      }
    }
  }
}

I've checked the other indices; none of them seem to have index.routing.allocation set. Is this wrong?

Edit: I think the auto-balancing problem and the failures in the manual shard moves I'm attempting come down to the same thing: the "node does not match global include filters" message. That is why I was trying to manually include the IP, node name, and ID in the routing allocation.

Thanks for posting index settings.

In general, routing.allocation is not required at either the index level or the cluster level; ES takes care of moving shards as needed. There are a few specific cases where I use it:

  • To decommission a node, I add it to the cluster-level routing allocation exclude settings so that all shards move off that node (sketched after this list).
  • For a hot-warm architecture I use index-level include settings, so all hot indices remain on one set of nodes and older, read-only indices move to another set of nodes.
  • It's not supported in 2.4.6, but newer versions of ES provide a shrink API that reduces the number of shards in an index. The shrink requires all shards of that index to sit on a single node first, so an index-level include filter is useful in this scenario.
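
As an example of the decommissioning case, a sketch of the exclude setting (the IP here is just one of the data nodes from the listing above, used purely for illustration):

    curl -XPUT localhost:9200/_cluster/settings -H 'Content-Type: application/json' -d '{
      "transient" : {
        "cluster.routing.allocation.exclude._ip" : "192.168.20.118"
      }
    }';echo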

The error seems to indicate that cluster.routing.allocation.include was set to _name:"search", _ip:"192.168.20.194", _id:"5ZafZ5YARMynloDTevhQJw" when the move operation was performed. This could be in either the transient or the persistent settings; in general, check all cluster settings.

You can check both the transient and persistent settings with GET _cluster/settings.
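
A minimal form of that request (pretty just formats the JSON so any include filters are easier to spot):

    curl -XGET 'localhost:9200/_cluster/settings?pretty'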

There is a possibility that cluster.routing.allocation.include was set to a single IP, which would result in ES moving all shards to a single node. You can also check the logs on the master node for cluster setting changes.

I think all of this stems from decommissioning a node a few weeks ago and bringing in a replacement. I went ahead and ran:

  "transient" :{
      "cluster.routing.allocation.include._id" : "",
      "cluster.routing.allocation.include._name" : "",
      "cluster.routing.allocation.include._ip" : ""
   }
}';echo

That cleared all of the include entries at the cluster level, and now the system is moving data around freely; the timeouts caused by the one overloaded node have been resolved. I will be glad to upgrade this part of the system soon, because I think a lot of the issues we experience are related to cluster routing problems that appear to be fixed in the current version.

@Vinayak_Sapre thank you for your help!
