Several problems with my Elasticsearch cluster

Hello everyone,

I have set up an Elasticsearch cluster.
Two nodes are in one DC (node-1 and node-2) and one is in another DC (node-3).

Communication between the nodes goes over HTTPS.
Here is the Elasticsearch version on each of my nodes:

{
  "name" : "node-01",
  "cluster_name" : "cluster",
  "cluster_uuid" : "//////////////////////",
  "version" : {
    "number" : "6.8.0",
    "build_flavor" : "default",
    "build_type" : "deb",
    "build_hash" : "65b////",
    "build_date" : "2019-05-15T20:06:13.172855Z",
    "build_snapshot" : false,
    "lucene_version" : "7.7.0",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}
{
  "name" : "node-02",
  "cluster_name" : "cluster",
  "cluster_uuid" : "//////////////////////",
  "version" : {
    "number" : "6.8.0",
    "build_flavor" : "default",
    "build_type" : "deb",
    "build_hash" : "65b////",
    "build_date" : "2019-05-15T20:06:13.172855Z",
    "build_snapshot" : false,
    "lucene_version" : "7.7.0",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}
{
  "name" : "node-03",
  "cluster_name" : "cluster",
  "cluster_uuid" : "//////////////////////",
  "version" : {
    "number" : "6.8.0",
    "build_flavor" : "default",
    "build_type" : "deb",
    "build_hash" : "65b////",
    "build_date" : "2019-05-15T20:06:13.172855Z",
    "build_snapshot" : false,
    "lucene_version" : "7.7.0",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}

I have quite a few recurring problems in the logs. I would like to get rid of the [WARN] entries, both to have a clean cluster and to stop polluting the monitoring.

Here are the recurring errors.

For the first node:

[2020-01-31T13:43:01,318][WARN ][o.e.g.G.InternalReplicaShardAllocator] [Node-1] [.monitoring-kibana-6-2020.01.31][0]: failed to list shard for shard_store on node [l4lQuRv6e-BaHsOTQ]
[2020-01-31T13:43:01,319][WARN ][o.e.g.G.InternalReplicaShardAllocator] [Node-1] [.monitoring-kibana-6-2020.01.30][0]: failed to list shard for shard_store on node [l4lQuRv6e-BaHsOTQ]
[2020-01-31T13:43:01,320][WARN ][o.e.g.G.InternalReplicaShardAllocator] [Node-1] [nom d'un index sur ES][2]: failed to list shard for shard_store on node [l4lQuRv6e-BaHsOTQ]
[2020-01-31T13:43:01,321][WARN ][o.e.g.G.InternalReplicaShardAllocator] [Node-1] [nom d'un index sur ES][0]: failed to list shard for shard_store on node [l4lQuRv6e-BaHsOTQ]
[2020-01-31T13:43:01,322][WARN ][o.e.g.G.InternalReplicaShardAllocator] [Node-1] [nom d'un index sur ES][3]: failed to list shard for shard_store on node [l4lQuRv6e-BaHsOTQ]
[2020-01-31T13:43:01,323][WARN ][o.e.g.G.InternalReplicaShardAllocator] [Node-1] [.monitoring-kibana-6-2020.01.29][0]: failed to list shard for shard_store on node [l4lQuRv6e-BaHsOTQ]
[2020-01-31T13:43:01,324][WARN ][o.e.g.G.InternalReplicaShardAllocator] [Node-1] [.monitoring-es-6-2020.01.28][0]: failed to list shard for shard_store on node [l4lQuRv6e-BaHsOTQ]
[2020-01-31T13:43:01,324][WARN ][o.e.g.G.InternalReplicaShardAllocator] [Node-1] [nom d'un index sur ES][1]: failed to list shard for shard_store on node [l4lQuRv6e-BaHsOTQ]
[2020-01-31T13:43:01,325][WARN ][o.e.g.G.InternalReplicaShardAllocator] [Node-1] [nom d'un index sur ES][4]: failed to list shard for shard_store on node [l4lQuRv6e-BaHsOTQ]
[2020-01-31T13:43:01,326][WARN ][o.e.g.G.InternalReplicaShardAllocator] [Node-1] [nom d'un index sur ES][2]: failed to list shard for shard_store on node [l4lQuRv6e-BaHsOTQ]
[2020-01-31T13:43:01,331][WARN ][o.e.g.G.InternalReplicaShardAllocator] [Node-1] [.monitoring-es-6-2020.01.27][0]: failed to list shard for shard_store on node [l4lQuRv6e-BaHsOTQ]
[2020-01-31T13:43:01,332][WARN ][o.e.g.G.InternalReplicaShardAllocator] [Node-1] [.monitoring-es-6-2020.01.26][0]: failed to list shard for shard_store on node [l4lQuRv6e-BaHsOTQ]
[2020-01-31T13:43:01,333][WARN ][o.e.g.G.InternalReplicaShardAllocator] [Node-1] [.monitoring-kibana-6-2020.01.26][0]: failed to list shard for shard_store on node [l4lQuRv6e-BaHsOTQ]
[2020-01-31T13:43:01,334][WARN ][o.e.g.G.InternalReplicaShardAllocator] [Node-1] [.monitoring-es-6-2020.01.25][0]: failed to list shard for shard_store on node [l4lQuRv6e-BaHsOTQ]
[2020-01-31T13:43:01,335][WARN ][o.e.g.G.InternalReplicaShardAllocator] [Node-1] [.monitoring-kibana-6-2020.01.25][0]: failed to list shard for shard_store on node [l4lQuRv6e-BaHsOTQ]
[2020-01-31T13:43:01,336][WARN ][o.e.g.G.InternalReplicaShardAllocator] [Node-1] [.monitoring-kibana-6-2020.01.24][0]: failed to list shard for shard_store on node [l4lQuRv6e-BaHsOTQ]
[2020-01-31T13:43:01,337][WARN ][o.e.g.G.InternalReplicaShardAllocator] [Node-1] [.monitoring-es-6-2020.01.24][0]: failed to list shard for shard_store on node [l4lQuRv6e-BaHsOTQ]
[2020-01-31T13:43:01,338][WARN ][o.e.g.G.InternalReplicaShardAllocator] [Node-1] [nom d'un index sur ES_03][3]: failed to list shard for shard_store on node [l4lQuRv6e-BaHsOTQ]
[2020-01-31T13:43:01,339][WARN ][o.e.g.G.InternalReplicaShardAllocator] [Node-1] [nom d'un index sur ES_03][2]: failed to list shard for shard_store on node [l4lQuRv6e-BaHsOTQ]
[2020-01-31T13:43:01,340][WARN ][o.e.g.G.InternalReplicaShardAllocator] [Node-1] [nom d'un index sur ES_03][0]: failed to list shard for shard_store on node [l4lQuRv6e-BaHsOTQ]
[2020-01-31T13:43:01,350][WARN ][o.e.g.G.InternalReplicaShardAllocator] [Node-1] [archivage-config][3]: failed to list shard for shard_store on node [l4lQuRv6e-BaHsOTQ]
[2020-01-31T13:43:01,351][WARN ][o.e.g.G.InternalReplicaShardAllocator] [Node-1] [.kibana_1][0]: failed to list shard for shard_store on node [l4lQuRv6e-BaHsOTQ]
[2020-01-31T13:43:01,352][WARN ][o.e.g.G.InternalReplicaShardAllocator] [Node-1] [.kibana_task_manager][0]: failed to list shard for shard_store on node [l4lQuRv6e-BaHsOTQ]
[2020-01-31T13:43:01,355][WARN ][o.e.g.G.InternalReplicaShardAllocator] [Node-1] [nom d'un index sur ES_03][0]: failed to list shard for shard_store on node [l4lQuRv6e-BaHsOTQ]
[2020-01-31T13:43:01,348][WARN ][o.e.g.G.InternalReplicaShardAllocator] [Node-1] [nom d'un index sur ES_03][1]: failed to list shard for shard_store on node [l4lQuRv6e-BaHsOTQ]
[2020-01-31T13:43:01,349][WARN ][o.e.g.G.InternalReplicaShardAllocator] [Node-1] [nom d'un index sur ES_03][4]: failed to list shard for shard_store on node [l4lQuRv6e-BaHsOTQ]
[2020-01-31T13:43:01,369][WARN ][o.e.g.G.InternalReplicaShardAllocator] [Node-1] [nom d'un index sur ES_03][2]: failed to list shard for shard_store on node [l4lQuRv6e-BaHsOTQ]
[2020-01-31T13:43:01,371][WARN ][o.e.g.G.InternalReplicaShardAllocator] [Node-1] [searchguard][0]: failed to list shard for shard_store on node [l4lQuRv6e-BaHsOTQ]
[2020-01-31T13:43:01,465][WARN ][o.e.d.z.PublishClusterStateAction] [Node-1] publishing cluster state with version [46646] failed for the following nodes: [[{Node-3}{l4lQuRv6e-BaHsOTQ}{tx8Yiai7QZalfilMPzPj9g}{192.168.112.173}{192.168.112.173:9300}{ml.machine_memory=8339591168, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}]]

For the second node:

[2020-01-31T13:41:26,401][WARN ][r.suppressed             ] [Node-2] path: /_template/.management-beats, params: {include_type_name=true, name=.management-beats}
[2020-01-31T13:41:38,959][WARN ][r.suppressed             ] [Node-2] path: /_template/.management-beats, params: {include_type_name=true, name=.management-beats}
[2020-01-31T13:41:46,505][WARN ][r.suppressed             ] [Node-2] path: /_template/.management-beats, params: {include_type_name=true, name=.management-beats}
[2020-01-31T13:41:56,570][WARN ][r.suppressed             ] [Node-2] path: /_template/.management-beats, params: {include_type_name=true, name=.management-beats}
[2020-01-31T13:42:11,709][WARN ][r.suppressed             ] [Node-2] path: /_template/.management-beats, params: {include_type_name=true, name=.management-beats}
[2020-01-31T13:42:26,746][WARN ][r.suppressed             ] [Node-2] path: /_template/.management-beats, params: {include_type_name=true, name=.management-beats}
[2020-01-31T13:42:39,325][WARN ][r.suppressed             ] [Node-2] path: /_template/.management-beats, params: {include_type_name=true, name=.management-beats}
[2020-01-31T13:42:46,888][WARN ][r.suppressed             ] [Node-2] path: /_template/.management-beats, params: {include_type_name=true, name=.management-beats}

And finally, here are the WARN entries from the third node:

[2020-01-31T13:39:51,694][WARN ][o.e.x.m.MonitoringService] [Node-3] monitoring execution failed
[2020-01-31T13:40:10,894][WARN ][r.suppressed             ] [Node-3] path: /_xpack/monitoring/_bulk, params: {system_id=kibana, system_api_version=6, interval=10000ms}
[2020-01-31T13:40:40,891][WARN ][r.suppressed             ] [Node-3] path: /_xpack/monitoring/_bulk, params: {system_id=kibana, system_api_version=6, interval=10000ms}
[2020-01-31T13:41:01,696][WARN ][o.e.x.m.MonitoringService] [Node-3] monitoring execution failed
[2020-01-31T13:41:10,894][WARN ][r.suppressed             ] [Node-3] path: /_xpack/monitoring/_bulk, params: {system_id=kibana, system_api_version=6, interval=10000ms}
[2020-01-31T13:41:40,894][WARN ][r.suppressed             ] [Node-3] path: /_xpack/monitoring/_bulk, params: {system_id=kibana, system_api_version=6, interval=10000ms}
[2020-01-31T13:42:10,901][WARN ][r.suppressed             ] [Node-3] path: /_xpack/monitoring/_bulk, params: {system_id=kibana, system_api_version=6, interval=10000ms}
[2020-01-31T13:42:11,696][WARN ][o.e.x.m.MonitoringService] [Node-3] monitoring execution failed
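
To see at a glance which component produces most of the warnings, the log lines can be grouped by node and logger. The following is a minimal sketch, assuming the standard `[timestamp][LEVEL][logger] [node] message` layout visible in the excerpts above; the function name `summarize_warnings` and the `sample` list are illustrative only.

```python
import re
from collections import Counter

# Header of an Elasticsearch 6.x log line:
# [timestamp][LEVEL][logger] [node] message
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]"
    r"\[(?P<level>\w+)\s*\]"
    r"\[(?P<logger>[^\]]+?)\s*\]"
    r"\s+\[(?P<node>[^\]]+)\]\s*(?P<msg>.*)$"
)

def summarize_warnings(lines):
    """Count WARN entries per (node, logger) pair."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("level") == "WARN":
            counts[(match.group("node"), match.group("logger"))] += 1
    return counts

# Three lines copied from the node 3 excerpt above.
sample = [
    "[2020-01-31T13:39:51,694][WARN ][o.e.x.m.MonitoringService] [Node-3] monitoring execution failed",
    "[2020-01-31T13:40:10,894][WARN ][r.suppressed             ] [Node-3] path: /_xpack/monitoring/_bulk, params: {system_id=kibana, system_api_version=6, interval=10000ms}",
    "[2020-01-31T13:41:01,696][WARN ][o.e.x.m.MonitoringService] [Node-3] monitoring execution failed",
]

for (node, logger), count in summarize_warnings(sample).most_common():
    print(f"{node} {logger}: {count}")
```

Feeding it a whole log file (for example by iterating over an open file object) should work the same way, since lines that do not match the header pattern are simply skipped.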

Thanks in advance for your help.

How did you design your cluster? What are the node roles? Could you share the configuration of your nodes?

Node 1:

#---------------------------------- Cluster -----------------------------------
#
cluster.name: cluster.es
#
# ------------------------------------ Node ------------------------------------
#
node.name: node_1
#
# ----------------------------------- Paths ------------------------------------
#
path.data: /var/lib/elasticsearch
#
path.logs: /var/log/elasticsearch
#
# ----------------------------------- Memory -----------------------------------
#
bootstrap.memory_lock: true
#
# ---------------------------------- Network -----------------------------------
#
network.host: 192.168.111.171
#
# --------------------------------- Discovery ----------------------------------
#
#discovery.zen.ping.unicast.hosts: ["VM-1", "VM-2"]
discovery.zen.ping.unicast.hosts: ["192.168.0.172", "192.168.0.173"]
#discovery.zen.ping.unicast.hosts.resolve_timeout: 30s
discovery.zen.fd.ping_timeout: 40s
discovery.zen.fd.ping_retries: 10
discovery.zen.minimum_master_nodes: 2
#
# --------------------------------- Premium features -------------------------
#
#Disable premium features
xpack.security.enabled: false
searchguard.enterprise_modules_enabled: false

# -------------------------------- SearchGuard -------------------------------

#SSL security on the transport layer (for SG administration and inter-node communication)
searchguard.ssl.transport.pemcert_filepath: /etc/elasticsearch/config/cert/es_1.pem
searchguard.ssl.transport.pemkey_filepath: /etc/elasticsearch/config/cert/es_1.key
searchguard.ssl.transport.pemtrustedcas_filepath: /etc/elasticsearch/config/cert/root-ca_1.pem


#Declare other nodes of the cluster
searchguard.nodes_dn:
- CN=es_2.toto-tata.com,OU=escluster,O=ES toto,DC=toto-tata,DC=com
- CN=es_3.toto-tata.com,OU=escluster,O=ES toto,DC=toto-tata,DC=com

#Enable hostname verification. Disable if node hostname does not match node certificate
searchguard.ssl.transport.enforce_hostname_verification: false
searchguard.ssl.transport.resolve_hostname: false

#Admin certificate declaration for Searchguard administration
searchguard.authcz.admin_dn:
- CN=admin.toto-tata.com,OU=escluster,O=ES toto,DC=toto-tata,DC=com

#
#
# -------------------------------------- SSL ----------------------------
#SSL security on the REST layer (End users, Kibana, etc.)
searchguard.ssl.http.enabled: true
searchguard.ssl.http.pemcert_filepath: /etc/elasticsearch/config/cert/toto-tata.crt
searchguard.ssl.http.pemkey_filepath: /etc/elasticsearch/config/cert/toto-tata.key
searchguard.ssl.http.pemtrustedcas_filepath: /etc/elasticsearch/config/cert/toto-tata-root-ca.crt

Node 2:

```
# ---------------------------------- Cluster -----------------------------------
#
cluster.name: cluster.es
#
# ------------------------------------ Node ------------------------------------
#
node.name: node_2
#
# ----------------------------------- Paths ------------------------------------
#
path.data: /var/lib/elasticsearch
#
path.logs: /var/log/elasticsearch
#
# ----------------------------------- Memory -----------------------------------
#
bootstrap.memory_lock: true
#
# ---------------------------------- Network -----------------------------------
#
network.host: 192.168.0.172
#
# --------------------------------- Discovery ----------------------------------
#
#discovery.zen.ping.unicast.hosts: ["VM-03", "VM-01"]
discovery.zen.ping.unicast.hosts: ["192.168.0.171", "192.168.0.173"]
#
# Prevent the "split brain" by configuring the majority of nodes (total number of master-eligible nodes / 2 + 1):
#
discovery.zen.minimum_master_nodes: 2
discovery.zen.fd.ping_timeout: 40s
discovery.zen.fd.ping_retries: 10
#
# ------------------------------ Premium Feature
xpack.security.enabled: false
searchguard.enterprise_modules_enabled: false
#
# -------------------------------- SearchGuard -------------------------------
#
#SSL security on the transport layer (for SG administration and inter-node communication)
searchguard.ssl.transport.pemcert_filepath: /etc/elasticsearch/config/cert/es_2.pem
searchguard.ssl.transport.pemkey_filepath: /etc/elasticsearch/config/cert/es_2.toto-toto.com.key
searchguard.ssl.transport.pemtrustedcas_filepath: /etc/elasticsearch/config/cert/root-ca.pem

#Declare other nodes of the cluster
searchguard.nodes_dn:
- CN=es_1.toto-toto.com,OU=escluster,O=ES toto,DC=toto-toto,DC=com
- CN=es_3.toto-toto.com,OU=escluster,O=ES toto,DC=toto-toto,DC=com

#Enable hostname verification. Disable if node hostname does not match node certificate
searchguard.ssl.transport.enforce_hostname_verification: false
searchguard.ssl.transport.resolve_hostname: false

#Admin certificate declaration for Searchguard administration
searchguard.authcz.admin_dn:
- CN=admin.toto-toto.com,OU=escluster,O=ES toto,DC=toto-toto,DC=com

#
#
# -------------------------------------- SSL ----------------------------
#SSL security on the REST layer (End users, Kibana, etc.)
searchguard.ssl.http.enabled: true
searchguard.ssl.http.pemcert_filepath: /etc/elasticsearch/config/cert/toto-toto.crt
searchguard.ssl.http.pemkey_filepath: /etc/elasticsearch/config/cert/toto-toto.key
searchguard.ssl.http.pemtrustedcas_filepath: /etc/elasticsearch/config/cert/toto-toto-root-ca.crt
```

Node 3:

```
# ---------------------------------- Cluster -----------------------------------
#
cluster.name: cluster.es
#
# ------------------------------------ Node ------------------------------------
#
node.name: node_3
#
# ----------------------------------- Paths ------------------------------------
#
path.data: /var/lib/elasticsearch
#
path.logs: /var/log/elasticsearch
#
# ----------------------------------- Memory -----------------------------------
#
bootstrap.memory_lock: true
#
# ---------------------------------- Network -----------------------------------
#
network.host: 192.168.0.173
#
# --------------------------------- Discovery ----------------------------------
#
#discovery.zen.ping.unicast.hosts: ["VM-3", "VM-2"]
discovery.zen.ping.unicast.hosts: ["192.168.0.171", "192.168.0.172"]
#
# Prevent the "split brain" by configuring the majority of nodes (total number of master-eligible nodes / 2 + 1):
#
discovery.zen.minimum_master_nodes: 2
discovery.zen.fd.ping_timeout: 40s
discovery.zen.fd.ping_retries: 10
#
# --------------------------------- Disable Premium Feature
#
xpack.security.enabled: false
searchguard.enterprise_modules_enabled: false

# -------------------------------- SearchGuard -------------------------------

#SSL security on the transport layer (for SG administration and inter-node communication)
searchguard.ssl.transport.pemcert_filepath: /etc/elasticsearch/config/cert/es_3.pem
searchguard.ssl.transport.pemkey_filepath: /etc/elasticsearch/config/cert/es_3.key
searchguard.ssl.transport.pemtrustedcas_filepath: /etc/elasticsearch/config/cert/root-ca.pem

#Declare other nodes of the cluster
searchguard.nodes_dn:
- CN=es_1.toto-tata.com,OU=escluster,O=ES toto,DC=toto-tata,DC=com
- CN=es_2.toto-tata.com,OU=escluster,O=ES toto,DC=toto-tata,DC=com

#Enable hostname verification. Disable if node hostname does not match node certificate
searchguard.ssl.transport.enforce_hostname_verification: false
searchguard.ssl.transport.resolve_hostname: false

#Admin certificate declaration for Searchguard administration
searchguard.authcz.admin_dn:
- CN=admin.toto-tata.com,OU=escluster,O=ES toto,DC=toto-tata,DC=com

#
#
# -------------------------------------- SSL ----------------------------
#SSL security on the REST layer (End users, Kibana, etc.)
searchguard.ssl.http.enabled: true
searchguard.ssl.http.pemcert_filepath: /etc/elasticsearch/config/cert/toto-tata.crt
searchguard.ssl.http.pemkey_filepath: /etc/elasticsearch/config/cert/toto-tata.key
searchguard.ssl.http.pemtrustedcas_filepath: /etc/elasticsearch/config/cert/toto-tata-root-ca.crt
```
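
As a side note on the `discovery.zen.minimum_master_nodes: 2` setting used in the three configurations: the comment in the config files gives the rule as (total number of master-eligible nodes / 2 + 1). Here is a small sketch of that arithmetic, assuming all three nodes keep the default master-eligible role (none of the configs sets a role explicitly); the function name is made up for the illustration.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Quorum rule from the config comment: (master-eligible nodes / 2) + 1,
    using integer division."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1

# With three master-eligible nodes the quorum is 2, which matches the
# value set in the configurations above.
for n in (1, 2, 3, 5):
    print(n, "->", minimum_master_nodes(n))
```

With a quorum of 2, the two nodes sharing a DC can still elect a master if node 3 becomes unreachable, whereas node 3 on its own cannot.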

Looking at your config, it seems you forgot to configure the role of each node.
A link that might help: https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-node.html

Yes, but by default the nodes organize the roles among themselves, and judging by the errors in the logs, this does not look like a role problem at first glance.

No. Configuring roles, especially for such a small cluster, is not useful in my opinion.

What do the following return?

GET /
GET /_cat/nodes?v
GET /_cat/health?v
GET /_cat/indices?v
GET /_cat/shards?v
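
If the Kibana console is not at hand, the same requests could be sent from a script. This is only a sketch: the base URL (host and the default HTTP port 9200), the CA file path, and the helper names `build_urls` and `fetch_all` are assumptions. Authentication is left out entirely, although a cluster protected by Search Guard will most likely require credentials or a client certificate.

```python
import ssl
import urllib.request

# The diagnostic endpoints requested above.
ENDPOINTS = [
    "/",
    "/_cat/nodes?v",
    "/_cat/health?v",
    "/_cat/indices?v",
    "/_cat/shards?v",
]

def build_urls(base_url):
    """Join a base URL such as https://host:9200 with each endpoint."""
    return [base_url.rstrip("/") + path for path in ENDPOINTS]

def fetch_all(base_url, ca_file=None, timeout=10):
    """Return {url: response body}. Authentication is deliberately left out."""
    # ca_file=None falls back to the system's default trusted CAs.
    context = ssl.create_default_context(cafile=ca_file)
    results = {}
    for url in build_urls(base_url):
        with urllib.request.urlopen(url, context=context, timeout=timeout) as resp:
            results[url] = resp.read().decode("utf-8")
    return results

# Hypothetical usage (host, port and CA path are assumptions):
# responses = fetch_all(
#     "https://192.168.0.171:9200",
#     ca_file="/etc/elasticsearch/config/cert/toto-tata-root-ca.crt",
# )
# for url, body in responses.items():
#     print(url, body, sep="\n")
```

Note that the default SSL context verifies the hostname, so connecting by IP address may fail if the certificate only lists DNS names.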

Please format your code and logs so that all of this is readable. I did it for you on your question and on your first reply. You can use the </> tool or markdown.

Hello,

Thanks for your replies. After investigation, the problem was that ES-03 had to go through a router that times out sessions lasting more than 2 hours.

I moved it to another network and the problem is solved. I will keep an eye on the other problems to see whether they were related.
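
For readers who cannot move the node to another network: when a router drops sessions that have been idle for too long, a common mitigation is to have TCP keepalive probes sent more often than the router's timeout. This is not what was done here, and it would not help if the router caps the total session duration rather than the idle time. The sketch below only illustrates the socket options involved, using a plain Python socket. The timing values are arbitrary assumptions, and Elasticsearch manages its own transport connections, so for the cluster itself the equivalent tuning would more likely be done at the operating system level.

```python
import socket

def enable_keepalive(sock, idle_seconds=600, interval_seconds=60, probes=5):
    """Turn on TCP keepalive so that an idle connection still sends probes."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket tuning options below are platform dependent
    # (they are available on Linux), hence the hasattr guards.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_seconds)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock

sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
print("keepalive enabled:", bool(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)))
sock.close()
```

On Linux the system-wide default idle time before the first probe (`net.ipv4.tcp_keepalive_time`) is 7200 seconds, which is exactly 2 hours, so lowering it below the router's timeout may be worth testing.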

Note, however, that we advise against having nodes spread across several different geographic zones.
If the latency between your nodes is higher than on a local network, it is better to use cross-cluster replication (commercial license).
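
To get a rough idea of the latency between two nodes, one option is to time how long it takes to open a TCP connection to the transport port. This is only an approximation of network latency, sketched under assumptions: the function name is illustrative, and the host and port in the usage comment (9300, the transport port shown in the node 1 log) must be adapted.

```python
import socket
import time

def tcp_connect_latency_ms(host, port, attempts=5, timeout=3.0):
    """Return the median time, in milliseconds, to open a TCP connection."""
    samples = []
    for _ in range(attempts):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=timeout):
            pass  # connection established; close it right away
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return samples[len(samples) // 2]

# Hypothetical usage from node 1 towards node 3:
# print(tcp_connect_latency_ms("192.168.0.173", 9300))
```

Comparing the result between two nodes in the same DC and two nodes in different DCs gives a first indication of whether the remote link is noticeably slower than the local network.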