Node crashes, problem with leader check

Greetings all! Apologies in advance for my English.
Having trouble with an ES cluster. Sometimes data nodes crash (one, or more rarely several). A crash can be preceded by different events: force-merge, relocating shards, real-time data uploads to indices, creating replicas, but the logs are always the same (I attach a link to DEBUG-level logs below).
About the real-time bulk data uploads:

  1. every 10 minutes, 3 indices (15 shards each, no replicas, kNN)
  2. data size ~100k documents (during rush hour) = ~200-400 MB (depending on the index)
  3. an upload takes ~60 seconds (upload + flush), translog = 512 MB
  4. in total, 10-15 million documents are collected per day (1 day = 1 index); overall 270 indices with replicas (3 months, 3 different sources)
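For context, each upload is a single `_bulk` request body in NDJSON form. A minimal sketch of how such a body is built for the mapping below (the index name `faces-2021.06.21` and all field values are made up for illustration; the real vectors are 320-dimensional):

```python
import json

def build_bulk_body(index_name, docs):
    """Build an NDJSON body for the Elasticsearch _bulk API.

    `docs` is a list of dicts matching the index mapping
    (detect_id, cam_id, time_check, my_vector).
    """
    lines = []
    for doc in docs:
        # Action line, then the document source on the next line.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    # The _bulk API requires a trailing newline.
    return "\n".join(lines) + "\n"

# Two toy documents with a 3-dimensional vector for brevity.
body = build_bulk_body("faces-2021.06.21", [
    {"detect_id": 1, "cam_id": 7,
     "time_check": "2021-06-21 12:00:00", "my_vector": [0.1, 0.2, 0.3]},
    {"detect_id": 2, "cam_id": 7,
     "time_check": "2021-06-21 12:00:01", "my_vector": [0.4, 0.5, 0.6]},
])
print(body)
```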

Example index template:

{
  "settings": {
    "index": {
      "knn": "true",
      "number_of_shards": 15,
      "number_of_replicas": 0,
      "knn.space_type": "cosinesimil",
      "knn.algo_param.m": 48,
      "knn.algo_param.ef_construction": 8192,
      "knn.algo_param.ef_search": 8192,
      "max_result_window": 1000000
    }
  },
  "mappings": {
    "_source": {
      "excludes": [
        "my_vector"
      ]
    },
    "properties": {
      "detect_id": {
        "type": "long"
      },
      "cam_id": {
        "type": "integer"
      },
      "time_check": {
        "format": "yyyy-MM-dd HH:mm:ss",
        "store": true,
        "type": "date"
      },
      "my_vector": {
        "type": "knn_vector",
        "dimension": 320
      }
    }
  }
}

The same indices are then force-merged (at night, when the load is low).
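The nightly merge is one request per index; a sketch of what it looks like (index name made up, and I am assuming a node reachable on localhost:9200; `max_num_segments=1` is a common choice for read-only daily indices, not necessarily what everyone should use):

```shell
curl -X POST "http://localhost:9200/faces-2021.06.21/_forcemerge?max_num_segments=1"
```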

Cluster setup:
ES version = 7.10.0 (ODFE 1.12.0)
3 master nodes, 4 cores, RAM = 4 GB, heap = 2 GB, configuration file:

cluster.name: open-distro-cluster
node.name: es-masternode001
node.master: true
node.data: false
node.ingest: false
network.host: 10.250.9.100
discovery.seed_hosts: ["10.250.9.100", "10.250.9.102", "10.250.9.104"]
cluster.initial_master_nodes: ["10.250.9.100", "10.250.9.102", "10.250.9.104"]

3 coordinator nodes, 14 cores, RAM = 30 GB, heap = 15 GB, same configuration file (node.master: false)

70 data nodes, 18 cores, RAM = 368 GB, heap = 30 GB, same configuration file (node.master: false, node.data: true, node.ingest: true)

Changed cluster settings:

{
  "persistent": {
    "cluster": {
      "routing": {
        "rebalance": {
          "enable": "none"
        },
        "allocation": {
          "allow_rebalance": "indices_all_active",
          "cluster_concurrent_rebalance": "15",
          "node_concurrent_recoveries": "2",
          "disk": {
            "threshold_enabled": "true",
            "watermark": {
              "low": "80%",
              "high": "85%"
            }
          },
          "enable": "all",
          "node_concurrent_outgoing_recoveries": "2"
        }
      },
      "metadata": {
        "perf_analyzer": {
          "state": "0"
        }
      }
    },
    "knn": {
      "algo_param": {
        "index_thread_qty": "4"
      },
      "memory": {
        "circuit_breaker": {
          "limit": "80%",
          "enabled": "true"
        }
      }
    }
  },
  "transient": {}
}

Master-node and data-node logs:
ES logs - Google Drive

First crash at "2021-06-21T12:12:59,356Z"; I manually restarted the ES service at "2021-06-21T12:47:46,314Z", and it was working normally again at "2021-06-21T12:53:54,577Z". At the same time two other nodes broke; around "2021-06-21T14:14:16,213Z" all nodes in the cluster were working normally again.

Second crash at "2021-06-21T22:35:17,319Z"; the node started working normally again at "2021-06-22T01:22:43,459Z" (without my doing anything).

My thoughts so far: it does not look like a network problem (every crash coincides with some activity beyond the real-time data uploads, and otherwise the cluster can run normally for a week or more). Monitoring in Zabbix shows no anomalies: CPU utilization is ~25-30%, and RAM utilization on each node depends on the size of its indices. I can provide logs of other crashes, but the situation is the same everywhere: many "leader check" failures, then the node starts working normally again… I do not know what the problem may be; I am grateful for any advice. Thanks!
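In case it helps anyone seeing the same "leader check" messages: the fault-detection checks that produce them are tunable. A sketch of raising them above the 7.x defaults (timeout 10s, retry_count 3); the values here are purely illustrative, not a recommendation, and these are static settings, so they go in elasticsearch.yml and need a node restart:

```yaml
# elasticsearch.yml on every node (illustrative values, not defaults)
cluster.fault_detection.leader_check.timeout: 30s
cluster.fault_detection.leader_check.retry_count: 5
cluster.fault_detection.follower_check.timeout: 30s
cluster.fault_detection.follower_check.retry_count: 5
```

Raising these only hides the symptom if the real cause is long GC pauses or an overloaded master, so it is worth checking those first.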

If this is Open Distro then you will need to ask them for assistance; it's a fork of the original Elasticsearch, and we are unable to help with the changes they have made.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.