Master node fails after unexpected error while indexing document

Elasticsearch version: 7.0.1
Master nodes: 3
Data nodes: 31

When one of my indices fails to initialize its primary shard (in my case it is the .monitoring-es-7-2019.08.12 index), the elected master fails within a few minutes.

The likely reason the primary shard fails to initialize is a hard-disk failure on the data node, but it makes no sense for that to bring down the master node. The master also tries to remove the problematic data node, then adds it back roughly every minute. This loop continues until the master fails.

I understand that the master node will ping a data node and remove it if it fails. But not vice versa, right?
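To see why that primary shard stays unassigned, the cluster allocation explain API is useful. This is a sketch only: it assumes the cluster listens on localhost:9200; substitute your own host, and the index name is taken from the log below.

```shell
# Ask Elasticsearch why the primary shard of the monitoring index
# is not being allocated (7.x API; host and port are examples).
curl -s -H 'Content-Type: application/json' \
  -X GET 'http://localhost:9200/_cluster/allocation/explain' \
  -d '{
    "index": ".monitoring-es-7-2019.08.12",
    "shard": 0,
    "primary": true
  }'
```

The response contains a per-node explanation of the allocation decision, which should confirm whether the bad disk on the data node is what keeps the shard unassigned.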

Sample log:
{"log":"[2019-08-12T04:14:11,672][INFO ][o.e.c.s.ClusterApplierService] [dc17-esmaster-04] removed {{dc17-esdata-02}{gFQlBesvQxaIZy4Sfx6Xtg}{1Zue7DvTSTWoBSdoPH3c9Q}{dc17-esdata-02}{10.36.60.55:9302}{ml.machine_memory=405543784448, rack=1, ml.max_open_jobs=20, xpack.installed=true},}, term: 18685, version: 98825, reason: ApplyCommitRequest{term=18685, version=98825, sourceNode={dc17-esmaster-02}{nE5cqi4OQKui5VSWz1hW7g}{1DV1Umc6RPOYvU_lBozVuA}{dc17-esmaster-02}{10.36.60.56:9300}{ml.machine_memory=405543784448, rack=2, ml.max_open_jobs=20, xpack.installed=true}}\n","stream":"stdout","time":"2019-08-12T04:14:11.672496264Z"}
{"log":"[2019-08-12T04:14:59,996][WARN ][o.e.x.m.e.l.LocalExporter] [dc17-esmaster-04] unexpected error while indexing monitoring document\n","stream":"stdout","time":"2019-08-12T04:15:00.000309393Z"}
{"log":"org.elasticsearch.xpack.monitoring.exporter.ExportException: UnavailableShardsException[[.monitoring-es-7-2019.08.12][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[.monitoring-es-7-2019.08.12][0]] containing [index {[.monitoring-es-7-2019.08.12][_doc][faAIhGwBLZ0qFg9pa-o2], source[{\"cluster_uuid\":\"qPAXvLX3T3Gdj2hnq1xHhg\",\"timestamp\":\"2019-08-12T04:13:59.984Z\",\"interval_ms\":10000,\"type\":\"node_stats\",\"source_node\":{\"uuid\":\"LSS3ms9gSlmisxOgLmQZXw\",\"host\":\"dc17-esmaster-04\",\"transport_address\":\"10.36.60.58:9300\",\"ip\":\"10.36.60.58\",\"name\":\"dc17-esmaster-04\",\"timestamp\":\"2019-08-12T04:13:59.984Z\"},\"node_stats\":{\"node_id\":\"LSS3ms9gSlmisxOgLmQZXw\",\"node_master\":false,\"mlockall\":true,\"indices\":{\"docs\":{\"count\":0},\"store\":{\"size_in_bytes\":0},\"indexing\":{\"index_total\":0,\"index_time_in_millis\":0,\"throttle_time_in_millis\":0},\"search\":{\"query_total\":0,\"query_time_in_millis\":0},\"query_cache\":{\"memory_size_in_bytes\":0,\"hit_count\":0,\"miss_count\":0,\"evictions\":0},\"fielddata\":{\"memory_size_in_bytes\":0,\"evictions\":0},\"segments\":{\"count\":0,\"memory_in_bytes\":0,\"terms_memory_in_bytes\":0,\"stored_fields_memory_in_bytes\":0,\"term_vectors_memory_in_bytes\":0,\"norms_memory_in_bytes\":0,\"points_memory_in_bytes\":0,\"doc_values_memory_in_bytes\":0,\"index_writer_memory_in_bytes\":0,\"version_map_memory_in_bytes\":0,\"fixed_bit_set_memory_in_bytes\":0},\"request_cache\":{\"memory_size_in_bytes\":0,\"evictions\":0,\"hit_count\":0,\"miss_count\":0}},\"os\":{\"cpu\":{\"load_average\":{\"1m\":7.72,\"5m\":8.81,\"15m\":7.97}}},\"process\":{\"open_file_descriptors\":1502,\"max_file_descriptors\":65536,\"cpu\":{\"percent\":0}},\"jvm\":{\"mem\":{\"heap_used_in_bytes\":1823671544,\"heap_used_percent\":7,\"heap_max_in_bytes\":25525551104},\"gc\":{\"collectors\":{\"young\":{\"collection_count\":9,\"collection_time_in_millis\":753},\"old\":{\"collection_count\":1,\"collect
ion_time_in_millis\":651}}}},\"thread_pool\":{\"generic\":{\"threads\":66,\"queue\":0,\"rejected\":0},\"get\":{\"threads\":0,\"queue\":0,\"rejected\":0},\"management\":{\"threads\":5,\"queue\":0,\"rejected\":0},\"search\":{\"threads\":0,\"queue\":0,\"rejected\":0},\"watcher\":{\"threads\":0,\"queue\":0,\"rejected\":0},\"write\":{\"threads\":0,\"queue\":0,\"rejected\":0}},\"fs\":{\"total\":{\"total_in_bytes\":53660876800,\"free_in_bytes\":24523497472,\"available_in_bytes\":24523497472},\"io_stats\":{\"total\":{\"operations\":37738,\"read_operations\":95,\"write_operations\":37643,\"read_kilobytes\":1208,\"write_kilobytes\":543691}}}}}]}]]]\n","stream":"stdout","time":"2019-08-12T04:15:00.000344459Z"}
{"log":"\u0009at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$throwExportException$2(LocalBulk.java:125) ~[x-pack-monitoring-7.0.1.jar:7.0.1]\n","stream":"stdout","time":"2019-08-12T04:15:00.000387048Z"}

In which area are you trying to add the node?

@steve2 I don't understand your question. The failed data node was added back automatically. All cluster nodes are in the same datacenter and on the same network.

One more thought: my master nodes also accept indexing requests, with port 9200 open. When a data node gets stuck, could the master consider its own indexing requests timed out and kill itself? Does that make any sense?

Yes. If you have dedicated master nodes, they should not serve client requests.
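For reference, a dedicated master-eligible node in Elasticsearch 7.x can be configured with the role settings below in elasticsearch.yml. This is a minimal sketch of the idea, not your exact config:

```yaml
# elasticsearch.yml on a dedicated master node (7.x role settings)
node.master: true   # eligible to be elected master
node.data: false    # holds no shards
node.ingest: false  # runs no ingest pipelines
node.ml: false      # runs no machine-learning jobs
```

With that in place, point clients and load balancers only at the data (or coordinating) nodes, so the masters never receive indexing or search traffic on port 9200.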

I have now made the master nodes dedicated. Hopefully the elected master won't be killed if any data node gets stuck or fails.