Master node fails after unexpected error while indexing document

Elasticsearch version: 7.0.1
Master nodes: 3
Data nodes: 31

When one of my indices fails to initialize its primary shard (in my case it's the .monitoring-es-7-2019.08.12 index), the elected master fails within a few minutes.

The primary shard most likely fails to initialize because of a hard disk failure on the data node, but that shouldn't be enough to bring down the master node. The master also keeps removing the problematic data node and then adding it back roughly every minute, and this loop continues until the master fails.

I understand that the master node pings data nodes and removes them if they fail, but not the other way around, right?

Sample log:
{"log":"[2019-08-12T04:14:11,672][INFO ][o.e.c.s.ClusterApplierService] [dc17-esmaster-04] removed {{dc17-esdata-02}{gFQlBesvQxaIZy4Sfx6Xtg}{1Zue7DvTSTWoBSdoPH3c9Q}{dc17-esdata-02}{10.36.60.55:9302}{ml.machine_memory=405543784448, rack=1, ml.max_open_jobs=20, xpack.installed=true},}, term: 18685, version: 98825, reason: ApplyCommitRequest{term=18685, version=98825, sourceNode={dc17-esmaster-02}{nE5cqi4OQKui5VSWz1hW7g}{1DV1Umc6RPOYvU_lBozVuA}{dc17-esmaster-02}{10.36.60.56:9300}{ml.machine_memory=405543784448, rack=2, ml.max_open_jobs=20, xpack.installed=true}}\n","stream":"stdout","time":"2019-08-12T04:14:11.672496264Z"}
{"log":"[2019-08-12T04:14:59,996][WARN ][o.e.x.m.e.l.LocalExporter] [dc17-esmaster-04] unexpected error while indexing monitoring document\n","stream":"stdout","time":"2019-08-12T04:15:00.000309393Z"}
{"log":"org.elasticsearch.xpack.monitoring.exporter.ExportException: UnavailableShardsException[[.monitoring-es-7-2019.08.12][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[.monitoring-es-7-2019.08.12][0]] containing [index {[.monitoring-es-7-2019.08.12][_doc][faAIhGwBLZ0qFg9pa-o2], source[{\"cluster_uuid\":\"qPAXvLX3T3Gdj2hnq1xHhg\",\"timestamp\":\"2019-08-12T04:13:59.984Z\",\"interval_ms\":10000,\"type\":\"node_stats\",\"source_node\":{\"uuid\":\"LSS3ms9gSlmisxOgLmQZXw\",\"host\":\"dc17-esmaster-04\",\"transport_address\":\"10.36.60.58:9300\",\"ip\":\"10.36.60.58\",\"name\":\"dc17-esmaster-04\",\"timestamp\":\"2019-08-12T04:13:59.984Z\"},\"node_stats\":{\"node_id\":\"LSS3ms9gSlmisxOgLmQZXw\",\"node_master\":false,\"mlockall\":true,\"indices\":{\"docs\":{\"count\":0},\"store\":{\"size_in_bytes\":0},\"indexing\":{\"index_total\":0,\"index_time_in_millis\":0,\"throttle_time_in_millis\":0},\"search\":{\"query_total\":0,\"query_time_in_millis\":0},\"query_cache\":{\"memory_size_in_bytes\":0,\"hit_count\":0,\"miss_count\":0,\"evictions\":0},\"fielddata\":{\"memory_size_in_bytes\":0,\"evictions\":0},\"segments\":{\"count\":0,\"memory_in_bytes\":0,\"terms_memory_in_bytes\":0,\"stored_fields_memory_in_bytes\":0,\"term_vectors_memory_in_bytes\":0,\"norms_memory_in_bytes\":0,\"points_memory_in_bytes\":0,\"doc_values_memory_in_bytes\":0,\"index_writer_memory_in_bytes\":0,\"version_map_memory_in_bytes\":0,\"fixed_bit_set_memory_in_bytes\":0},\"request_cache\":{\"memory_size_in_bytes\":0,\"evictions\":0,\"hit_count\":0,\"miss_count\":0}},\"os\":{\"cpu\":{\"load_average\":{\"1m\":7.72,\"5m\":8.81,\"15m\":7.97}}},\"process\":{\"open_file_descriptors\":1502,\"max_file_descriptors\":65536,\"cpu\":{\"percent\":0}},\"jvm\":{\"mem\":{\"heap_used_in_bytes\":1823671544,\"heap_used_percent\":7,\"heap_max_in_bytes\":25525551104},\"gc\":{\"collectors\":{\"young\":{\"collection_count\":9,\"collection_time_in_millis\":753},\"old\":{\"collection_count\":1,\"collection_time_in_millis\":651}}}},\"thread_pool\":{\"generic\":{\"threads\":66,\"queue\":0,\"rejected\":0},\"get\":{\"threads\":0,\"queue\":0,\"rejected\":0},\"management\":{\"threads\":5,\"queue\":0,\"rejected\":0},\"search\":{\"threads\":0,\"queue\":0,\"rejected\":0},\"watcher\":{\"threads\":0,\"queue\":0,\"rejected\":0},\"write\":{\"threads\":0,\"queue\":0,\"rejected\":0}},\"fs\":{\"total\":{\"total_in_bytes\":53660876800,\"free_in_bytes\":24523497472,\"available_in_bytes\":24523497472},\"io_stats\":{\"total\":{\"operations\":37738,\"read_operations\":95,\"write_operations\":37643,\"read_kilobytes\":1208,\"write_kilobytes\":543691}}}}}]}]]]\n","stream":"stdout","time":"2019-08-12T04:15:00.000344459Z"}
{"log":"\u0009at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$throwExportException$2(LocalBulk.java:125) ~[x-pack-monitoring-7.0.1.jar:7.0.1]\n","stream":"stdout","time":"2019-08-12T04:15:00.000387048Z"}

In which area are you trying to add the node?

@steve2 I don't understand your question. The failed data node was added back automatically. All cluster nodes are in the same datacenter and on the same network.

One more thought: my master nodes are also accepting indexing requests, with port 9200 open. When the data node gets stuck, the master's own indexing requests time out and it kills itself. Does that make any sense?

Yes. If you have dedicated master nodes, they should not serve client requests. See the sketch below for the usual node role settings.
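As a minimal sketch for 7.x, a dedicated master-eligible node is typically configured in elasticsearch.yml like this:

node.master: true
node.data: false
node.ingest: false
node.ml: false
cluster.remote.connect: false

HTTP itself can't be disabled in 7.x, so also point clients and load balancers at the data or coordinating nodes rather than at port 9200 on the masters.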

I have now made the master nodes dedicated. Hopefully the elected master won't be killed if any data node gets stuck or fails.
