I have a 7-node cluster, with 4 data nodes and 3 master-eligible nodes.
Each data node has 64 GB of RAM (with a 30 GB heap), 8 CPUs, and a 2 TB SSD.
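As a sanity check on the sizing, a minimal sketch below compares the heap against the commonly cited Elasticsearch guidelines (heap at most ~50% of RAM, and below the ~32 GB compressed-oops cutoff; the exact cutoff is JVM-dependent):

```python
# Sanity-check data-node heap sizing against common Elasticsearch guidelines:
# heap should be at most ~50% of RAM, and below the ~32 GB compressed-oops cutoff.
RAM_GB = 64
HEAP_GB = 30
COMPRESSED_OOPS_CUTOFF_GB = 32  # approximate; the exact threshold depends on the JVM

heap_fraction = HEAP_GB / RAM_GB
print(f"heap is {heap_fraction:.0%} of RAM")  # 47% of RAM
print(f"below compressed-oops cutoff: {HEAP_GB < COMPRESSED_OOPS_CUTOFF_GB}")  # True
```

So the per-node heap settings themselves look within the usual recommendations.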
All 4 data nodes keep going down with out-of-memory errors, within a few hours of each restart.
I reduced my ingestion load from 20k docs/second to 15k and then 10k docs/second, but it didn't help much. I kept my search rate at 0, apart from the Kibana monitoring dashboard refreshing every 30s to show the state of the cluster.
Following are my cluster stats:
```json
{
  "_nodes" : { "total" : 7, "successful" : 7, "failed" : 0 },
  "cluster_name" : "mtr-ng1",
  "cluster_uuid" : "C9jjlhCmR9yeMwukeQyZZA",
  "timestamp" : 1553487063305,
  "status" : "green",
  "indices" : {
    "count" : 196,
    "shards" : {
      "total" : 1534,
      "primaries" : 767,
      "replication" : 1.0,
      "index" : {
        "shards" : { "min" : 2, "max" : 10, "avg" : 7.826530612244898 },
        "primaries" : { "min" : 1, "max" : 5, "avg" : 3.913265306122449 },
        "replication" : { "min" : 1.0, "max" : 1.0, "avg" : 1.0 }
      }
    },
    "docs" : { "count" : 9609557961, "deleted" : 47530194 },
    "store" : { "size_in_bytes" : 2525518088368 },
    "fielddata" : { "memory_size_in_bytes" : 135728, "evictions" : 0 },
    "query_cache" : {
      "memory_size_in_bytes" : 4884394,
      "total_count" : 120970,
      "hit_count" : 14068,
      "miss_count" : 106902,
      "cache_size" : 595,
      "cache_count" : 651,
      "evictions" : 56
    },
    "completion" : { "size_in_bytes" : 0 },
    "segments" : {
      "count" : 20164,
      "memory_in_bytes" : 6746448528,
      "terms_memory_in_bytes" : 4794965934,
      "stored_fields_memory_in_bytes" : 1225705688,
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory_in_bytes" : 1583616,
      "points_memory_in_bytes" : 667648162,
      "doc_values_memory_in_bytes" : 56545128,
      "index_writer_memory_in_bytes" : 69417260,
      "version_map_memory_in_bytes" : 636406,
      "fixed_bit_set_memory_in_bytes" : 159360,
      "max_unsafe_auto_id_timestamp" : 1553485454779,
      "file_sizes" : { }
    }
  },
  "nodes" : {
    "count" : { "total" : 7, "data" : 4, "coordinating_only" : 0, "master" : 3, "ingest" : 3 },
    "versions" : [ "6.6.2" ],
    "os" : {
      "available_processors" : 72,
      "allocated_processors" : 72,
      "names" : [ { "name" : "Linux", "count" : 7 } ],
      "pretty_names" : [ { "pretty_name" : "CentOS Linux 7 (Core)", "count" : 7 } ],
      "mem" : {
        "total_in_bytes" : 424374198272,
        "free_in_bytes" : 82914578432,
        "used_in_bytes" : 341459619840,
        "free_percent" : 20,
        "used_percent" : 80
      }
    },
    "process" : {
      "cpu" : { "percent" : 0 },
      "open_file_descriptors" : { "min" : 423, "max" : 3719, "avg" : 2171 }
    },
    "jvm" : {
      "max_uptime_in_millis" : 320148670,
      "versions" : [ {
        "version" : "1.8.0_65",
        "vm_name" : "Java HotSpot(TM) 64-Bit Server VM",
        "vm_version" : "25.65-b01",
        "vm_vendor" : "Oracle Corporation",
        "count" : 7
      } ],
      "mem" : { "heap_used_in_bytes" : 20870601960, "heap_max_in_bytes" : 218467926016 },
      "threads" : 661
    },
    "fs" : {
      "total_in_bytes" : 7685746163712,
      "free_in_bytes" : 4969893056512,
      "available_in_bytes" : 4589853052928
    },
    "plugins" : [ ],
    "network_types" : {
      "transport_types" : { "security4" : 7 },
      "http_types" : { "security4" : 7 }
    }
  }
}
```
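To put the stats above in perspective, here is a minimal sketch deriving a few ratios from the numbers in that JSON (values copied from the output; the shards-per-node figure assumes all 1534 shards sit on the 4 data nodes):

```python
# A few derived numbers from the _cluster/stats output above.
shards_total = 1534
data_nodes = 4
segments_memory_bytes = 6_746_448_528
segment_count = 20_164

shards_per_node = shards_total / data_nodes
segments_memory_gib = segments_memory_bytes / 2**30
segments_per_shard = segment_count / shards_total

print(f"shards per data node: {shards_per_node}")          # 383.5
print(f"segments memory: {segments_memory_gib:.2f} GiB")   # 6.28 GiB cluster-wide
print(f"avg segments per shard: {segments_per_shard:.1f}")  # 13.1
```

So each data node carries close to 400 shards, which may be relevant to the heap pressure even though the reported segment memory itself is modest.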
Following are snapshots of the Kibana monitoring dashboard, taken while the problem was happening.