Strange memory leak in Elasticsearch

Hi, we have an Elasticsearch cluster.
The heap is 31 GB on each node, and total RAM is 128 GB per node.
Our RAM fills up over time and eventually runs out, and at that point we get cluster faults.
We see that "cached" grows constantly and, by the time RAM runs out, it equals "free".
[screenshot: node memory graph showing "cached" growing over time]
Does anyone have any idea why this might be happening and what's going on?

That's normal for Linux, it just means any spare memory is being used for temporary things like the filesystem cache.
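If you look at free on one of the nodes, the number that matters is "available", not "free"; the page cache shows up under "buff/cache" and the kernel gives it back as soon as applications need it. Illustrative only, not output from your nodes:

free -h
#               total        used        free      shared  buff/cache   available
# Mem:           125G         40G          2G        1.0G         83G         80G
# Swap:            0B          0B          0B
#
# "buff/cache" is mostly the filesystem (page) cache; it is reclaimed automatically,
# so a small "free" value by itself is not a sign of a leak.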

What's not normal is that we get cluster faults when RAM runs out. When we clear the cache, everything goes back to normal.

It is normal. How are you running your cluster? Bare-metal hardware? VMs? Cloud? Containers?

It's bare metal. We are now increasing the OS vfs_cache_pressure setting, and it seems to help.
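We apply it with sysctl, roughly like this (the value 200 is just what we are experimenting with, not a recommendation):

# raise vm.vfs_cache_pressure so the kernel reclaims dentry/inode caches more aggressively
sysctl -w vm.vfs_cache_pressure=200

# persist it across reboots (file name is arbitrary)
echo 'vm.vfs_cache_pressure = 200' > /etc/sysctl.d/99-vfs-cache-pressure.conf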

What sort of faults are you seeing?

Connection timeouts on the source side when it tries to write to Elasticsearch, "channel closed" messages in the logs of the Elasticsearch nodes, and sometimes nodes leaving the cluster.

Sharing those would be useful.
As would the output of _cluster/stats?human&pretty.
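Something like this against any of the nodes will do (adjust host and port, and add credentials if security is enabled):

curl -s 'http://localhost:9200/_cluster/stats?human&pretty'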

Faults are described here

_cluster/stats?human&pretty:

{
  "_nodes" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  },
  "cluster_name" : "h1",
  "cluster_uuid" : "s71YMgBeQhyRUyYVIzX2sg",
  "timestamp" : 1608177516321,
  "status" : "green",
  "indices" : {
    "count" : 646,
    "shards" : {
      "total" : 2851,
      "primaries" : 1770,
      "replication" : 0.6107344632768361,
      "index" : {
        "shards" : {
          "min" : 1,
          "max" : 10,
          "avg" : 4.413312693498452
        },
        "primaries" : {
          "min" : 1,
          "max" : 5,
          "avg" : 2.739938080495356
        },
        "replication" : {
          "min" : 0.0,
          "max" : 2.0,
          "avg" : 0.7631578947368421
        }
      }
    },
    "docs" : {
      "count" : 7128542981,
      "deleted" : 47654
    },
    "store" : {
      "size" : "9.3tb",
      "size_in_bytes" : 10305383653790,
      "reserved" : "0b",
      "reserved_in_bytes" : 0
    },
    "fielddata" : {
      "memory_size" : "206.1mb",
      "memory_size_in_bytes" : 216186160,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "2.3gb",
      "memory_size_in_bytes" : 2492159398,
      "total_count" : 194493637,
      "hit_count" : 4816136,
      "miss_count" : 189677501,
      "cache_size" : 81248,
      "cache_count" : 89063,
      "evictions" : 7815
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 45997,
      "memory" : "1.2gb",
      "memory_in_bytes" : 1338770608,
      "terms_memory" : "984mb",
      "terms_memory_in_bytes" : 1031834704,
      "stored_fields_memory" : "130.2mb",
      "stored_fields_memory_in_bytes" : 136538312,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "56.2mb",
      "norms_memory_in_bytes" : 59000704,
      "points_memory" : "0b",
      "points_memory_in_bytes" : 0,
      "doc_values_memory" : "106.2mb",
      "doc_values_memory_in_bytes" : 111396888,
      "index_writer_memory" : "182.9mb",
      "index_writer_memory_in_bytes" : 191827088,
      "version_map_memory" : "0b",
      "version_map_memory_in_bytes" : 0,
      "fixed_bit_set" : "67.4kb",
      "fixed_bit_set_memory_in_bytes" : 69072,
      "max_unsafe_auto_id_timestamp" : 1608145213074,
      "file_sizes" : { }
    },
    "mappings" : {
      "field_types" : [
        {
          "name" : "boolean",
          "count" : 1652,
          "index_count" : 389
        },
        {
          "name" : "date",
          "count" : 4004,
          "index_count" : 644
        },
        {
          "name" : "double",
          "count" : 4,
          "index_count" : 1
        },
        {
          "name" : "float",
          "count" : 623,
          "index_count" : 301
        },
        {
          "name" : "geo_point",
          "count" : 1057,
          "index_count" : 141
        },
        {
          "name" : "integer",
          "count" : 739,
          "index_count" : 67
        },
        {
          "name" : "ip",
          "count" : 2320,
          "index_count" : 404
        },
        {
          "name" : "keyword",
          "count" : 114137,
          "index_count" : 643
        },
        {
          "name" : "long",
          "count" : 9445,
          "index_count" : 627
        },
        {
          "name" : "nested",
          "count" : 169,
          "index_count" : 94
        },
        {
          "name" : "object",
          "count" : 31513,
          "index_count" : 621
        },
        {
          "name" : "text",
          "count" : 19739,
          "index_count" : 635
        }
      ]
    },
    "analysis" : {
      "char_filter_types" : [ ],
      "tokenizer_types" : [ ],
      "filter_types" : [ ],
      "analyzer_types" : [ ],
      "built_in_char_filters" : [ ],
      "built_in_tokenizers" : [ ],
      "built_in_filters" : [ ],
      "built_in_analyzers" : [ ]
    }
  },
  "nodes" : {
    "count" : {
      "total" : 3,
      "coordinating_only" : 0,
      "data" : 3,
      "ingest" : 3,
      "master" : 3,
      "remote_cluster_client" : 3
    },
    "versions" : [
      "7.9.1"
    ],
    "os" : {
      "available_processors" : 144,
      "allocated_processors" : 144,
      "names" : [
        {
          "name" : "Linux",
          "count" : 3
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "CentOS Linux 7 (Core)",
          "count" : 3
        }
      ],
      "mem" : {
        "total" : "376.8gb",
        "total_in_bytes" : 404655390720,
        "free" : "3gb",
        "free_in_bytes" : 3268222976,
        "used" : "373.8gb",
        "used_in_bytes" : 401387167744,
        "free_percent" : 1,
        "used_percent" : 99
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 11
      },
      "open_file_descriptors" : {
        "min" : 11423,
        "max" : 11803,
        "avg" : 11657
      }
    },
    "jvm" : {
      "max_uptime" : "2.6d",
      "max_uptime_in_millis" : 228990188,
      "versions" : [
        {
          "version" : "14.0.1",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "14.0.1+7",
          "vm_vendor" : "AdoptOpenJDK",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 3
        }
      ],
      "mem" : {
        "heap_used" : "55.2gb",
        "heap_used_in_bytes" : 59319432448,
        "heap_max" : "93gb",
        "heap_max_in_bytes" : 99857989632
      },
      "threads" : 1302
    },
    "fs" : {
      "total" : "21.4tb",
      "total_in_bytes" : 23627102601216,
      "free" : "12.1tb",
      "free_in_bytes" : 13314248302592,
      "available" : "11tb",
      "available_in_bytes" : 12113912176640
    },

...

Is it possible to limit Elasticsearch's cache usage by playing with the indices.fielddata.cache.size setting?

That only limits the caching that is done on the heap within Elasticsearch. To tune page cache usage at the operating system level you need to tune the operating system, not Elasticsearch.
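For completeness, if you do want to cap the fielddata cache on the heap, it is a static setting in elasticsearch.yml and needs a restart; something like this, with 20% purely as an example value and the path assuming a package install:

# example only: cap the fielddata cache at 20% of heap (static setting, restart required)
echo 'indices.fielddata.cache.size: 20%' >> /etc/elasticsearch/elasticsearch.yml
systemctl restart elasticsearch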

ok, thank you

Well, sometimes we also get this error:

[2020-12-21T14:59:24,543][DEBUG][o.e.m.j.JvmGcMonitorService] [h1-es01] [gc][511559] overhead, spent [133ms] collecting in the last [1s]

Full GC log here: https://github.com/NailBash/just_log/blob/main/gc.log
Does it mean that we have a problem with GC?

No, this is a DEBUG-level log message, which the manual says is "only intended for expert use".
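If you don't want to see those DEBUG lines at all, you can raise that logger back to INFO dynamically; this only changes the logging, not GC behaviour (host and credentials are placeholders):

curl -s -X PUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "logger.org.elasticsearch.monitor.jvm.JvmGcMonitorService": "INFO"
  }
}'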

Also, our flush latency is above 100 ms. Could that indicate a specific problem?

What do you mean by "flush latency"?

This metric

100ms sounds fairly normal for that metric.

Well, we get these messages:

[2020-12-23T14:11:12,026][INFO ][o.e.m.j.JvmGcMonitorService] [h1-es02] [gc][764620] overhead, spent [296ms] collecting in the last [1s]
[2020-12-23T14:12:10,536][INFO ][o.e.m.j.JvmGcMonitorService] [h1-es02] [gc][764678] overhead, spent [365ms] collecting in the last [1s]
[2020-12-23T14:12:12,537][INFO ][o.e.m.j.JvmGcMonitorService] [h1-es02] [gc][764680] overhead, spent [312ms] collecting in the last [1s]
[2020-12-23T14:13:55,376][DEBUG][o.e.m.j.JvmGcMonitorService] [h1-es02] [gc][764782] overhead, spent [117ms] collecting in the last [1s]

and then

[2020-12-23T14:14:30,931][DEBUG][o.e.c.c.LeaderChecker    ] [h1-es02] 1 consecutive failures (limit [cluster.fault_detection.leader_check.retry_count] is 3) with leader [{h1-es03}{Qshtg7-TQIyxeiccpkmlIA}{3Yj31P_rSmCMjymjRtyyEQ}{h1-es03ip}{h1-es03ip:9300}{dimr}]

Could it be related?
We periodically receive the second kind of message and can't understand why; the network seems to be fine. These messages usually appear at times of higher load on the cluster, and sometimes a node leaves the cluster. In the atop logs we don't see any real problem either: RAM, disk I/O and CPU all look fine.
P.S. Sometimes we don't even receive CPU and network information from the failed node via Zabbix.
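For reference, the fault-detection settings that the LeaderChecker message refers to can be checked like this (localhost:9200 is a placeholder):

# show the effective cluster.fault_detection.* values, including defaults
curl -s 'http://localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty' \
  | grep 'cluster.fault_detection'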