Hi, we have an Elasticsearch cluster.
The heap is 31 GB on each node, and each node has 128 GB of RAM in total.
RAM fills up over time and eventually runs out, and at that point the cluster starts to fail.
We can see that "cached" keeps growing, and by the time memory runs out it has consumed everything that was "free".
Does anyone have any idea why this happens and what's going on?
That's normal for Linux; it just means any spare memory is being used for temporary purposes such as the filesystem cache.
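A quick way to see this (assuming a reasonably recent procps, which shows an "available" column) is:

# "buff/cache" is reclaimable page cache; "available" estimates what can be
# handed out without swapping
free -h

If "available" stays large while "free" shrinks, the kernel is simply using idle RAM for the filesystem cache and will hand it back under memory pressure.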
What's not normal is that when RAM runs out we get cluster faults. When we drop the cache, everything goes back to normal.
It is normal. How are you running your cluster? Bare-metal hardware? VMs? Cloud? Containers?
It's bare metal. We are now increasing the OS's vfs_cache_pressure setting, and it seems to help.
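Roughly what we are trying, for reference (the value 200 and the file name are just our experiment, not something we were told to use):

# show the current value (kernel default is 100)
sysctl vm.vfs_cache_pressure
# make the kernel reclaim dentry/inode caches more aggressively
sysctl -w vm.vfs_cache_pressure=200
# persist the change across reboots
echo 'vm.vfs_cache_pressure = 200' > /etc/sysctl.d/99-vfs-cache.conf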
What sort of faults are you seeing?
Connection timeouts on the source side when it tries to write to ES, "channel closed" messages in the Elasticsearch node logs, and sometimes nodes leaving the cluster.
Sharing those would be useful.
As would the output of _cluster/stats?human&pretty
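For reference, that can be pulled with something like (adjust host, port and auth for your cluster):

curl -s 'http://localhost:9200/_cluster/stats?human&pretty'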
Faults are described here
_cluster/stats?human&pretty:
{
"_nodes" : {
"total" : 3,
"successful" : 3,
"failed" : 0
},
"cluster_name" : "h1",
"cluster_uuid" : "s71YMgBeQhyRUyYVIzX2sg",
"timestamp" : 1608177516321,
"status" : "green",
"indices" : {
"count" : 646,
"shards" : {
"total" : 2851,
"primaries" : 1770,
"replication" : 0.6107344632768361,
"index" : {
"shards" : {
"min" : 1,
"max" : 10,
"avg" : 4.413312693498452
},
"primaries" : {
"min" : 1,
"max" : 5,
"avg" : 2.739938080495356
},
"replication" : {
"min" : 0.0,
"max" : 2.0,
"avg" : 0.7631578947368421
}
}
},
"docs" : {
"count" : 7128542981,
"deleted" : 47654
},
"store" : {
"size" : "9.3tb",
"size_in_bytes" : 10305383653790,
"reserved" : "0b",
"reserved_in_bytes" : 0
},
"fielddata" : {
"memory_size" : "206.1mb",
"memory_size_in_bytes" : 216186160,
"evictions" : 0
},
"query_cache" : {
"memory_size" : "2.3gb",
"memory_size_in_bytes" : 2492159398,
"total_count" : 194493637,
"hit_count" : 4816136,
"miss_count" : 189677501,
"cache_size" : 81248,
"cache_count" : 89063,
"evictions" : 7815
},
"completion" : {
"size" : "0b",
"size_in_bytes" : 0
},
"segments" : {
"count" : 45997,
"memory" : "1.2gb",
"memory_in_bytes" : 1338770608,
"terms_memory" : "984mb",
"terms_memory_in_bytes" : 1031834704,
"stored_fields_memory" : "130.2mb",
"stored_fields_memory_in_bytes" : 136538312,
"term_vectors_memory" : "0b",
"term_vectors_memory_in_bytes" : 0,
"norms_memory" : "56.2mb",
"norms_memory_in_bytes" : 59000704,
"points_memory" : "0b",
"points_memory_in_bytes" : 0,
"doc_values_memory" : "106.2mb",
"doc_values_memory_in_bytes" : 111396888,
"index_writer_memory" : "182.9mb",
"index_writer_memory_in_bytes" : 191827088,
"version_map_memory" : "0b",
"version_map_memory_in_bytes" : 0,
"fixed_bit_set" : "67.4kb",
"fixed_bit_set_memory_in_bytes" : 69072,
"max_unsafe_auto_id_timestamp" : 1608145213074,
"file_sizes" : { }
},
"mappings" : {
"field_types" : [
{
"name" : "boolean",
"count" : 1652,
"index_count" : 389
},
{
"name" : "date",
"count" : 4004,
"index_count" : 644
},
{
"name" : "double",
"count" : 4,
"index_count" : 1
},
{
"name" : "float",
"count" : 623,
"index_count" : 301
},
{
"name" : "geo_point",
"count" : 1057,
"index_count" : 141
},
{
"name" : "integer",
"count" : 739,
"index_count" : 67
},
{
"name" : "ip",
"count" : 2320,
"index_count" : 404
},
{
"name" : "keyword",
"count" : 114137,
"index_count" : 643
},
{
"name" : "long",
"count" : 9445,
"index_count" : 627
},
{
"name" : "nested",
"count" : 169,
"index_count" : 94
},
{
"name" : "object",
"count" : 31513,
"index_count" : 621
},
{
"name" : "text",
"count" : 19739,
"index_count" : 635
}
]
},
"analysis" : {
"char_filter_types" : [ ],
"tokenizer_types" : [ ],
"filter_types" : [ ],
"analyzer_types" : [ ],
"built_in_char_filters" : [ ],
"built_in_tokenizers" : [ ],
"built_in_filters" : [ ],
"built_in_analyzers" : [ ]
}
},
"nodes" : {
"count" : {
"total" : 3,
"coordinating_only" : 0,
"data" : 3,
"ingest" : 3,
"master" : 3,
"remote_cluster_client" : 3
},
"versions" : [
"7.9.1"
],
"os" : {
"available_processors" : 144,
"allocated_processors" : 144,
"names" : [
{
"name" : "Linux",
"count" : 3
}
],
"pretty_names" : [
{
"pretty_name" : "CentOS Linux 7 (Core)",
"count" : 3
}
],
"mem" : {
"total" : "376.8gb",
"total_in_bytes" : 404655390720,
"free" : "3gb",
"free_in_bytes" : 3268222976,
"used" : "373.8gb",
"used_in_bytes" : 401387167744,
"free_percent" : 1,
"used_percent" : 99
}
},
"process" : {
"cpu" : {
"percent" : 11
},
"open_file_descriptors" : {
"min" : 11423,
"max" : 11803,
"avg" : 11657
}
},
"jvm" : {
"max_uptime" : "2.6d",
"max_uptime_in_millis" : 228990188,
"versions" : [
{
"version" : "14.0.1",
"vm_name" : "OpenJDK 64-Bit Server VM",
"vm_version" : "14.0.1+7",
"vm_vendor" : "AdoptOpenJDK",
"bundled_jdk" : true,
"using_bundled_jdk" : true,
"count" : 3
}
],
"mem" : {
"heap_used" : "55.2gb",
"heap_used_in_bytes" : 59319432448,
"heap_max" : "93gb",
"heap_max_in_bytes" : 99857989632
},
"threads" : 1302
},
"fs" : {
"total" : "21.4tb",
"total_in_bytes" : 23627102601216,
"free" : "12.1tb",
"free_in_bytes" : 13314248302592,
"available" : "11tb",
"available_in_bytes" : 12113912176640
},
...
Is it possible to limit Elasticsearch's cache usage by playing with the indices.fielddata.cache.size setting?
That only limits the caching that is done on the heap within Elasticsearch. To tune page cache usage at the operating system level you need to tune the operating system, not Elasticsearch.
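For completeness, if you did want to cap the on-heap fielddata cache, it's a static setting in elasticsearch.yml (the 20% below is only an example value) and needs a node restart:

# upper bound for the fielddata cache as a share of the heap
indices.fielddata.cache.size: 20%

But it won't change the operating-system page cache behaviour you are describing.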
ok, thank you
Well, sometimes we also get this error:
[2020-12-21T14:59:24,543][DEBUG][o.e.m.j.JvmGcMonitorService] [h1-es01] [gc][511559] overhead, spent [133ms] collecting in the last [1s]
Full GC log here: https://github.com/NailBash/just_log/blob/main/gc.log
Does it mean that we have a problem with GC?
No, this is a DEBUG log, which the manual says is "only intended for expert use".
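For what it's worth, those "overhead" lines come from the GC monitor, which logs at DEBUG, INFO or WARN depending on the share of each interval spent collecting. If I recall the defaults correctly they are roughly as below (settable in elasticsearch.yml), so 133ms in 1s only just crosses the lowest threshold:

# percentage of time spent in GC at which each log level is used (defaults)
monitor.jvm.gc.overhead.debug: 10
monitor.jvm.gc.overhead.info: 25
monitor.jvm.gc.overhead.warn: 50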
Also, our flush latency is above 100 ms. Could that point to a specific problem?
What do you mean by "flush latency"?
100ms sounds fairly normal for that metric.
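If you are computing it from the index stats, the raw numbers are available with something like:

curl -s 'http://localhost:9200/_stats/flush?human&pretty'

where flush.total_time divided by flush.total gives a rough average time per flush.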
Well, we get these messages:
[2020-12-23T14:11:12,026][INFO ][o.e.m.j.JvmGcMonitorService] [h1-es02] [gc][764620] overhead, spent [296ms] collecting in the last [1s]
[2020-12-23T14:12:10,536][INFO ][o.e.m.j.JvmGcMonitorService] [h1-es02] [gc][764678] overhead, spent [365ms] collecting in the last [1s]
[2020-12-23T14:12:12,537][INFO ][o.e.m.j.JvmGcMonitorService] [h1-es02] [gc][764680] overhead, spent [312ms] collecting in the last [1s]
[2020-12-23T14:13:55,376][DEBUG][o.e.m.j.JvmGcMonitorService] [h1-es02] [gc][764782] overhead, spent [117ms] collecting in the last [1s]
and then
[2020-12-23T14:14:30,931][DEBUG][o.e.c.c.LeaderChecker ] [h1-es02] 1 consecutive failures (limit [cluster.fault_detection.leader_check.retry_count] is 3) with leader [{h1-es03}{Qshtg7-TQIyxeiccpkmlIA}{3Yj31P_rSmCMjymjRtyyEQ}{h1-es03ip}{h1-es03ip:9300}{dimr}]
Could it be related?
We periodically get messages like the second one and can't understand why; the network seems to be fine. These messages usually appear during periods of higher load on the cluster, and sometimes a node leaves the cluster. In the atop logs we don't see any real problem either: RAM, disk I/O and CPU all look fine.
P.S. Sometimes we don't even receive CPU and network information from the failed node via Zabbix.
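For context, the settings behind that leader-check message (as far as we understand them; these are the 7.x defaults, which we have not changed) are:

# the node checks the elected master every interval; a check fails after timeout,
# and only after retry_count consecutive failures does the node treat the master as faulty
cluster.fault_detection.leader_check.interval: 1s
cluster.fault_detection.leader_check.timeout: 10s
cluster.fault_detection.leader_check.retry_count: 3

So the "1 consecutive failures (limit ... is 3)" line means a single check failed; the node only acts on it after three failures in a row.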