Hello
I'm looking at the logs and such, but nothing really calls out why my Elasticsearch node is running out of memory and CPU. There is nothing huge size-wise and nothing consuming ALL of the CPU.
Where would be a good log to view?
Thank you.
How do you know it's running out of memory? Are you getting OOM errors in the logs? How are you monitoring CPU?
Memory usage is at 95% of the total memory.
Afterwards (I imagine once it runs out of memory) the CPU load goes up in a similar way.
It's all being monitored through Nagios.
What are the specs of the node? How much memory and CPU does it have?
How much memory is assigned to Elasticsearch? Do you have any OOM lines in the Elasticsearch logs?
What do you run on this machine? Only Elasticsearch, or anything else?
What are the specs of the node? How much memory and CPU does it have?
It's a Hyper-V VM running on a Failover Cluster. It has 4 cores and currently 32GB of RAM.
How much memory is assigned to Elasticsearch?
The Elasticsearch JVM heap is set to 16GB.
Do you have any OOM lines in the Elasticsearch logs?
Not one. It's something I've looked desperately for...
What do you run on this machine? Only Elasticsearch, or anything else?
This machine currently runs the Elastic Stack: Elasticsearch, Logstash and Kibana. When running top, the biggest consumer of memory and CPU is Elasticsearch.
Thank you for all your help.
If you aren't getting OOM and the heap use isn't >75%, but you are seeing OS memory (aka off heap) being used, then that is the OS caching commonly used files. This is normal behaviour.
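A rough way to see that on the host itself (nothing Elasticsearch-specific, just standard Linux tooling):

# "buff/cache" is the kernel page cache, which is reclaimed automatically
# under memory pressure, so "available" is the number worth alerting on,
# not "free"
free -m

# resident memory per process, largest first, to see what actually holds RAM
ps -eo pid,comm,rss --sort=-rss | head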
The thing is that for months this has never been an issue; CPU and memory usage have been OK. It's only been like this for the last week or two.
I also need some kind of proof that it's the OS caching commonly used files; alerts going off from one day to the next isn't that common, so I need a source for it.
What does free -m or similar show? What's your heap use at?
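If it's easier, heap and overall memory use can also be read straight from the node, for example (localhost and the default port assumed; add credentials if security is enabled):

# per-node heap and RAM usage as a quick point-in-time check
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max,ram.percent'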
[root@server /]# free -m
              total        used        free      shared  buff/cache   available
Mem:          31976       21515        4561        1111        5899        8960
Swap:          5119        2650        2469
[root@server /]# ./jstat -gc 104444
S0C S1C S0U S1U EC EU OC OU MC MU CCSC CCSU YGC YGCT FGC FGCT CGC CGCT GCT
0.0 32768.0 0.0 26671.5 851968.0 729088.0 15892480.0 12089611.0 131484.0 126934.4 16512.0 15111.6 195944 7301.451 0 0.000 14654 240.817 7542.268
I believe that is the information you were asking for.
I'm seeing this in the logs (note the timestamps, though):
[2021-08-24T10:46:39,250][INFO ][o.e.m.j.JvmGcMonitorService] [server] [gc][411943] overhead, spent [388ms] collecting in the last [1s]
[2021-08-24T10:48:55,464][INFO ][o.e.m.j.JvmGcMonitorService] [server] [gc][412078] overhead, spent [487ms] collecting in the last [1s]
[2021-08-24T10:49:07,862][INFO ][o.e.m.j.JvmGcMonitorService] [server] [gc][young][412090][195902] duration [709ms], collections [1]/[1.2s], total [709ms]/[2h], memory [12gb]->[11.5gb]/[16gb], all_pools {[young] [536mb]->[16mb]/[0b]}{[old] [11.4gb]->[11.4gb]/[16gb]}{[survivor] [62.3mb]->[39.5mb]/[0b]}
[2021-08-24T10:49:07,863][WARN ][o.e.m.j.JvmGcMonitorService] [server] [gc][412090] overhead, spent [709ms] collecting in the last [1.2s]
[2021-08-24T10:49:15,799][INFO ][o.e.m.j.JvmGcMonitorService] [server] [gc][young][412097][195906] duration [917ms], collections [1]/[1.8s], total [917ms]/[2h], memory [11.7gb]->[11.5gb]/[16gb], all_pools {[young] [280mb]->[0b]/[0b]}{[old] [11.4gb]->[11.4gb]/[16gb]}{[survivor] [59.5mb]->[68.6mb]/[0b]}
[2021-08-24T10:49:15,938][INFO ][o.e.m.j.JvmGcMonitorService] [server] [gc][412097] overhead, spent [917ms] collecting in the last [1.8s]
[2021-08-24T10:50:34,266][WARN ][o.e.m.j.JvmGcMonitorService] [server] [gc][young][412173][195946] duration [2.2s], collections [1]/[2.9s], total [2.2s]/[2h], memory [11.9gb]->[10.7gb]/[16gb], all_pools {[young] [416mb]->[0b]/[0b]}{[old] [11.5gb]->[10.6gb]/[16gb]}{[survivor] [29.2mb]->[29.7mb]/[0b]}
[2021-08-24T10:50:34,269][WARN ][o.e.m.j.JvmGcMonitorService] [server] [gc][412173] overhead, spent [2.2s] collecting in the last [2.9s]
[2021-08-24T11:08:15,341][WARN ][o.e.m.j.JvmGcMonitorService] [server] [gc][413227] overhead, spent [557ms] collecting in the last [1.1s]
[2021-08-24T11:08:26,363][INFO ][o.e.m.j.JvmGcMonitorService] [server] [gc][413238] overhead, spent [312ms] collecting in the last [1s]
[2021-08-24T11:09:28,925][INFO ][o.e.m.j.JvmGcMonitorService] [server] [gc][413300] overhead, spent [442ms] collecting in the last [1s]
[2021-08-24T11:10:32,369][WARN ][o.e.m.j.JvmGcMonitorService] [server] [gc][413363] overhead, spent [683ms] collecting in the last [1s]
[2021-08-24T11:12:18,165][INFO ][o.e.m.j.JvmGcMonitorService] [server] [gc][413467] overhead, spent [679ms] collecting in the last [1.5s]
[2021-08-24T11:12:51,359][WARN ][o.e.m.j.JvmGcMonitorService] [server] [gc][young][413499][196562] duration [1.1s], collections [1]/[2s], total [1.1s]/[2h], memory [11.5gb]->[10.9gb]/[16gb], all_pools {[young] [136mb]->[0b]/[0b]}{[old] [11.4gb]->[10.8gb]/[16gb]}{[survivor] [36mb]->[40mb]/[0b]}
[2021-08-24T11:12:51,359][WARN ][o.e.m.j.JvmGcMonitorService] [server] [gc][413499] overhead, spent [1.1s] collecting in the last [2s]
[2021-08-24T11:25:26,704][INFO ][o.e.m.j.JvmGcMonitorService] [server] [gc][414250] overhead, spent [326ms] collecting in the last [1s]
[2021-08-24T11:25:46,102][INFO ][o.e.m.j.JvmGcMonitorService] [server] [gc][414269] overhead, spent [609ms] collecting in the last [1.3s]
I am not sure if this is normal or not.
What is the output from the _cluster/stats?pretty&human API?
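Something like this should do it (host and port are just the defaults; adjust to your setup and add credentials if security is enabled):

# full cluster statistics, human-readable units
curl -s 'http://localhost:9200/_cluster/stats?human&pretty'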
This is currently the status:
"_nodes": {
"total": 1,
"successful": 1,
"failed": 0
},
"cluster_name": "elasticsearch",
"cluster_uuid": "g-123456-jursdfghw-x",
"timestamp": 1629797385796,
"status": "yellow",
"indices": {
"count": 4106,
"shards": {
"total": 4106,
"primaries": 4106,
"replication": 0.0,
"index": "@{shards=; primaries=; replication=}"
},
"docs": {
"count": 309368027,
"deleted": 4196
},
"store": {
"size": "211.4gb",
"size_in_bytes": 227037222741,
"reserved": "0b",
"reserved_in_bytes": 0
},
"fielddata": {
"memory_size": "0b",
"memory_size_in_bytes": 0,
"evictions": 0
},
"query_cache": {
"memory_size": "13.2kb",
"memory_size_in_bytes": 13584,
"total_count": 14292,
"hit_count": 313,
"miss_count": 13979,
"cache_size": 1,
"cache_count": 29,
"evictions": 28
},
"completion": {
"size": "0b",
"size_in_bytes": 0
},
"segments": {
"count": 27959,
"memory": "902mb",
"memory_in_bytes": 945838494,
"terms_memory": "749mb",
"terms_memory_in_bytes": 785437936,
"stored_fields_memory": "13.4mb",
"stored_fields_memory_in_bytes": 14121432,
"term_vectors_memory": "0b",
"term_vectors_memory_in_bytes": 0,
"norms_memory": "104.5mb",
"norms_memory_in_bytes": 109596992,
"points_memory": "0b",
"points_memory_in_bytes": 0,
"doc_values_memory": "34.9mb",
"doc_values_memory_in_bytes": 36682134,
"index_writer_memory": "317.8mb",
"index_writer_memory_in_bytes": 333304232,
"version_map_memory": "3.5mb",
"version_map_memory_in_bytes": 3751225,
"fixed_bit_set": "12.4mb",
"fixed_bit_set_memory_in_bytes": 13014296,
"max_unsafe_auto_id_timestamp": 1629764489972,
"file_sizes": ""
},
"mappings": {
"field_types": " "
},
"analysis": {
"char_filter_types": "",
"tokenizer_types": "",
"filter_types": "",
"analyzer_types": "",
"built_in_char_filters": "",
"built_in_tokenizers": "",
"built_in_filters": " ",
"built_in_analyzers": ""
}
},
"nodes": {
"count": {
"total": 1,
"coordinating_only": 0,
"data": 1,
"data_cold": 1,
"data_content": 1,
"data_hot": 1,
"data_warm": 1,
"ingest": 1,
"master": 1,
"ml": 1,
"remote_cluster_client": 1,
"transform": 1,
"voting_only": 0
},
"versions": [
"7.10.1"
],
"os": {
"available_processors": 4,
"allocated_processors": 4,
"names": "",
"pretty_names": "",
"mem": "@{total=31.2gb; total_in_bytes=33530023936; free=400.5mb; free_in_bytes=419991552; used=30.8gb; used_in_bytes=33110032384; free_percent=1; used_percent=9
9}"
},
"process": {
"cpu": "@{percent=79}",
"open_file_descriptors": "@{min=22870; max=22870; avg=22870}"
},
"jvm": {
"max_uptime": "4.8d",
"max_uptime_in_millis": 417449522,
"versions": "",
"mem": "@{heap_used=10.9gb; heap_used_in_bytes=11771068080; heap_max=16gb; heap_max_in_bytes=17179869184}",
"threads": 186
},
"fs": {
"total": "299.9gb",
"total_in_bytes": 322065928192,
"free": "85.1gb",
"free_in_bytes": 91467526144,
"available": "85.1gb",
"available_in_bytes": 91467526144
},
"plugins": [
],
"network_types": {
"transport_types": "@{security4=1}",
"http_types": "@{security4=1}"
},
"discovery_types": {
"single-node": 1
},
"packaging_types": [
"@{flavor=default; type=rpm; count=1}"
],
"ingest": {
"number_of_pipelines": 21,
"processor_stats": "@{conditional=; convert=; date=; foreach=; geoip=; grok=; gsub=; json=; lowercase=; pipeline=; remove=; rename=; script=; set=; user_agen
t=}"
}
}
}
The only thing that doesn't look right is the amount of shards; in all the index templates I set
"number_of_shards": "1",
so they SHOULD each be using one.
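The templates look roughly like this (the template name and index pattern below are placeholders, not the real ones):

# composable index template forcing a single primary shard per index
curl -s -X PUT 'http://localhost:9200/_index_template/my-logs-template' -H 'Content-Type: application/json' -d '{
  "index_patterns": ["my-logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 1
    }
  }
}'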
OK, then this relates to your other topic, "Failed to execute progress listener on query failure".
Yes yes, this is all mostly related. I felt that maybe I could get info from here to solve the other issue.
Why do you have over 4000 shards for 211GB of data???
It should be set to 1 shard.
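It is worth double-checking where those shards actually come from, for example (localhost assumed):

# primaries per index, largest first; with 4106 indices and 4106 shards this
# will most likely show "1" everywhere, i.e. the real problem is the number
# of indices, not the shards per index
curl -s 'http://localhost:9200/_cat/indices?v&h=index,pri,docs.count,store.size&s=pri:desc' | head -20

# total shards allocated to the node
curl -s 'http://localhost:9200/_cat/allocation?v'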
The only thing I can think of is that those are "older" indices, created before the index template was set to 1 shard, but... those are from months ago.
Shards are not free and contribute to heap usage and overhead. You have far too many and should look to reduce that dramatically.
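One way to reduce it is to consolidate old, small time-based indices into bigger ones, roughly like this (the index names are made up; verify the new index before deleting the originals):

# copy a batch of small daily indices into one monthly index
curl -s -X POST 'http://localhost:9200/_reindex' -H 'Content-Type: application/json' -d '{
  "source": { "index": ["logs-2021.05.01", "logs-2021.05.02"] },
  "dest":   { "index": "logs-2021.05" }
}'

# once the document counts match, delete the originals
curl -s -X DELETE 'http://localhost:9200/logs-2021.05.01,logs-2021.05.02'

Going forward, fewer, larger indices (weekly or monthly, or ILM rollover) keep the count from creeping back up.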