Hello
I'm looking at the logs and such, but nothing really calls out why my Elasticsearch node is running out of memory and CPU. There is nothing huge size-wise and nothing consuming ALL of the CPU.
Where would be a good log to view?
Thank you.
How do you know it's running out of memory? Are you getting OOM errors in the logs? How are you monitoring CPU?
Memory usage is at 95% of the total memory.
Afterwards (I imagine once it runs out of memory) the CPU load goes up in a similar way.
It's all being monitored through Nagios.
What are the specs of the node? How much memory and CPU does it have?
How much memory is assigned to Elasticsearch? Do you have any OOM lines in the Elasticsearch logs?
What do you run on this machine? Only Elasticsearch, or anything else?
What are the specs of the node? How much memory and CPU does it have?
It's a Hyper-V VM running on a Failover Cluster. It has 4 cores and currently 32GB of RAM.
How much memory is assigned to Elasticsearch?
The Elasticsearch JVM heap is set to 16GB.
Do you have any OOM lines in the Elasticsearch logs?
Not one. It's something I've looked desperately for...
What do you run on this machine? Only Elasticsearch, or anything else?
This machine currently runs the Elastic Stack: Elasticsearch, Logstash and Kibana. When running top, the biggest consumer of memory and CPU is Elasticsearch.
Thank you for all your help.
If you aren't getting OOM and the heap use isn't >75%, but you are seeing OS memory (aka off heap) being used, then that is the OS caching commonly used files. This is normal behaviour.
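A rough way to see that on the host itself (nothing Elasticsearch-specific, just standard Linux tooling):

# "buff/cache" is the kernel page cache, which is reclaimed automatically
# under memory pressure, so "available" is the number worth alerting on,
# not "free"
free -m

# resident memory per process, largest first, to see what actually holds RAM
ps -eo pid,comm,rss --sort=-rss | head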
The thing is that for months this has never been an issue; CPU and memory usage have been OK. It's only been like this for the last week or two.
I also need some kind of proof that it's the OS caching commonly used files; alerts going off from one day to the next isn't that common, so I need a source for it.
What does free -m or similar show? What's your heap use at?
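If it's easier, heap and overall memory use can also be read straight from the node, for example (localhost and the default port assumed; add credentials if security is enabled):

# per-node heap and RAM usage as a quick point-in-time check
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max,ram.percent'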
[root@server /]# free -m
              total        used        free      shared  buff/cache   available
Mem:          31976       21515        4561        1111        5899        8960
Swap:          5119        2650        2469
[root@server /]# ./jstat -gc 104444
S0C S1C S0U S1U EC EU OC OU MC MU CCSC CCSU YGC YGCT FGC FGCT CGC CGCT GCT
0.0 32768.0 0.0 26671.5 851968.0 729088.0 15892480.0 12089611.0 131484.0 126934.4 16512.0 15111.6 195944 7301.451 0 0.000 14654 240.817 7542.268
I believe that is the information you were asking for.
I'm seeing this in the logs (note the timestamps, though):
[2021-08-24T10:46:39,250][INFO ][o.e.m.j.JvmGcMonitorService] [server] [gc][411943] overhead, spent [388ms] collecting in the last [1s]
[2021-08-24T10:48:55,464][INFO ][o.e.m.j.JvmGcMonitorService] [server] [gc][412078] overhead, spent [487ms] collecting in the last [1s]
[2021-08-24T10:49:07,862][INFO ][o.e.m.j.JvmGcMonitorService] [server] [gc][young][412090][195902] duration [709ms], collections [1]/[1.2s], total [709ms]/[2h], memory [12gb]->[11.5gb]/[16gb], all_pools {[young] [536mb]->[16mb]/[0b]}{[old] [11.4gb]->[11.4gb]/[16gb]}{[survivor] [62.3mb]->[39.5mb]/[0b]}
[2021-08-24T10:49:07,863][WARN ][o.e.m.j.JvmGcMonitorService] [server] [gc][412090] overhead, spent [709ms] collecting in the last [1.2s]
[2021-08-24T10:49:15,799][INFO ][o.e.m.j.JvmGcMonitorService] [server] [gc][young][412097][195906] duration [917ms], collections [1]/[1.8s], total [917ms]/[2h], memory [11.7gb]->[11.5gb]/[16gb], all_pools {[young] [280mb]->[0b]/[0b]}{[old] [11.4gb]->[11.4gb]/[16gb]}{[survivor] [59.5mb]->[68.6mb]/[0b]}
[2021-08-24T10:49:15,938][INFO ][o.e.m.j.JvmGcMonitorService] [server] [gc][412097] overhead, spent [917ms] collecting in the last [1.8s]
[2021-08-24T10:50:34,266][WARN ][o.e.m.j.JvmGcMonitorService] [server] [gc][young][412173][195946] duration [2.2s], collections [1]/[2.9s], total [2.2s]/[2h], memory [11.9gb]->[10.7gb]/[16gb], all_pools {[young] [416mb]->[0b]/[0b]}{[old] [11.5gb]->[10.6gb]/[16gb]}{[survivor] [29.2mb]->[29.7mb]/[0b]}
[2021-08-24T10:50:34,269][WARN ][o.e.m.j.JvmGcMonitorService] [server] [gc][412173] overhead, spent [2.2s] collecting in the last [2.9s]
[2021-08-24T11:08:15,341][WARN ][o.e.m.j.JvmGcMonitorService] [server] [gc][413227] overhead, spent [557ms] collecting in the last [1.1s]
[2021-08-24T11:08:26,363][INFO ][o.e.m.j.JvmGcMonitorService] [server] [gc][413238] overhead, spent [312ms] collecting in the last [1s]
[2021-08-24T11:09:28,925][INFO ][o.e.m.j.JvmGcMonitorService] [server] [gc][413300] overhead, spent [442ms] collecting in the last [1s]
[2021-08-24T11:10:32,369][WARN ][o.e.m.j.JvmGcMonitorService] [server] [gc][413363] overhead, spent [683ms] collecting in the last [1s]
[2021-08-24T11:12:18,165][INFO ][o.e.m.j.JvmGcMonitorService] [server] [gc][413467] overhead, spent [679ms] collecting in the last [1.5s]
[2021-08-24T11:12:51,359][WARN ][o.e.m.j.JvmGcMonitorService] [server] [gc][young][413499][196562] duration [1.1s], collections [1]/[2s], total [1.1s]/[2h], memory [11.5gb]->[10.9gb]/[16gb], all_pools {[young] [136mb]->[0b]/[0b]}{[old] [11.4gb]->[10.8gb]/[16gb]}{[survivor] [36mb]->[40mb]/[0b]}
[2021-08-24T11:12:51,359][WARN ][o.e.m.j.JvmGcMonitorService] [server] [gc][413499] overhead, spent [1.1s] collecting in the last [2s]
[2021-08-24T11:25:26,704][INFO ][o.e.m.j.JvmGcMonitorService] [server] [gc][414250] overhead, spent [326ms] collecting in the last [1s]
[2021-08-24T11:25:46,102][INFO ][o.e.m.j.JvmGcMonitorService] [server] [gc][414269] overhead, spent [609ms] collecting in the last [1.3s]
I am not sure if this is normal or not.
What is the output from the _cluster/stats?pretty&human API?
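Something like this should do it (host and port are just the defaults; adjust to your setup and add credentials if security is enabled):

# full cluster statistics, human-readable units
curl -s 'http://localhost:9200/_cluster/stats?human&pretty'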
This is currently the status:
"_nodes": {
"total": 1,
"successful": 1,
"failed": 0
},
"cluster_name": "elasticsearch",
"cluster_uuid": "g-123456-jursdfghw-x",
"timestamp": 1629797385796,
"status": "yellow",
"indices": {
"count": 4106,
"shards": {
"total": 4106,
"primaries": 4106,
"replication": 0.0,
"index": "@{shards=; primaries=; replication=}"
},
"docs": {
"count": 309368027,
"deleted": 4196
},
"store": {
"size": "211.4gb",
"size_in_bytes": 227037222741,
"reserved": "0b",
"reserved_in_bytes": 0
},
"fielddata": {
"memory_size": "0b",
"memory_size_in_bytes": 0,
"evictions": 0
},
"query_cache": {
"memory_size": "13.2kb",
"memory_size_in_bytes": 13584,
"total_count": 14292,
"hit_count": 313,
"miss_count": 13979,
"cache_size": 1,
"cache_count": 29,
"evictions": 28
},
"completion": {
"size": "0b",
"size_in_bytes": 0
},
"segments": {
"count": 27959,
"memory": "902mb",
"memory_in_bytes": 945838494,
"terms_memory": "749mb",
"terms_memory_in_bytes": 785437936,
"stored_fields_memory": "13.4mb",
"stored_fields_memory_in_bytes": 14121432,
"term_vectors_memory": "0b",
"term_vectors_memory_in_bytes": 0,
"norms_memory": "104.5mb",
"norms_memory_in_bytes": 109596992,
"points_memory": "0b",
"points_memory_in_bytes": 0,
"doc_values_memory": "34.9mb",
"doc_values_memory_in_bytes": 36682134,
"index_writer_memory": "317.8mb",
"index_writer_memory_in_bytes": 333304232,
"version_map_memory": "3.5mb",
"version_map_memory_in_bytes": 3751225,
"fixed_bit_set": "12.4mb",
"fixed_bit_set_memory_in_bytes": 13014296,
"max_unsafe_auto_id_timestamp": 1629764489972,
"file_sizes": ""
},
"mappings": {
"field_types": " "
},
"analysis": {
"char_filter_types": "",
"tokenizer_types": "",
"filter_types": "",
"analyzer_types": "",
"built_in_char_filters": "",
"built_in_tokenizers": "",
"built_in_filters": " ",
"built_in_analyzers": ""
}
},
"nodes": {
"count": {
"total": 1,
"coordinating_only": 0,
"data": 1,
"data_cold": 1,
"data_content": 1,
"data_hot": 1,
"data_warm": 1,
"ingest": 1,
"master": 1,
"ml": 1,
"remote_cluster_client": 1,
"transform": 1,
"voting_only": 0
},
"versions": [
"7.10.1"
],
"os": {
"available_processors": 4,
"allocated_processors": 4,
"names": "",
"pretty_names": "",
"mem": "@{total=31.2gb; total_in_bytes=33530023936; free=400.5mb; free_in_bytes=419991552; used=30.8gb; used_in_bytes=33110032384; free_percent=1; used_percent=9
9}"
},
"process": {
"cpu": "@{percent=79}",
"open_file_descriptors": "@{min=22870; max=22870; avg=22870}"
},
"jvm": {
"max_uptime": "4.8d",
"max_uptime_in_millis": 417449522,
"versions": "",
"mem": "@{heap_used=10.9gb; heap_used_in_bytes=11771068080; heap_max=16gb; heap_max_in_bytes=17179869184}",
"threads": 186
},
"fs": {
"total": "299.9gb",
"total_in_bytes": 322065928192,
"free": "85.1gb",
"free_in_bytes": 91467526144,
"available": "85.1gb",
"available_in_bytes": 91467526144
},
"plugins": [
],
"network_types": {
"transport_types": "@{security4=1}",
"http_types": "@{security4=1}"
},
"discovery_types": {
"single-node": 1
},
"packaging_types": [
"@{flavor=default; type=rpm; count=1}"
],
"ingest": {
"number_of_pipelines": 21,
"processor_stats": "@{conditional=; convert=; date=; foreach=; geoip=; grok=; gsub=; json=; lowercase=; pipeline=; remove=; rename=; script=; set=; user_agen
t=}"
}
}
}
The only thing that doesn't look right is the amount of shards; in all the index templates I set
"number_of_shards": "1",
so they SHOULD each be using one.
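The templates look roughly like this (the template name and index pattern below are placeholders, not the real ones):

# composable index template forcing a single primary shard per index
curl -s -X PUT 'http://localhost:9200/_index_template/my-logs-template' -H 'Content-Type: application/json' -d '{
  "index_patterns": ["my-logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 1
    }
  }
}'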
OK, then this relates to your other topic, "Failed to execute progress listener on query failure".
Yes yes, this is all mostly related. I felt that maybe I could get info from here to solve the other issue.
Why do you have over 4000 shards for 211GB of data???
It should be set to 1 shard.
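It is worth double-checking where those shards actually come from, for example (localhost assumed):

# primaries per index, largest first; with 4106 indices and 4106 shards this
# will most likely show "1" everywhere, i.e. the real problem is the number
# of indices, not the shards per index
curl -s 'http://localhost:9200/_cat/indices?v&h=index,pri,docs.count,store.size&s=pri:desc' | head -20

# total shards allocated to the node
curl -s 'http://localhost:9200/_cat/allocation?v'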
The only thing I can think of is that those are "older" indices, created before the index template was set to 1 shard, but... those are from months ago.
Shards are not free and contribute to heap usage and overhead. You have far too many and should look to reduce that dramatically.
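One way to reduce it is to consolidate old, small time-based indices into bigger ones, roughly like this (the index names are made up; verify the new index before deleting the originals):

# copy a batch of small daily indices into one monthly index
curl -s -X POST 'http://localhost:9200/_reindex' -H 'Content-Type: application/json' -d '{
  "source": { "index": ["logs-2021.05.01", "logs-2021.05.02"] },
  "dest":   { "index": "logs-2021.05" }
}'

# once the document counts match, delete the originals
curl -s -X DELETE 'http://localhost:9200/logs-2021.05.01,logs-2021.05.02'

Going forward, fewer, larger indices (weekly or monthly, or ILM rollover) keep the count from creeping back up.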