How to figure out what is using so much CPU and memory in Elasticsearch?

Hello

I'm looking at the logs and such, but nothing really calls out why my Elasticsearch is running out of memory and CPU. There is nothing huge size-wise and nothing consuming ALL of the CPU.

Where would be a good log to view?

Thank you.

How do you know it's running out of memory? Are you getting OOM errors in the logs? How are you monitoring CPU?

Memory usage is at 95% of the total memory.

Afterwards (I imagine it runs out of memory) the CPU load goes up similarly.

It's all being monitored through Nagios.

What are the specs of the node? How much memory and CPU does it have?

How much memory is assigned to Elasticsearch? Do you have any OOM lines in the Elasticsearch logs?

What do you run on this machine? Only Elasticsearch, or anything else?

What are the specs of the node? How much memory and CPU does it have?

It's a Hyper-V VM running on a Failover Cluster. It has 4 cores and currently 32GB of RAM.

How much memory is assigned to Elasticsearch?

The Elasticsearch JVM heap is set to 16GB.

Do you have any OOM lines in the Elasticsearch logs?

Not one. It's something I've looked for desperately...

What do you run on this machine? Only Elasticsearch, or anything else?

This machine currently runs the Elastic Stack: Elasticsearch, Logstash, and Kibana. When running top, the biggest consumer of both memory and CPU is Elasticsearch.

Thank you for all your help.

If you aren't getting OOM and the heap use isn't >75%, but you are seeing OS memory (aka off heap) being used, then that is the OS caching commonly used files. This is normal behaviour.
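One way to check that yourself (a rough sketch, assuming you can run commands on the node and Elasticsearch is listening on localhost:9200 without security enabled):

# Page cache shows up under "buff/cache" and is counted as reclaimable in the "available" column:
free -m

# Elasticsearch's own view of JVM heap vs OS memory (the JVM heap is the part that matters for OOM):
curl -s 'http://localhost:9200/_nodes/stats/jvm,os?human&pretty'

If "available" stays comfortably high and heap_used_percent stays below roughly 75%, a high overall "used" figure on its own is not a problem.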

If you aren't getting OOM and the heap use isn't >75%, but you are seeing OS memory (aka off heap) being used, then that is the OS caching commonly used files. This is normal behaviour.

The thing is that for months this was never an issue; CPU and memory usage were OK. It's only been like this for the last week or two.

I also need some kind of proof that it's the OS caching commonly used files; alerts suddenly firing from one day to the next isn't that common, so I need a source for it.

What does free -m or similar show? What's your heap use at?
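For the heap, something like this is enough (a sketch, again assuming localhost:9200 with no auth):

curl -s 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max,ram.percent'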

[root@server /]# free -m
              total        used        free      shared  buff/cache   available
Mem:          31976       21515        4561        1111        5899        8960
Swap:          5119        2650        2469
[root@server /]# ./jstat -gc 104444
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT    CGC    CGCT     GCT
 0.0   32768.0  0.0   26671.5 851968.0 729088.0 15892480.0 12089611.0 131484.0 126934.4 16512.0 15111.6 195944 7301.451   0      0.000 14654  240.817 7542.268

I believe that is the information you were asking for.

I'm seeing this in the logs (please note the timestamps, though):

[2021-08-24T10:46:39,250][INFO ][o.e.m.j.JvmGcMonitorService] [server] [gc][411943] overhead, spent [388ms] collecting in the last [1s]
[2021-08-24T10:48:55,464][INFO ][o.e.m.j.JvmGcMonitorService] [server] [gc][412078] overhead, spent [487ms] collecting in the last [1s]
[2021-08-24T10:49:07,862][INFO ][o.e.m.j.JvmGcMonitorService] [server] [gc][young][412090][195902] duration [709ms], collections [1]/[1.2s], total [709ms]/[2h], memory [12gb]->[11.5gb]/[16gb], all_pools {[young] [536mb]->[16mb]/[0b]}{[old] [11.4gb]->[11.4gb]/[16gb]}{[survivor] [62.3mb]->[39.5mb]/[0b]}
[2021-08-24T10:49:07,863][WARN ][o.e.m.j.JvmGcMonitorService] [server] [gc][412090] overhead, spent [709ms] collecting in the last [1.2s]
[2021-08-24T10:49:15,799][INFO ][o.e.m.j.JvmGcMonitorService] [server] [gc][young][412097][195906] duration [917ms], collections [1]/[1.8s], total [917ms]/[2h], memory [11.7gb]->[11.5gb]/[16gb], all_pools {[young] [280mb]->[0b]/[0b]}{[old] [11.4gb]->[11.4gb]/[16gb]}{[survivor] [59.5mb]->[68.6mb]/[0b]}
[2021-08-24T10:49:15,938][INFO ][o.e.m.j.JvmGcMonitorService] [server] [gc][412097] overhead, spent [917ms] collecting in the last [1.8s]
[2021-08-24T10:50:34,266][WARN ][o.e.m.j.JvmGcMonitorService] [server] [gc][young][412173][195946] duration [2.2s], collections [1]/[2.9s], total [2.2s]/[2h], memory [11.9gb]->[10.7gb]/[16gb], all_pools {[young] [416mb]->[0b]/[0b]}{[old] [11.5gb]->[10.6gb]/[16gb]}{[survivor] [29.2mb]->[29.7mb]/[0b]}
[2021-08-24T10:50:34,269][WARN ][o.e.m.j.JvmGcMonitorService] [server] [gc][412173] overhead, spent [2.2s] collecting in the last [2.9s]
[2021-08-24T11:08:15,341][WARN ][o.e.m.j.JvmGcMonitorService] [server] [gc][413227] overhead, spent [557ms] collecting in the last [1.1s]
[2021-08-24T11:08:26,363][INFO ][o.e.m.j.JvmGcMonitorService] [server] [gc][413238] overhead, spent [312ms] collecting in the last [1s]
[2021-08-24T11:09:28,925][INFO ][o.e.m.j.JvmGcMonitorService] [server] [gc][413300] overhead, spent [442ms] collecting in the last [1s]
[2021-08-24T11:10:32,369][WARN ][o.e.m.j.JvmGcMonitorService] [server] [gc][413363] overhead, spent [683ms] collecting in the last [1s]
[2021-08-24T11:12:18,165][INFO ][o.e.m.j.JvmGcMonitorService] [server] [gc][413467] overhead, spent [679ms] collecting in the last [1.5s]
[2021-08-24T11:12:51,359][WARN ][o.e.m.j.JvmGcMonitorService] [server] [gc][young][413499][196562] duration [1.1s], collections [1]/[2s], total [1.1s]/[2h], memory [11.5gb]->[10.9gb]/[16gb], all_pools {[young] [136mb]->[0b]/[0b]}{[old] [11.4gb]->[10.8gb]/[16gb]}{[survivor] [36mb]->[40mb]/[0b]}
[2021-08-24T11:12:51,359][WARN ][o.e.m.j.JvmGcMonitorService] [server] [gc][413499] overhead, spent [1.1s] collecting in the last [2s]
[2021-08-24T11:25:26,704][INFO ][o.e.m.j.JvmGcMonitorService] [server] [gc][414250] overhead, spent [326ms] collecting in the last [1s]
[2021-08-24T11:25:46,102][INFO ][o.e.m.j.JvmGcMonitorService] [server] [gc][414269] overhead, spent [609ms] collecting in the last [1.3s]

I am not sure if this is normal or not.

What is the output from the _cluster/stats?pretty&human API?
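I.e. something like this (a sketch, assuming localhost:9200 with no auth):

curl -s 'http://localhost:9200/_cluster/stats?human&pretty'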

This is currently the status:


    "_nodes":  {
                   "total":  1,
                   "successful":  1,
                   "failed":  0
               },
    "cluster_name":  "elasticsearch",
    "cluster_uuid":  "g-123456-jursdfghw-x",
    "timestamp":  1629797385796,
    "status":  "yellow",
    "indices":  {
                    "count":  4106,
                    "shards":  {
                                   "total":  4106,
                                   "primaries":  4106,
                                   "replication":  0.0,
                                   "index":  "@{shards=; primaries=; replication=}"
                               },
                    "docs":  {
                                 "count":  309368027,
                                 "deleted":  4196
                             },
                    "store":  {
                                  "size":  "211.4gb",
                                  "size_in_bytes":  227037222741,
                                  "reserved":  "0b",
                                  "reserved_in_bytes":  0
                              },
                    "fielddata":  {
                                      "memory_size":  "0b",
                                      "memory_size_in_bytes":  0,
                                      "evictions":  0
                                  },
                    "query_cache":  {
                                        "memory_size":  "13.2kb",
                                        "memory_size_in_bytes":  13584,
                                        "total_count":  14292,
                                        "hit_count":  313,
                                        "miss_count":  13979,
                                        "cache_size":  1,
                                        "cache_count":  29,
                                        "evictions":  28
                                    },
                    "completion":  {
                                       "size":  "0b",
                                       "size_in_bytes":  0
                                   },
                    "segments":  {
                                     "count":  27959,
                                     "memory":  "902mb",
                                     "memory_in_bytes":  945838494,
                                     "terms_memory":  "749mb",
                                     "terms_memory_in_bytes":  785437936,
                                     "stored_fields_memory":  "13.4mb",
                                     "stored_fields_memory_in_bytes":  14121432,
                                     "term_vectors_memory":  "0b",
                                     "term_vectors_memory_in_bytes":  0,
                                     "norms_memory":  "104.5mb",
                                     "norms_memory_in_bytes":  109596992,
                                     "points_memory":  "0b",
                                     "points_memory_in_bytes":  0,
                                     "doc_values_memory":  "34.9mb",
                                     "doc_values_memory_in_bytes":  36682134,
                                     "index_writer_memory":  "317.8mb",
                                     "index_writer_memory_in_bytes":  333304232,
                                     "version_map_memory":  "3.5mb",
                                     "version_map_memory_in_bytes":  3751225,
                                     "fixed_bit_set":  "12.4mb",
                                     "fixed_bit_set_memory_in_bytes":  13014296,
                                     "max_unsafe_auto_id_timestamp":  1629764489972,
                                     "file_sizes":  ""
                                 },
                    "mappings":  {
                                     "field_types":  "                   "
                                 },
                    "analysis":  {
                                     "char_filter_types":  "",
                                     "tokenizer_types":  "",
                                     "filter_types":  "",
                                     "analyzer_types":  "",
                                     "built_in_char_filters":  "",
                                     "built_in_tokenizers":  "",
                                     "built_in_filters":  " ",
                                     "built_in_analyzers":  ""
                                 }
                },
    "nodes":  {
                  "count":  {
                                "total":  1,
                                "coordinating_only":  0,
                                "data":  1,
                                "data_cold":  1,
                                "data_content":  1,
                                "data_hot":  1,
                                "data_warm":  1,
                                "ingest":  1,
                                "master":  1,
                                "ml":  1,
                                "remote_cluster_client":  1,
                                "transform":  1,
                                "voting_only":  0
                            },
                  "versions":  [
                                   "7.10.1"
                               ],
                  "os":  {
                             "available_processors":  4,
                             "allocated_processors":  4,
                             "names":  "",
                             "pretty_names":  "",
                             "mem":  "@{total=31.2gb; total_in_bytes=33530023936; free=400.5mb; free_in_bytes=419991552; used=30.8gb; used_in_bytes=33110032384; free_percent=1; used_percent=9
9}"
                         },
                  "process":  {
                                  "cpu":  "@{percent=79}",
                                  "open_file_descriptors":  "@{min=22870; max=22870; avg=22870}"
                              },
                  "jvm":  {
                              "max_uptime":  "4.8d",
                              "max_uptime_in_millis":  417449522,
                              "versions":  "",
                              "mem":  "@{heap_used=10.9gb; heap_used_in_bytes=11771068080; heap_max=16gb; heap_max_in_bytes=17179869184}",
                              "threads":  186
                          },
                  "fs":  {
                             "total":  "299.9gb",
                             "total_in_bytes":  322065928192,
                             "free":  "85.1gb",
                             "free_in_bytes":  91467526144,
                             "available":  "85.1gb",
                             "available_in_bytes":  91467526144
                         },
                  "plugins":  [

                              ],
                  "network_types":  {
                                        "transport_types":  "@{security4=1}",
                                        "http_types":  "@{security4=1}"
                                    },
                  "discovery_types":  {
                                          "single-node":  1
                                      },
                  "packaging_types":  [
                                          "@{flavor=default; type=rpm; count=1}"
                                      ],
                  "ingest":  {
                                 "number_of_pipelines":  21,
                                 "processor_stats":  "@{conditional=; convert=; date=; foreach=; geoip=; grok=; gsub=; json=; lowercase=; pipeline=; remove=; rename=; script=; set=; user_agen
t=}"
                             }
              }
}

The only thing that doesn't look right is the number of shards; I have set this in all the index templates:

"number_of_shards": "1",

So they SHOULD be using one shard each.
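For reference, I can double-check the primary shard count per index with something like this (a sketch; host and credentials adjusted as needed):

curl -s 'http://localhost:9200/_cat/indices?v&h=index,pri,rep,docs.count,store.size&s=index'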

Ok, then this relates to "Failed to execute progress listener on query failure".

Yes, this is all mostly related. I felt that maybe I could get information here to help solve the other issue.

Why do you have over 4000 shards for 211GB of data???


It should be set to 1 shard.

The only thing I can think of is that those are "older" indices from before the index template was set to 1 shard, but... those are months old.

Shards are not free and contribute to heap usage and overhead. You have far too many and should look to reduce that dramatically.
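As a rough starting point (the index names below are hypothetical, so adapt before running anything), list the indices sorted by size so the many tiny ones stand out, consolidate them with _reindex, and then remove the originals:

# Indices sorted by size; with 4106 shards for ~211GB, most of these will be very small:
curl -s 'http://localhost:9200/_cat/indices?v&h=index,pri,docs.count,store.size&s=store.size'

# Example: merge a month of small daily indices into a single monthly index (hypothetical names):
curl -s -X POST 'http://localhost:9200/_reindex' -H 'Content-Type: application/json' -d '
{
  "source": { "index": "logstash-2021.05.*" },
  "dest":   { "index": "logstash-2021.05" }
}'

# Once the reindex has been verified, delete the originals:
curl -s -X DELETE 'http://localhost:9200/logstash-2021.05.*'

Longer term, switching to weekly/monthly indices or using ILM rollover keeps the index (and therefore shard) count from growing like this again.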