How to figure out what is using so much CPU and memory in Elasticsearch?

I believe this is the information you are asking for.

I'm seeing this in the logs (please note the timestamps, though):

[2021-08-24T10:46:39,250][INFO ][o.e.m.j.JvmGcMonitorService] [server] [gc][411943] overhead, spent [388ms] collecting in the last [1s]
[2021-08-24T10:48:55,464][INFO ][o.e.m.j.JvmGcMonitorService] [server] [gc][412078] overhead, spent [487ms] collecting in the last [1s]
[2021-08-24T10:49:07,862][INFO ][o.e.m.j.JvmGcMonitorService] [server] [gc][young][412090][195902] duration [709ms], collections [1]/[1.2s], total [709ms]/[2h], memory [12gb]->[11.5gb]/[16gb], all_pools {[young] [536mb]->[16mb]/[0b]}{[old] [11.4gb]->[11.4gb]/[16gb]}{[survivor] [62.3mb]->[39.5mb]/[0b]}
[2021-08-24T10:49:07,863][WARN ][o.e.m.j.JvmGcMonitorService] [server] [gc][412090] overhead, spent [709ms] collecting in the last [1.2s]
[2021-08-24T10:49:15,799][INFO ][o.e.m.j.JvmGcMonitorService] [server] [gc][young][412097][195906] duration [917ms], collections [1]/[1.8s], total [917ms]/[2h], memory [11.7gb]->[11.5gb]/[16gb], all_pools {[young] [280mb]->[0b]/[0b]}{[old] [11.4gb]->[11.4gb]/[16gb]}{[survivor] [59.5mb]->[68.6mb]/[0b]}
[2021-08-24T10:49:15,938][INFO ][o.e.m.j.JvmGcMonitorService] [server] [gc][412097] overhead, spent [917ms] collecting in the last [1.8s]
[2021-08-24T10:50:34,266][WARN ][o.e.m.j.JvmGcMonitorService] [server] [gc][young][412173][195946] duration [2.2s], collections [1]/[2.9s], total [2.2s]/[2h], memory [11.9gb]->[10.7gb]/[16gb], all_pools {[young] [416mb]->[0b]/[0b]}{[old] [11.5gb]->[10.6gb]/[16gb]}{[survivor] [29.2mb]->[29.7mb]/[0b]}
[2021-08-24T10:50:34,269][WARN ][o.e.m.j.JvmGcMonitorService] [server] [gc][412173] overhead, spent [2.2s] collecting in the last [2.9s]
[2021-08-24T11:08:15,341][WARN ][o.e.m.j.JvmGcMonitorService] [server] [gc][413227] overhead, spent [557ms] collecting in the last [1.1s]
[2021-08-24T11:08:26,363][INFO ][o.e.m.j.JvmGcMonitorService] [server] [gc][413238] overhead, spent [312ms] collecting in the last [1s]
[2021-08-24T11:09:28,925][INFO ][o.e.m.j.JvmGcMonitorService] [server] [gc][413300] overhead, spent [442ms] collecting in the last [1s]
[2021-08-24T11:10:32,369][WARN ][o.e.m.j.JvmGcMonitorService] [server] [gc][413363] overhead, spent [683ms] collecting in the last [1s]
[2021-08-24T11:12:18,165][INFO ][o.e.m.j.JvmGcMonitorService] [server] [gc][413467] overhead, spent [679ms] collecting in the last [1.5s]
[2021-08-24T11:12:51,359][WARN ][o.e.m.j.JvmGcMonitorService] [server] [gc][young][413499][196562] duration [1.1s], collections [1]/[2s], total [1.1s]/[2h], memory [11.5gb]->[10.9gb]/[16gb], all_pools {[young] [136mb]->[0b]/[0b]}{[old] [11.4gb]->[10.8gb]/[16gb]}{[survivor] [36mb]->[40mb]/[0b]}
[2021-08-24T11:12:51,359][WARN ][o.e.m.j.JvmGcMonitorService] [server] [gc][413499] overhead, spent [1.1s] collecting in the last [2s]
[2021-08-24T11:25:26,704][INFO ][o.e.m.j.JvmGcMonitorService] [server] [gc][414250] overhead, spent [326ms] collecting in the last [1s]
[2021-08-24T11:25:46,102][INFO ][o.e.m.j.JvmGcMonitorService] [server] [gc][414269] overhead, spent [609ms] collecting in the last [1.3s]

I am not sure if this is normal or not.

What is the output from the _cluster/stats?pretty&human API?
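
For example, something like this (a minimal sketch, assuming the node is reachable locally on the default port 9200):

    curl -s "http://localhost:9200/_cluster/stats?human&pretty"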

This is currently the status:


    "_nodes":  {
                   "total":  1,
                   "successful":  1,
                   "failed":  0
               },
    "cluster_name":  "elasticsearch",
    "cluster_uuid":  "g-123456-jursdfghw-x",
    "timestamp":  1629797385796,
    "status":  "yellow",
    "indices":  {
                    "count":  4106,
                    "shards":  {
                                   "total":  4106,
                                   "primaries":  4106,
                                   "replication":  0.0,
                                   "index":  "@{shards=; primaries=; replication=}"
                               },
                    "docs":  {
                                 "count":  309368027,
                                 "deleted":  4196
                             },
                    "store":  {
                                  "size":  "211.4gb",
                                  "size_in_bytes":  227037222741,
                                  "reserved":  "0b",
                                  "reserved_in_bytes":  0
                              },
                    "fielddata":  {
                                      "memory_size":  "0b",
                                      "memory_size_in_bytes":  0,
                                      "evictions":  0
                                  },
                    "query_cache":  {
                                        "memory_size":  "13.2kb",
                                        "memory_size_in_bytes":  13584,
                                        "total_count":  14292,
                                        "hit_count":  313,
                                        "miss_count":  13979,
                                        "cache_size":  1,
                                        "cache_count":  29,
                                        "evictions":  28
                                    },
                    "completion":  {
                                       "size":  "0b",
                                       "size_in_bytes":  0
                                   },
                    "segments":  {
                                     "count":  27959,
                                     "memory":  "902mb",
                                     "memory_in_bytes":  945838494,
                                     "terms_memory":  "749mb",
                                     "terms_memory_in_bytes":  785437936,
                                     "stored_fields_memory":  "13.4mb",
                                     "stored_fields_memory_in_bytes":  14121432,
                                     "term_vectors_memory":  "0b",
                                     "term_vectors_memory_in_bytes":  0,
                                     "norms_memory":  "104.5mb",
                                     "norms_memory_in_bytes":  109596992,
                                     "points_memory":  "0b",
                                     "points_memory_in_bytes":  0,
                                     "doc_values_memory":  "34.9mb",
                                     "doc_values_memory_in_bytes":  36682134,
                                     "index_writer_memory":  "317.8mb",
                                     "index_writer_memory_in_bytes":  333304232,
                                     "version_map_memory":  "3.5mb",
                                     "version_map_memory_in_bytes":  3751225,
                                     "fixed_bit_set":  "12.4mb",
                                     "fixed_bit_set_memory_in_bytes":  13014296,
                                     "max_unsafe_auto_id_timestamp":  1629764489972,
                                     "file_sizes":  ""
                                 },
                    "mappings":  {
                                     "field_types":  "                   "
                                 },
                    "analysis":  {
                                     "char_filter_types":  "",
                                     "tokenizer_types":  "",
                                     "filter_types":  "",
                                     "analyzer_types":  "",
                                     "built_in_char_filters":  "",
                                     "built_in_tokenizers":  "",
                                     "built_in_filters":  " ",
                                     "built_in_analyzers":  ""
                                 }
                },
    "nodes":  {
                  "count":  {
                                "total":  1,
                                "coordinating_only":  0,
                                "data":  1,
                                "data_cold":  1,
                                "data_content":  1,
                                "data_hot":  1,
                                "data_warm":  1,
                                "ingest":  1,
                                "master":  1,
                                "ml":  1,
                                "remote_cluster_client":  1,
                                "transform":  1,
                                "voting_only":  0
                            },
                  "versions":  [
                                   "7.10.1"
                               ],
                  "os":  {
                             "available_processors":  4,
                             "allocated_processors":  4,
                             "names":  "",
                             "pretty_names":  "",
                             "mem":  "@{total=31.2gb; total_in_bytes=33530023936; free=400.5mb; free_in_bytes=419991552; used=30.8gb; used_in_bytes=33110032384; free_percent=1; used_percent=9
9}"
                         },
                  "process":  {
                                  "cpu":  "@{percent=79}",
                                  "open_file_descriptors":  "@{min=22870; max=22870; avg=22870}"
                              },
                  "jvm":  {
                              "max_uptime":  "4.8d",
                              "max_uptime_in_millis":  417449522,
                              "versions":  "",
                              "mem":  "@{heap_used=10.9gb; heap_used_in_bytes=11771068080; heap_max=16gb; heap_max_in_bytes=17179869184}",
                              "threads":  186
                          },
                  "fs":  {
                             "total":  "299.9gb",
                             "total_in_bytes":  322065928192,
                             "free":  "85.1gb",
                             "free_in_bytes":  91467526144,
                             "available":  "85.1gb",
                             "available_in_bytes":  91467526144
                         },
                  "plugins":  [

                              ],
                  "network_types":  {
                                        "transport_types":  "@{security4=1}",
                                        "http_types":  "@{security4=1}"
                                    },
                  "discovery_types":  {
                                          "single-node":  1
                                      },
                  "packaging_types":  [
                                          "@{flavor=default; type=rpm; count=1}"
                                      ],
                  "ingest":  {
                                 "number_of_pipelines":  21,
                                 "processor_stats":  "@{conditional=; convert=; date=; foreach=; geoip=; grok=; gsub=; json=; lowercase=; pipeline=; remove=; rename=; script=; set=; user_agen
t=}"
                             }
              }
}

The only thing that doesn't seem right is the amount of shards; I set this in all the index templates:

"number_of_shards": "1",

So they SHOULD be using one.
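
Roughly like this in the templates (a minimal sketch in the 7.x composable template format; the template name and index pattern here are placeholders, not my exact ones):

    PUT _index_template/logs-template
    {
      "index_patterns": ["log.*"],
      "template": {
        "settings": {
          "number_of_shards": 1,
          "number_of_replicas": 0
        }
      }
    }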

Ok, then this relates to "Failed to execute progress listener on query failure".

Yes yes, this is all mostly related. I felt that maybe I could get info from here to solve the other issue.

Why do you have over 4000 shards for 211GB of data???


It should be set to 1 shard.

The only thing I can think of is that those are "older" indexes from before the index template was set to 1, but... those are from months ago.

Shards are not free and contribute to heap usage and overhead. You have far too many and should look to reduce that dramatically.

The issue is that not only did I not know I had too many, BUT I don't know how to reduce them in my scenario.

Have a look at this article.

Point by point:

How to reduce the number of shards of newly created indices

I have shards set to 1

Reduce the number of replica shards

I do not have replica shards

Reduce the number of primary shards

I cannot do this; the client asked us to delete indexes/records after 180 days. If I do this, the index is recreated even if the data is old, and the 180-day "limit" is reset.

Reduce the number of shards with the Shrink API

From what I understand, this also creates a NEW index.

Reduce the number of shards with the Reindex API

Same

Reducing the number of shards of your time-based indices

I think this is the only way, BUT I'm looking in the stack to see whether there is a way to do this automatically or if I have to write a monthly cron script.

Reducing the number of shards for multi-tenancy indices

N/A

Reducing the number of shards with Filtered Aliases

N/A

It sounds like you are creating over 20 indices per day. Merge these into fewer indices and switch to weekly or monthly indices instead of daily.


I'm pretty sure there are 20 or more.

Merge these into fewer indices and switch to weekly or monthly indices instead of daily.

When I was looking up information on how to set up Elastic, I read that for daily login events and other time-based data I should always use daily indices.

Switching over to monthly ones doesn't affect me much because the name is variable-based, but I just want to know why (or why not) to use daily ones. Once I get documentation and an explanation, switching them over is easy, as I just adjust my Logstash configuration.

Thank you

As you can see, a large number of distinct daily indices combined with a long retention period results in a very large number of shards, which is inefficient and potentially problematic. Look at some of the links I provided earlier that recommend ensuring your shards are at least in the GB size range.

I took a look at my indexes and currently the largest one I have is 645.9mb.
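
In case it is useful, this is roughly how I listed them, largest first (a sketch, assuming the _cat API on the same node):

    GET _cat/indices?v&h=index,pri,docs.count,store.size&s=store.size:desc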

I took a look at the article you posted (thank you) and, as I mentioned:

  • The client wants a 180-day retention for logs.
  • From the article, the only thing I can do is merge log.2021.08.24, log.2021.08.05, etc. into log.2021.08; the only issue with that is that right now log.2021.08.24 is 2 days old, while if I merge them right now, log.2021.08 would be zero days old.
  • On that point, looking in Elastic it seems there is no "automatic" way to do this on a monthly basis; do I need to write a bash script and run it via cron on the first day of each month?

Thanks

Change the way you index so you start indexing into monthly indices now. Over the next 6 months the daily indices will gradually be phased out of the system. If you want to reduce the shard count more quickly, you will need to manually reindex the daily indices into monthly ones and then remove the daily indices.
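
As a rough sketch of that manual step (the index names below are placeholders based on your naming, so adjust them, and only delete the dailies after confirming the document counts match):

    POST _reindex
    {
      "source": { "index": "log.2021.08.*" },
      "dest":   { "index": "log.2021.08" }
    }

    DELETE log.2021.08.*

Assuming your index template pattern also matches the monthly name, the destination index will be created with a single primary shard just like the dailies.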

Are there pros and cons to this (besides reducing the shard count)?

Changing to monthly, when it comes to the actual configuration, shouldn't affect almost anything, as my Kibana index patterns look for log.-* and I would just have to change the index name in Logstash to log-yyyy-mm.
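
Something like this in the Logstash elasticsearch output, I assume (a sketch; the host and the exact prefix/date pattern are placeholders from my side):

    output {
      elasticsearch {
        hosts => ["localhost:9200"]
        # monthly bucket instead of a daily one like log-%{+YYYY.MM.dd}
        index => "log-%{+YYYY.MM}"
      }
    }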

I just want to make sure that down the road I don't have to change it BACK to daily.

For this type of data it is often recommended to aim for an average shard size between 20GB and 50GB. Based on that you would need no more than 10 shards for the data volume currently in the cluster, and you have 400 times that. Given that switching to monthly indices will only reduce the shard count by around a factor of 30, I do not see any risks at all. I would even recommend also consolidating some of the smaller indices if possible.