ELK suddenly collapsed

Hi there!

Our environment is 11 physical machines with 128 GB RAM each and a lot of storage in RAID 6.
On 3 servers we run 1x master and 1x data node.
On 8 servers we run 2x data nodes.
Indexes have 8 primary shards + 2 replicas. A new index is created every day.
In the system we have 54 TB of data in primary shards (~160 TB replicated).

Last time we had problems with fielddata. Following the kind suggestion of @Igor_Motov (topic: ELK for 30k eps - memory problem), we started to use doc values as described in Using Doc Values.

Now we have encountered another problem. JVM heap usage was constant at 94-96%, the new index for the next day was automatically created, and then OS CPU reached about 80% and LOAD hit 100%. After that the whole cluster became sluggish. Installed plugins were barely responding, or not responding at all.

Do you have any ideas about what could have gone wrong, or how to troubleshoot these issues?

Appreciate your help.

Hi, are you still using ES 1.7.x? In 2.x there is much more detailed information about memory usage.

My guess is that you have too much indexed data and the terms held in memory eat up all the RAM.

Another concern is the default ES Java GC settings: in this case, are they forcing it to GC endlessly?
Defaults in /usr/share/elasticsearch/bin/elasticsearch.in.sh:

JAVA_OPTS="$JAVA_OPTS -XX:CMSInitiatingOccupancyFraction=75"
JAVA_OPTS="$JAVA_OPTS -XX:+UseCMSInitiatingOccupancyOnly"

Can you decrease the replication factor for old indexes (say from 2 to 1) to decrease memory pressure?
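Something like this via the index settings API; the index pattern here is just an example, adjust it to your own naming scheme:

curl -XPUT "localhost:9200/logstash-2015.09.*/_settings" -d '{
  "index" : { "number_of_replicas" : 1 }
}'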

Yes, we are still using ES 1.7.3 due to a custom application which is not compatible with 2.x :frowning:. So we are unable to run a newer ES for at least the next 2 months.

Is there some way I can check whether we have too much indexed data and too many terms indexed?

I've checked our GC settings: they are exactly the same as those defaults.

I will try to decrease the replication in the evening and check whether it works better.

You can also close unused indexes (their data becomes unavailable, but it frees up memory).
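For example (the index name is just a placeholder):

curl -XPOST "localhost:9200/logstash-2015.08.01/_close"
curl -XPOST "localhost:9200/logstash-2015.08.01/_open"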

Run curl -XGET "localhost:9200/_stats?pretty" and take a look at _all > segments > memory_in_bytes. Divide memory_in_bytes by the number of nodes and you'll get a rough estimate of the memory consumption on each node.
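A rough sketch of that check; the 19 is your data node count and the byte figure is the segments total you pasted further down, so substitute your own numbers:

# Only the segments section of the stats (the stats API can be filtered by metric):
curl -s -XGET "localhost:9200/_stats/segments?pretty"

# Then divide _all > total > segments > memory_in_bytes by the data node count:
echo $(( 160138599888 / 19 / 1024 / 1024 ))   # ~8038 MB per node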

For more details, please follow these links: topic1, topic2

Don't do that unless you have to, though: closed indices are no longer managed by ES, meaning that if a node dies holding shards of a closed index, they will not be recovered elsewhere. Say you close an index for 2 months and have 2 node failures in that time... it could just be that both the primary and the replica of shard X were discarded and never recovered. You try to open the index one day and voilà: it's red and you've lost data.

Just saying :wink:

Unfortunately we cannot close any indices because they have to be available for searching :disappointed:.

I calculated the segment memory consumption per node: it is 3323 MB. Each node has 30 GB of heap memory.

Right now (after a cluster restart at night) 4 nodes have LOAD > 22% and OS CPU > 63% :fearful:.

Here I paste the _all part of the output from _stats:
{ "_shards" : { "total" : 5487, "successful" : 5487, "failed" : 0 }, "_all" : { "primaries" : { "docs" : { "count" : 74616308015, "deleted" : 4 }, "store" : { "size_in_bytes" : 52037605253264, "throttle_time_in_millis" : 580330 }, "indexing" : { "index_total" : 72596323, "index_time_in_millis" : 39291233, "index_current" : 43, "delete_total" : 0, "delete_time_in_millis" : 0, "delete_current" : 0, "noop_update_total" : 0, "is_throttled" : false, "throttle_time_in_millis" : 334250 }, "get" : { "total" : 58, "time_in_millis" : 252, "exists_total" : 32, "exists_time_in_millis" : 252, "missing_total" : 26, "missing_time_in_millis" : 0, "current" : 0 }, "search" : { "open_contexts" : 0, "query_total" : 73175, "query_time_in_millis" : 38312584, "query_current" : 0, "fetch_total" : 810, "fetch_time_in_millis" : 41673, "fetch_current" : 0 }, "merges" : { "current" : 13, "current_docs" : 24516380, "current_size_in_bytes" : 16638380846, "total" : 17391, "total_time_in_millis" : 19036354, "total_docs" : 189949914, "total_size_in_bytes" : 155261369220 }, "refresh" : { "total" : 24949, "total_time_in_millis" : 1861033 }, "flush" : { "total" : 4080, "total_time_in_millis" : 155363 }, "warmer" : { "current" : 1, "total" : 66962, "total_time_in_millis" : 55301 }, "filter_cache" : { "memory_size_in_bytes" : 239706956, "evictions" : 0 }, "id_cache" : { "memory_size_in_bytes" : 0 }, "fielddata" : { "memory_size_in_bytes" : 53669147240, "evictions" : 0 }, "percolate" : { "total" : 0, "time_in_millis" : 0, "current" : 0, "memory_size_in_bytes" : -1, "memory_size" : "-1b", "queries" : 0 }, "completion" : { "size_in_bytes" : 0 }, "segments" : { "count" : 33207, "memory_in_bytes" : 55759533906, "index_writer_memory_in_bytes" : 194051416, "index_writer_max_memory_in_bytes" : 6418345984, "version_map_memory_in_bytes" : 22627648, "fixed_bit_set_memory_in_bytes" : 0 }, "translog" : { "operations" : 4262642, "size_in_bytes" : 17 }, "suggest" : { "total" : 0, "time_in_millis" : 0, "current" : 0 }, "query_cache" : { "memory_size_in_bytes" : 0, "evictions" : 0, "hit_count" : 0, "miss_count" : 0 }, "recovery" : { "current_as_source" : 2, "current_as_target" : 0, "throttle_time_in_millis" : 48486663 } },

Can you see anything here that looks bad?

I'm adding the total part of _stats:

"total" : {
      "docs" : {
        "count" : 209731136290,
        "deleted" : 11
      },
      "store" : {
        "size_in_bytes" : 148792506624981,
        "throttle_time_in_millis" : 2346188
      },
      "indexing" : {
        "index_total" : 315778413,
        "index_time_in_millis" : 194708289,
        "index_current" : 884929,
        "delete_total" : 0,
        "delete_time_in_millis" : 0,
        "delete_current" : 0,
        "noop_update_total" : 0,
        "is_throttled" : true,
        "throttle_time_in_millis" : 8463144
      },
      "get" : {
        "total" : 113,
        "time_in_millis" : 478,
        "exists_total" : 65,
        "exists_time_in_millis" : 478,
        "missing_total" : 48,
        "missing_time_in_millis" : 0,
        "current" : 0
      },
      "search" : {
        "open_contexts" : 0,
        "query_total" : 197813,
        "query_time_in_millis" : 99386423,
        "query_current" : 0,
        "fetch_total" : 1615,
        "fetch_time_in_millis" : 60931,
        "fetch_current" : 0
      },
      "merges" : {
        "current" : 39,
        "current_docs" : 143235337,
        "current_size_in_bytes" : 118060045730,
        "total" : 194760,
        "total_time_in_millis" : 108891407,
        "total_docs" : 925389815,
        "total_size_in_bytes" : 726571892934
      },
      "refresh" : {
        "total" : 77029,
        "total_time_in_millis" : 6773391
      },
      "flush" : {
        "total" : 8221,
        "total_time_in_millis" : 592046
      },
      "warmer" : {
        "current" : 0,
        "total" : 348782,
        "total_time_in_millis" : 380957
      },
      "filter_cache" : {
        "memory_size_in_bytes" : 474864888,
        "evictions" : 0
      },
      "id_cache" : {
        "memory_size_in_bytes" : 0
      },
      "fielddata" : {
        "memory_size_in_bytes" : 137580122720,
        "evictions" : 0
      },
      "percolate" : {
        "total" : 0,
        "time_in_millis" : 0,
        "current" : 0,
        "memory_size_in_bytes" : -1,
        "memory_size" : "-1b",
        "queries" : 0
      },
      "completion" : {
        "size_in_bytes" : 0
      },
      "segments" : {
        "count" : 94408,
        "memory_in_bytes" : 160138599888,
        "index_writer_memory_in_bytes" : 263674008,
        "index_writer_max_memory_in_bytes" : 19762286592,
        "version_map_memory_in_bytes" : 31660372,
        "fixed_bit_set_memory_in_bytes" : 0
      },
      "translog" : {
        "operations" : 9047857,
        "size_in_bytes" : 17
      },
      "suggest" : {
        "total" : 0,
        "time_in_millis" : 0,
        "current" : 0
      },
      "query_cache" : {
        "memory_size_in_bytes" : 0,
        "evictions" : 0,
        "hit_count" : 0,
        "miss_count" : 0
      },
      "recovery" : {
        "current_as_source" : 2,
        "current_as_target" : 0,
        "throttle_time_in_millis" : 199816009
      }
    }
  },

Taking the total:

data_nodes_count = 19 (3 + 8 * 2)

filter_cache.memory_size_in_bytes   474 864 888     | 0.02 GB/node
fielddata.memory_size_in_bytes      137 580 122 720 | 6.74 GB/node
segments.memory_in_bytes            160 138 599 888 | 7.85 GB/node

Total: 14.62 GB/node

So the biggest memory eaters are fielddata and segments. To enable doc_values in old indexes you need to reindex with a new template.
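For the template part, a minimal sketch of an ES 1.x index template that maps string fields as not_analyzed with doc_values; the template name and index pattern are placeholders, and you'd merge this into your existing Logstash template rather than replacing it:

curl -XPUT "localhost:9200/_template/docvalues_example" -d '{
  "template" : "logstash-*",
  "mappings" : {
    "_default_" : {
      "dynamic_templates" : [
        {
          "strings_as_doc_values" : {
            "match_mapping_type" : "string",
            "mapping" : {
              "type" : "string",
              "index" : "not_analyzed",
              "doc_values" : true
            }
          }
        }
      ]
    }
  }
}'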

By reducing the number of replicas from 2 to 1 you'll save ~1/3 of segments.memory_in_bytes (say 2-2.3 GB of heap).

Try to monitor _stats and see what changes (maybe your fielddata grows rapidly).

As a temporary workaround you can clear the caches from cron.
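Roughly like this, assuming you only want to drop fielddata; the schedule, file path and host are just examples:

# /etc/cron.d/es-clear-fielddata (example: every hour)
0 * * * * root curl -s -XPOST "localhost:9200/_cache/clear?fielddata=true" > /dev/null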

Thank you guys for the tips! :slight_smile:

I'll monitor _stats and check it out. I've also started to reindex the indices which still use fielddata.

Right now the only symptom that I can notice is sudden high peaks in CPU usage (100%) and LOAD (~80%). Have you got any idea what most commonly causes them? :thinking:

My guess is that this is mainly GC (but it can also be any heavy search on old data). If you cannot use Marvel (license restriction) then use Zabbix with this script.

I'm a little concerned that you are stating your load as a percentage, which it generally is not; rather, it is a load average, an average of the number of runnable, running, or blocked-on-IO processes on the system. Perhaps a little reading up on this value will help you to understand it better. This is just a starting point:

I see from the messages here that you are looking at Elasticsearch and high-level system metrics, of course, and load. Have you taken a more holistic approach to looking at the servers? Elasticsearch is very demanding in its memory utilization, and that is often the cause of problems. However, Elasticsearch is also demanding in terms of CPU utilization beyond just GC, as well as disk and network IO. Shortfalls in disk or network IO capacity can drive the load average up very quickly.

You stated that you have a heap of 30 GB on a box with 128 GB of memory. Is the other memory allocated to other processes, or just to the disk cache? What does overall system health look like? Are you swapping? Is the system disk cache being impacted? Have you considered going beyond 30 GB of heap? I know this is controversial with the JVM, particularly between 32 and 48 GB, but you could reasonably go up to 64 GB. It is a matter of balance, though: if heap pressure is causing too much GC you need to expand the heap, but if your problem is IO pressure then that will not help.

You have a decent-sized cluster with substantial boxes. However, you are storing an even more substantial amount of data and, judging by your comment on closing indexes, you expect all of that data to be live and accessible to a search. Since this is an ELK deployment, a single search has the potential to touch every shard in your system, and if that happens you should expect what you are seeing, or worse. You have 54 TB of data with 1 TB of memory. That is fine as long as your active set fits within that 1 TB of memory, but if someone has recently started doing queries outside of the norm, such as longer-term trend analysis that loads more historical indexes, then you should expect problems, as you don't have enough memory for a live active index of that size, especially considering you have old indexes that do not use doc values.

Are you logging slow queries? Has the query mix changed recently? I had one client with a similarly sized cluster who was having a similar problem with a very simple query: *, or match_all. While this should be a simple query to answer, it was causing Elasticsearch to touch all shards, creating very high load on the system. I'm not sure if Elasticsearch has taken steps to optimize that simple query yet. It isn't a particularly meaningful query, but it was causing the client significant concern over the efficacy of Elasticsearch. They also had the same data indexed in Solr and it did not exhibit the same behavior.
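If the search slow log isn't enabled yet, it can be switched on dynamically; the threshold values below are just examples to tune for your workload:

curl -XPUT "localhost:9200/_settings" -d '{
  "index.search.slowlog.threshold.query.warn" : "10s",
  "index.search.slowlog.threshold.query.info" : "2s",
  "index.search.slowlog.threshold.fetch.warn" : "1s"
}'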

Have you taken any steps to limit the potential for a user to run a query that touches all data, especially a terms facet across a field that has many unique values and touches your older non-doc-values shards?

Finally, look deeper into your load average. Is it runnable processes that are driving this number, or processes blocked on IO? vmstat can be helpful in determining this.
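For example, sampling every 5 seconds (the column notes are the usual Linux vmstat meanings):

vmstat 5
# r  = runnable processes (CPU pressure)
# b  = processes blocked on IO
# wa = % CPU time spent waiting for IO
# iostat -x 5 (from the sysstat package) is also handy for spotting saturated disks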

Good luck, I hope you are able to find and resolve the issue.

The description of the ELK setup is not complete. I suspect you are using some kind of hypervisor and then running 64 GB VMs. Are you using the free vSphere hypervisor?

Can you provide detailed setup information, please?

Do you have any node restart strategy? I decrease memory usage by restarting nodes every day at different times; in my specific application that is very tolerable. Also, I put some very conservative limits on the circuit breakers to prevent users from being too eager.
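For reference, a sketch of conservative breaker limits set through the cluster settings API; the percentages are illustrative, not my exact values:

curl -XPUT "localhost:9200/_cluster/settings" -d '{
  "persistent" : {
    "indices.breaker.fielddata.limit" : "40%",
    "indices.breaker.request.limit" : "20%",
    "indices.breaker.total.limit" : "50%"
  }
}'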