Master node gradually runs into an OutOfMemory exception


#1

Hello,

We are currently running version 2.4.0, with 20 nodes in the cluster containing ~49,000 shards and ~17TB of data. I'm consistently seeing an issue where the master node gradually runs out of heap memory (24GB) over roughly a 10-day period. During each cycle the heap usage gradually climbs until the node can't handle any more. From the logs I could see that, as the master got very close to 100%, there was a sudden increase in unassigned shards; I'm pretty sure those were the shards on the master. In this case the master role switched to another node and the previous master didn't crash, but I could also see very long GC collections occurring right before the node hit 100%. We have been hitting this issue for months and I'm a little lost in terms of finding the root cause.

More recently I've been trying to monitor various statistics available via the nodes API. Unfortunately I'm not seeing much beyond the heap percentage gradually skewing towards 100% (see attached image). I was hoping I could get some help with breaking down the issue.

Is there anything in the node statistics that would help me understand what is gradually filling up the heap? I was looking at segment memory for the master and it looks pretty constant at ~4GB throughout the 10-day cycle. About a day before the master gave out I could see a spike in the query cache, but it only peaked at 163MB, which is pretty small compared to the overall 24GB heap. The number of open file handles looks constant to me as well.
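For reference, this is roughly how I've been pulling the heap numbers out of the nodes stats response. The sample response below is a hypothetical, heavily trimmed version of what `GET _nodes/stats/jvm` returns on 2.x; the node name and byte values are made up for illustration:

```python
import json

# Hypothetical, trimmed response from GET _nodes/stats/jvm (ES 2.x shape);
# node id, name, and byte counts are illustrative only.
sample = {
    "nodes": {
        "abc123": {
            "name": "-es5-abc01",
            "jvm": {"mem": {"heap_used_in_bytes": 20401094656,
                            "heap_max_in_bytes": 25769803776,
                            "heap_used_percent": 79}}
        }
    }
}

def heap_report(stats):
    """Return (name, used_pct, used_gb, max_gb) per node, worst first."""
    rows = []
    for node in stats["nodes"].values():
        mem = node["jvm"]["mem"]
        rows.append((node["name"],
                     mem["heap_used_percent"],
                     mem["heap_used_in_bytes"] / 2**30,
                     mem["heap_max_in_bytes"] / 2**30))
    return sorted(rows, key=lambda r: -r[1])

for name, pct, used, mx in heap_report(sample):
    print("%s %d%% (%.1f/%.1f GB)" % (name, pct, used, mx))
```

Polling this every few minutes and graphing the per-node `heap_used_percent` is what produced the attached image.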

Here are some observations and possible theories we have that I would like to test:

  1. I understand this cluster is way over-sharded at this point, so as a test we could turn off, say, half the shards in the cluster and run that way for a while to see if we're still experiencing this gradual increase in heap usage on the master.

  2. As I understand it, the master creates indexes and assigns their shards to nodes. Is it possible the master just can't keep up with the index-creation load? Is there a way to measure index creation on the master? I don't see any bulk, index, or merge thread pool rejections.
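One way to see whether the master is falling behind might be `GET _cluster/pending_tasks`, which lists cluster-state updates (index creation among them) still queued on the master. A sketch of summarising such a response, using a hypothetical sample whose task entries and timings are invented for illustration:

```python
from collections import Counter

# Hypothetical sample of GET _cluster/pending_tasks output; the task
# sources and queue times below are made up for illustration.
pending = {"tasks": [
    {"insert_order": 101, "priority": "URGENT",
     "source": "create-index [abc-live-2017-10-02-11], cause [auto create]",
     "time_in_queue_millis": 86},
    {"insert_order": 102, "priority": "HIGH",
     "source": "shard-started [[abc-live-2017-09-24-16][0]]",
     "time_in_queue_millis": 842},
]}

# Count pending tasks by type and find the longest-queued one
kinds = Counter(t["source"].split()[0] for t in pending["tasks"])
worst = max(t["time_in_queue_millis"] for t in pending["tasks"])
print(dict(kinds), "max queue time: %dms" % worst)
```

A steadily growing queue, or `create-index` tasks sitting there for seconds, would suggest the master can't keep up with cluster-state updates.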

Any thoughts on where to take this investigation to gain more insight? I'd like to try decreasing the number of shards, but in my mind it would be nice to first have a measure of why the heap is slowly increasing.


(Christian Dahlqvist) #2

It sounds like your master node also holds data. For a cluster that size I would recommend creating 3 dedicated master nodes if you have not already. These should only be responsible for managing the cluster and should not serve traffic. How much heap space they need will depend on the size of your cluster state, so it may make sense not to start too small. Since these nodes do not hold any data apart from the cluster state, heap can be set to a significantly higher proportion (75%-80%) than the usual 50% recommended for data nodes.

This separation of duties will also generally make it easier to identify bottlenecks and tune the cluster.


#3

Hi Christian,
This sounds like a great suggestion, a few follow up questions if I may:

What is the best way to measure the cluster state? Is that via the _cluster/state API? I know at least one of our indexes has a pretty large cluster state, I think possibly ~350MB. I measured by redirecting the cluster state for that index to a file and checking the file size.
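To compare how much each index contributes, one could also serialise each index's metadata from the `_cluster/state` response and rank by size. A minimal sketch, using a hypothetical, heavily trimmed state (the index names, mappings, and settings below are invented; a real response also carries routing tables, nodes, and blocks):

```python
import json

# Hypothetical, heavily trimmed GET _cluster/state response (assumed shape)
state = {"metadata": {"indices": {
    "abc-live-2017-09-24-16": {
        "mappings": {"doc": {"properties": {"msg": {"type": "string"}}}},
        "settings": {"index.number_of_shards": "1"}},
    "abc-daily-2017-09-24": {
        "mappings": {"doc": {"properties": {"msg": {"type": "string"},
                                            "host": {"type": "string"}}}},
        "settings": {"index.number_of_shards": "2"}},
}}}

def index_state_sizes(state):
    """Serialized metadata bytes per index, largest first."""
    sizes = {name: len(json.dumps(meta))
             for name, meta in state["metadata"]["indices"].items()}
    return sorted(sizes.items(), key=lambda kv: -kv[1])

for name, size in index_state_sizes(state):
    print(name, size, "bytes")
```

Indexes with large mappings (many fields) will dominate this list, which helps show where the cluster state is going.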

So our cluster is configured vertically: we have 5 nodes per server defined on 4 servers, for a total of 20 nodes. 2 of the servers (10 nodes) take on the bulk of the live traffic. At this point we have all of those nodes set to a 24GB heap. We have configured 6 of those 10 nodes to be master-eligible, and they are data nodes as well. The other two servers (less RAM/less disk space) are really just data nodes; they cannot become master.

Are you suggesting that we re-configure three of the nodes to be dedicated master only nodes? So this would leave us with 17 data nodes and 3 dedicated masters?

Also we have separated our cluster into 2 data stores (live vs. archived data). The 2 servers with master eligible nodes store only live data and they tend to have the most shards (~3000 at this point). I think you stated this before in a different post, but in your opinion do you think reducing this number to say 500-600 shards per node would improve things?

Here is a stack trace for the first OutOfMemory exception I saw in the log. This was a master node:

[2017-10-02 10:34:24,005][WARN ][index.engine ] [-es5-abc01] [abc-live-2017-09-24-16][0] failed engine [merge failed]
org.apache.lucene.index.MergePolicy$MergeException: java.lang.OutOfMemoryError: Java heap space
    at org.elasticsearch.index.engine.InternalEngine$EngineMergeScheduler$1.doRun(InternalEngine.java:1237)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Java heap space


(Christian Dahlqvist) #4

What is the specification of the servers you have got (CPU cores, RAM, storage)?

Are you using time-based indices? If so, how many indices/shards do you create per day (or other time period)? What is the average shard size? How long do you keep data?


#5

So a question about the dedicated master node configuration: in a 3-dedicated-master scenario, do all three masters share the workload, or is only one of them performing master duties at any one time? In other words, if we move to 3 dedicated masters, would that essentially leave 2 nodes idle (since they are no longer data nodes) until something happens that causes one of them to take over as master?

The two primary servers handling the live traffic each have:
396GB RAM, 36TB of SSD dedicated to Elasticsearch data split across 5 partitions (one partition per node), and 8 cores.

The other two servers each have:
196GB RAM, 4TB of SSD dedicated to Elasticsearch data split across 5 partitions (one partition per node), and 10 cores.

We have time-based indices:

3 kinds of indices creating 1 index/shard per hour: 72 shards daily
3 kinds of indexes created per day with 2 shards each: 6 shards daily
the remaining 5 indexes with one shard each: 5 shards daily

for a total of 83 shards per day on the live side.

At the same time we're ingesting archive data, which adds the 72 hourly plus the 6 daily shards again, for another 78.

That gives us around 160 shards per day.
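Just to show my working on that number, the daily shard creation from the counts above:

```python
# Daily shard creation, from the index breakdown above
hourly_kinds = 3       # 3 index kinds, 1 shard created per hour each
daily_2shard_kinds = 3 # 3 index kinds, 2 shards created per day each
daily_1shard = 5       # 5 remaining indexes, 1 shard per day each

live = hourly_kinds * 24 + daily_2shard_kinds * 2 + daily_1shard  # 72 + 6 + 5 = 83
archive = hourly_kinds * 24 + daily_2shard_kinds * 2              # 72 + 6 = 78
print(live, archive, live + archive)  # 83 78 161
```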

We have different retention periods, but for the hourly indexes we wanted to keep the data for 3 years. That would be a huge number of shards, though. I recently took a 6-month sample and found the hourly indexes average 600KB, 16MB and 30MB in size. For the second group (the indexes with 2 shards) they average 15GB, 11GB and 11GB. For the third group of 5 they average 53KB, 2GB, 48MB, 16MB and 9KB.


(Christian Dahlqvist) #6

When using dedicated master nodes you should always aim to have 3 of them. Only one of these will be active as the master at any point in time, while the others monitor the current master, ready to take over should it fail. You should not send any traffic to these nodes; the fact that they are less likely to suffer from memory pressure or long GC pauses is what helps increase the stability of the cluster. The good thing is that they usually do not need to be very powerful and can be much smaller than your data nodes: 1-2 CPU cores and 4GB-8GB of heap is usually enough, assuming you do not have an excessively large cluster state. You might therefore be able to add these nodes to the servers without removing any data nodes.
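For a 2.x cluster, a minimal sketch of the relevant elasticsearch.yml settings for such a node might look like this (adjust the quorum value to your own number of master-eligible nodes):

```yaml
# Dedicated master: manages the cluster, holds no data, serves no traffic
node.master: true
node.data: false
# With 3 master-eligible nodes, require a majority (2) for election
# to avoid split brain
discovery.zen.minimum_master_nodes: 2
```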

When partitioning large servers it is common to divide them up into 64GB units, as this optimises heap usage (~32GB heap, i.e. 50% of the unit's RAM assigned to heap). You might therefore benefit from running fewer, larger nodes on the smaller servers.
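As a rough worked example of that sizing rule, applied to the larger servers described above (the "round down" choice and the just-under-32GB heap cap to keep JVM pointers compressed are my assumptions):

```python
# Partitioning a 396GB server into ~64GB node units (assumption: round down)
ram_gb = 396
unit_gb = 64
nodes = ram_gb // unit_gb                 # 6 node units instead of 5 nodes
heap_per_node_gb = min(unit_gb // 2, 31)  # ~50% of the unit, capped just
                                          # below the ~32GB compressed-oops limit
print(nodes, heap_per_node_gb)  # 6 31
```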

Hourly indices are useful when you either have huge data volumes coming in or need to cater for a very small retention period so that daily indices become impractical. Otherwise they are very inefficient and will quickly add to the size of the cluster state. If you want to keep this data around for 3 years, you might be better off with monthly rather than hourly indices.
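To put numbers on that, here is the shard count held over the full 3-year retention for the 3 hourly index kinds versus the same data in monthly indices (assuming 1 shard per index, as in the current hourly setup):

```python
# Shards alive over a 3-year retention (assumption: 1 shard per index)
kinds = 3
hourly = kinds * 24 * 365 * 3   # one index per kind per hour
monthly = kinds * 12 * 3        # one index per kind per month
print(hourly, monthly)  # 78840 108
```

That is a reduction of nearly three orders of magnitude in cluster-state entries for the same data.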

It seems like you generally have very small shards for the rest of the indices as well, which is inefficient. Please read this blog post on shards and sharding and revisit how you organise your data.


(system) #7

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.