How does heap shortage in the masters affect datanodes?

Hi,

Versions:

  • Ubuntu 14.04
  • ES 1.4.2 from elastic repos
  • Java 1.7.0_95

Cluster:

Index:

  • Single index with 60 shards, 1 primary and 2 replicas, shards ranging from 10G to 40G
  • 260 million documents
  • not using doc values yet (looking into it; see the sketch after this list)
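For anyone curious, here is a hedged sketch of what enabling doc values would look like on ES 1.4. The type and field names (my_type, status) are placeholders, doc values only apply to not_analyzed strings and numeric/date fields, and an existing field cannot be switched without reindexing:

# Hypothetical example: add a new not_analyzed field with doc values enabled
curl -XPUT 'http://localhost:9200/my_index/_mapping/my_type' -d '{
  "properties": {
    "status": {
      "type": "string",
      "index": "not_analyzed",
      "doc_values": true
    }
  }
}'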

The problem presented itself after attempting to replace the r3.large masters with m3.large ( http://www.ec2instances.info/?filter=m3.large ) with 4G of heap configured. The 3 masters exhibited the following behavior:

In essence: over 75% heap usage and an increased amount of GC and CPU (probably due to GC)

At the same time, every datanode exhibited the following behavior:

i.e. the same behavior. During the time the smaller masters were in charge of the cluster we saw it directly affecting the rest of the cluster. We replaced the masters with m3.xlarge instances ( http://www.ec2instances.info/?filter=m3.xlarge ) because we weren't sure at the time whether the problem was CPU or heap. Now it looks to me as if heap pressure was the problem: you will notice in the last image that the datanodes' heap usage is back to around 75% and old GC is back to almost nothing.
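In case it is useful to anyone reading along, a hedged way to check the same numbers without our dashboards is the ES 1.x nodes stats API (the field paths below are from its standard JVM section):

# Per-node heap usage and GC counters
curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty'
# Compare jvm.mem.heap_used_percent and jvm.gc.collectors.old.collection_count
# between the masters and the datanodes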

A few questions then:

  • Why did the heap pressure on the masters affect the entire cluster (in such a way that every datanode also showed heap pressure)?
  • In this particular use case, what would be the ideal master setup? My hunch says memory is the key factor here (as far as I can tell the active master's process is single-threaded, so it doesn't really matter how many cores the machine has?), and having 7G of heap configured seems to have "fixed" this issue
  • Is it particularly harmful for the active master to be serving requests? i.e. should I remove the active master from the configured URLs in the services using ES?
  • Any other general suggestions with regard to index/shard sizing are appreciated

Thank you all.

How big is your mapping for this monolithic index?

Hi @warkolm thanks for responding,

We have 9 pretty big mappings for that particular index. I would paste them here, but:

curl -XGET http://localhost:9200/my_index/_mapping?pretty > /tmp/mappings.json
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 85.5M  100 85.5M    0     0  15.1M      0  0:00:05  0:00:05 --:--:-- 20.3M

So not easy to share. I can tell you, though, that the mappings have on average 75 properties each.

That's quite large, especially if there are 9 of those! ES needs to store all of that in the cluster state, which is held in memory. It does compress it, but that's still pretty big.
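If you want a rough sense of how big it is, a hedged check is to measure the serialized cluster state; the on-heap form is compressed binary, so this is only an upper-bound indication, not the exact heap cost:

# Approximate size of the serialized cluster state in bytes
curl -s 'http://localhost:9200/_cluster/state?pretty' | wc -c
# Metadata only (mappings, settings, aliases), which is usually the bulk of it
curl -s 'http://localhost:9200/_cluster/state/metadata?pretty' | wc -c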

You're on a pretty old version, can you upgrade?

Looking into the upgrade path, possibly to 1.7.5, which I am sure will help.
With regard to my original question, are you suggesting the heap needed by the master is directly tied to the number and size of the mappings?

It's likely. What was heap use like before the change? Or do you not have metrics from that time?

I do, actually. Here's what the old r3.large masters looked like before we attempted to replace them with the smaller instances:

The graphing is kind of messed up because the sampling window is 1 month. You can tell the heap usage on the active master (green) is steady at 75% and the other two are much lower.

If you are sending all requests through these nodes, they are not truly dedicated master nodes, but rather master/client nodes. It is recommended that dedicated master nodes do not serve traffic and are instead left to just manage the cluster. This will prevent them from suffering from memory pressure and intense garbage collection, and will make the cluster more stable.

If you instead set up dedicated client nodes, you may be able to reduce the size of the dedicated master nodes as they will be doing a lot less work.
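As a rough sketch (paths assume the stock Ubuntu/deb packaging, and the heap value is only an example), a dedicated client node on ES 1.x is simply a node that holds no data and is never master-eligible:

# elasticsearch.yml: route and coordinate requests only
cat >> /etc/elasticsearch/elasticsearch.yml <<'EOF'
node.master: false
node.data: false
EOF
# Heap can be a large share of RAM since there is no data to leave file system cache for
echo 'ES_HEAP_SIZE=4g' >> /etc/default/elasticsearch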

Interesting. My initial thought was that resource usage on the masters was so low that they could be used for routing, especially the passive masters since they are doing close to nothing. If your suggestion is that I should have dedicated routing (client) nodes, then what would be the minimum requirements to run these?

Truly dedicated master nodes can often be considerably smaller than the ones you used, and can have heap as a larger portion of the available memory (~75%) than data nodes, as they do not need as much file system cache as nodes that hold data. Instance types suitable here may be t2.medium or m3.medium, unless the cluster state is very large.
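To make the distinction concrete, a hedged sketch of the master-only configuration on ES 1.x (again assuming the deb packaging; the heap value is just an example for a machine with roughly 4G of RAM):

# elasticsearch.yml: master-eligible, holds no data, serves no client traffic
cat >> /etc/elasticsearch/elasticsearch.yml <<'EOF'
node.master: true
node.data: false
EOF
# ~75% of RAM is fine here since the node holds no data
echo 'ES_HEAP_SIZE=3g' >> /etc/default/elasticsearch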

The same recommendation around heap size also applies to client nodes, as they also do not store data. In addition to acting as routers, these nodes coordinate requests and perform the final aggregation steps once the data from the underlying shards has been gathered, which is why they can suffer from garbage collection depending on the workload. These are more difficult to size, but a good starting point may be the instance types you previously used, together with the higher heap setting.


Ah, here we go, we're converging back to the size of the cluster state; I think that's my key problem. I believe I have 2 final questions:

  1. Is there a formula for sizing the JVM heap according to the cluster state? i.e. can I take the mappings and somehow get a number of GB I absolutely need for my heap size?
  2. Is the active master process single-threaded? So far I think so, but I haven't dug through the code to make sure.

Many thanks.

  1. Not really; the cluster state is kept in binary form in memory, so the JSON output is simply a human-readable version.
  2. What do you mean by master process exactly?

In my case I have 3 masters, but only one is active at any given time. The active master shows CPU usage on a single core all the time, which to me means that the ES master process is always bound to a single CPU, but I am not 100% positive.

Check hot_threads against that node to see what is happening.
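For example (the node name in the path is a placeholder for your active master):

# Hot threads for a specific node; the output is plain text, not JSON
curl -s 'http://localhost:9200/_nodes/<active_master_name>/hot_threads'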

Sorry, I didn't mean to say one core is always pegged; I meant to say that when there is activity on the active master it's always a) brief and b) only on a single core. It's kinda hard to grab a hot_threads dump when it spikes so briefly.
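A workaround I may try (just a sketch; the interval and output path are arbitrary) is polling hot_threads in a loop so the brief spikes still get captured:

# Sample hot_threads every second with a timestamp per sample
while true; do
  date >> /tmp/master_hot_threads.log
  curl -s 'http://localhost:9200/_nodes/<active_master_name>/hot_threads' >> /tmp/master_hot_threads.log
  sleep 1
done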

Most of the work a dedicated master node does is probably linked to maintaining the cluster state, which could very well be single threaded in nature. It is generally not very CPU intensive, which is why I recommended using instances with relatively little CPU.

Cluster state updates are single-threaded, to ensure consistency.
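If you want to see that thread in action, one hedged way (the pool name clusterService#updateTask is my assumption for 1.x, so verify it against your own dump) is to grep for it in hot_threads output on the active master:

# Look for the (assumed) single cluster state update thread on the active master
curl -s 'http://localhost:9200/_nodes/<active_master_name>/hot_threads?threads=10' | grep -i 'clusterService#updateTask'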