Shard size and count: how to stay fast with less GC?

Hey everybody!

We have a "tiered" setup where T1 has the warmest data and T2 keeps the rest.

There are 24 data nodes in the cluster with 3 masters (that are not being queried).

T1 has 6 machines and T2 has the rest (18); they are all m4.4xlarge EC2 instances. All the data nodes have 1000gb General Purpose SSD EBS disks:

  • m4.4xlarge
    • 64gb ram
    • 16 cores

Elasticsearch has 30gb allocated for the heap with mlockall enabled.

We have 6 shards and 1 replica per shard for each index.
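
(Concretely, the per-index settings look roughly like this:)

{
    "settings": {
        "number_of_shards": 6,
        "number_of_replicas": 1
    }
}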

Our indices are split up by weeks and T1 currently serves the last 4 weeks, so the topology looks like this right now:

T1:

  • index-w-2015.31
    • shard size: 42.3gb
    • shard size: 42.2gb
    • shard size: 42.8gb
    • shard size: 42.5gb
    • shard size: 42.1gb
    • shard size: 42.2gb
  • index-w-2015.32
    • shard size: 47gb
    • shard size: 46.8gb
    • shard size: 45.8gb
    • shard size: 47.1gb
    • shard size: 47gb
    • shard size: 46.7gb
  • index-w-2015.33
    • shard size: 49.5gb
    • shard size: 48.8gb
    • shard size: 48.3gb
    • shard size: 48.5gb
    • shard size: 48.6gb
    • shard size: 48.7gb
  • index-w-2015.34 # Week 34 is not over yet so it's considerably smaller
    • shard size: 27.9gb
    • shard size: 25.8gb
    • shard size: 25gb
    • shard size: 25.3gb
    • shard size: 26.6gb
    • shard size: 26gb

T2:

  • index-w-2015.30
  • ...
  • index-w-2015.00
  • index-percolator

Indices are getting larger (both doc count and storage wise) every week. The current plan is to add a node to T2 every month, slowly add more disks as needed, and keep a year's worth of data in the cluster (that will probably change in January, though).

I have tried putting 3 search nodes (data: false, master: false) in front of the cluster, but that didn't help with response times, heap usage or GC times.

I am trying to figure out whether having more shards per index would decrease heap usage and GC times. Are the shards simply too big?

T1 is always GCing while T2 rarely does, which results in the T2 heaps flatlining in the high 80s/mid 90s percent used.

We're getting really slow response times in our app, and it correlates with GC/CPU spikes in the Elasticsearch cluster.

I've attached some of the graphs that I've been looking at for a good while now.

The durations and counters for GC are turned into something more useful with the derivative function from Graphite: http://graphite.readthedocs.org/en/latest/functions.html#graphite.render.functions.derivative

System metrics for both tiers (left is T1, right T2)

T1 GC/Heap info

T2 GC/heap info


A pretty normal view of the machines with high heap usage

I'm assuming you are using a lot of aggregations (either in your app, or via Kibana)? What do your queries look like?

Assuming you are using a lot of aggregations, the problem is likely that you just don't have enough memory for the volume of data and the size of the cluster. The pre-2.0 default behavior is to use Field Data, which is basically an in-memory columnar representation of the fields you are aggregating on. It uses a lot of compression tricks to save space, but at the end of the day it's still basically just naively loading data into memory.

More shards won't really help, since you'll just be splitting the same data into smaller pieces to spread around the cluster. The total amount of data loaded to memory will be the same regardless.

The only legitimate solutions are to add more memory (either physically or in the form of more nodes), reduce the number of fields loaded into memory (and verify you aren't loading huge analyzed strings), or switch over to Doc Values for future data. There are some other hacky approaches, but they are really just band-aids and you'll still OOM eventually.

I'd personally recommend Doc Values. They are essentially the same columnar representation as Field Data, but built at index time and serialized to disk. This allows them to stream on/off disk as you run low on memory, and the OS naturally keeps hot segments cached, so you'll still have good performance even if it has to page out to disk now and then. That is always preferable to hitting a long GC or an OOM.

Doc Values are enabled by default in 2.0, and we view them as the long-term solution; they should continue to improve over time. You can usually reduce your total heap size and give more to the OS (since the ~20gb used by Field Data is no longer needed), which also helps GC speed ... CMS is much happier with smaller heaps.

So... I'd personally swap your index templates over to doc values so that future indices have them enabled; you should start to see memory pressure reduce over time.
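
As a rough sketch (not your actual mappings; the index pattern and field names here are placeholders), enabling doc values per field in a 1.x index template looks something like this, for not_analyzed strings and numeric/date fields:

{
    "template": "index-w-*",
    "mappings": {
        "_default_": {
            "properties": {
                "some_keyword_field": {
                    "type": "string",
                    "index": "not_analyzed",
                    "doc_values": true
                },
                "some_date_field": {
                    "type": "date",
                    "doc_values": true
                }
            }
        }
    }
}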

I will chime in with some details.

You're right about the aggregations, we do a lot of them. However, we already switched to doc values a while ago, as we were having issues trying to fit all that data in memory. Max fielddata cache use over the last 24 hours is 2MB, which I think is alright. That is probably not what is causing the excessive GCing.
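
(For reference, per-node fielddata usage can also be spot-checked with something like the cat fielddata API:)

curl -XGET 'localhost:9200/_cat/fielddata?v'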

We do a fair number of aggregations, with several being requested by the client on each load, some nested, some not, and quite a few time series.

I have an example query here. Our queries are normally executed via msearch, as we very often want to have multiple separate queries executed at the same time. There are a few cases where a regular search would be fine, but the abstraction layer doesn't really take that into account.
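
(The real queries are too large to paste inline, but schematically the body we send to _msearch is just alternating header and query lines; a trimmed-down example with made-up field names would look roughly like this:)

{}
{"size": 0, "aggs": {"per_day": {"date_histogram": {"field": "timestamp", "interval": "day"}}}}
{}
{"query": {"match": {"message": "example"}}, "size": 10}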

All in all, doc values are not our solution, as we already use them. Perhaps there is something else we can do?

Oh, well hmm. Glad you've switched to doc values, and yeah 2MB is wonderfully low! But that removes the easy answer :smile:

Would you be able to gist up a copy of your Node Stats API output? That'd be the fastest way to home in on what's eating up your memory.
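
Something along these lines will dump it:

curl -XGET 'localhost:9200/_nodes/stats?pretty'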

  • Do you control the size of your bulk requests? E.g. is it possible for a 1gb bulk to be sent?
  • Using parent/child?
  • What version of ES are you on?
  • Do you have very large mappings with a lot of full text fields (even if you don't aggregate on them)?

To add a question: are you using lots of scripts? If so, do you pass in named parameters or pass parameters in the script itself?

Thank you for your reply @polyfractal. I've tried summing up your questions here. However, I'm not sure what you mean by the first point, could you elaborate on that?

Node stats output https://gist.github.com/Dinoshauer/74da3795f75bf72a2180

Using parent/child?

  • Nope

What version of ES are you on?

  • 1.5.2

Do you have very large mappings with a lot of full text fields (even if you don't aggregate on them)?

  • All the mappings are minor variations on the same basic model. It contains one big text field and a lot of smaller ones, most of which we keep unanalyzed, but a few that we want to match somewhat fuzzily on, which we then also keep analyzed (via fields). We have the mapping here: Base mapping https://gist.github.com/Dinoshauer/810cebf5d4a1f45b00f2

@Clinton_Gormley we don't use any scripts, no. We do have a custom plugin that is used for writes, though.

Sometimes people build bulk requests based on the number of documents, e.g. send a bulk request every 1,000 documents. This can be dangerous, since 1,000 docs that are 1kb each is very different from 1,000 docs that are 1mb each: the first is a ~1mb request, the other is ~1gb.

Internally, the entire bulk request has to sit in memory until it is sliced and shipped out to the associated nodes. So if you are accidentally sending very large bulk requests, it's possible to eat up a bunch of transient memory. E.g. imagine 10 concurrent threads sending bulks that are 1000mb each.

Not saying this is the problem, but I'd do an audit of the code just in case, to rule out poor memory usage / leaks. ES offers no protection for plugins...it assumes the plugin is working correctly.

I suspect the problem is related to norms and your nested documents. For example, the node listen-es-t20-c-05 is using 18gb in segment memory ("memory_in_bytes": 18687223100). All of the T2 nodes have similar usage (12-18gb), while the T1 nodes are around 8gb.

In Lucene 4.x (which Elasticsearch 1.x is based on), when segments are merged together, the norms data is loaded for all fields in the segment. If you have a very large mapping (which you do) with many different fields, all those norms are loaded en masse for all segments in the merge. This can lead to pretty substantial memory usage if you have large segments.

This explains your lumpy heap usage and GCs on those indexing nodes, since merges are triggering the norms to be loaded. And on your T2 nodes, the norms are lazily-loaded for queries, which explains the relatively static heap.

The good news is that Lucene 5.x (which Elasticsearch 2.x is based on) fixes this issue. When merging, Lucene will only load the norms data for a single field at a time. There is also better reporting so you can more easily determine what's eating the memory. That'll help your T1 nodes, but your T2 nodes will still have high heap usage since the norms are being used for queries.

Until 2.0, the only options are to disable norms (omit_norms: true) on fields that don't need them, or to get more memory/nodes. You could also split your types into more indices, since that will reduce the number of fields that need to be loaded for each merge. There's really no other workaround, since it's fundamentally how Lucene 4.x works.
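
As a sketch, disabling norms on an individual field looks roughly like this in the mapping (the field name here is just an example):

{
    "properties": {
        "description": {
            "type": "string",
            "omit_norms": true
        }
    }
}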

Another issue that affects heap usage, although not as much, is that complex nested hierarchies need multiple fixed bitsets to keep track of the relationships, and this can eat up heap / cause GC. Sorting nested objects can also be expensive, but that has been improved in 2.0 (see that ticket for more details). Based on your memory figures, it's a minor contributor, but you're losing ~1gb per node to it.

Thanks for your tips @polyfractal - We're doing an audit on the plugin now and will see if we can disable norms on some fields :smile:

I'll get back to you with our findings as soon as we have some

@polyfractal we have been looking into disabling norms for practically all fields, but we have encountered some things that puzzle us a bit.

  1. Can we disable norms across the board and then explicitly enable them for just a few fields? We have only found documentation for disabling them on individual string fields.

  2. As far as we can see in the docs, norms are a string-field attribute, and really only something called term length norms. However, that does not correspond to the results of our experiments: it seems all fields have norms, and results differ when we disable them, including nested, datetime, boolean, and numeric fields. According to the docs, norms are disabled by default on not_analyzed fields, but when we explicitly disable norms there too, scoring also differs. String-only attributes should not really have an effect on non-string types, but they do, so what are we missing? Is it in the docs somewhere?

You can set up a dynamic template (see the reference docs) which automatically disables norms for all string fields. Dynamic templates are superseded by explicit mappings, so you can then toggle norms back on for the few fields that need them (see the example after the template below):

{
    "template_strings_no_norms" : {
        "mapping" : {
            "type" : "string",
            "omit_norms" : true,
            "index_options" : "docs"
        },
        "match" : "*",
        "match_mapping_type" : "string"
    }
}

Dynamic templates are matched in the order they are specified, so if you have other templates, you can just stick this at the bottom as a default fall-through.
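
And since explicit mappings win over the dynamic template, a field that should keep its norms can simply be mapped directly; a hypothetical example (field name made up):

{
    "properties": {
        "title": {
            "type": "string",
            "omit_norms": false
        }
    }
}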

Note: Dynamic templates only apply to newly created/mapped fields. I don't think there is a way to disable existing fields en masse without just writing a script or something to do it automatically (maybe open a ticket?)

Two factors here. First, once you disable norms...they don't actually get removed until they are merged out. From the reference docs:

Please however note that norms won’t be removed instantly, but will be removed as old segments are merged into new segments as you continue indexing new documents. Any score computation on a field that has had norms removed might return inconsistent results since some documents won’t have norms anymore while other documents might still have norms.

So when you disable norms on your T1 nodes, they'll be in a mixture of norms + no norms, which will skew scores for a while until they all get merged out. And your T2 nodes probably won't merge the norms out, since you aren't actively indexing to them. You'll have to do a force-merge to clean those up.

Second, it depends on your searches (and if they are going cross-index too). If your searches combine multiple fields (string + numeric, etc), you'll see the overall search results change. Crudely, you can think of Lucene as finding the score for each part of the query and then multiplying them together + some normalization.

When you remove norms from the string fields, their contribution to the weight changes, which will affect the overall score of each document and probably change how things are sorted. If before it was:

(string_score: 0.5 * string_norms: 0.5) * (numeric_score: 1) = 0.25

without norms:

(string_score: 0.5) * (numeric_score: 1) = 0.50

Now, if you are seeing changes in searches that only look at numerics, it could be something else and I'd have to follow up with some more hardcore Lucene folks :slight_smile:

Hey @polyfractal just to follow up.

Every time we've tested the norm "distribution" we've done it on a clean instance using Docker, so we wouldn't have to wait on merges, etc.

The flow was create container -> put template -> upload data -> run the same query

As far as I can tell from the docs, we should be able to use the Optimize API once we get this into production to more or less "force" a merge of the segments, no?

Yep. Running the Optimize API will force Lucene to merge down segments.

One caveat if you have already optimized your index down to one segment in the past (e.g. old historical data): Lucene won't "re-merge" a single segment just because you disabled norms. You'll have to index a "dummy" document (or update an existing doc) to spawn the creation of a new segment, then run Optimize. Lucene will merge in the new segment and drop the norms at the same time.
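
As a rough example (the index and type names here are just illustrative):

# index a dummy doc so the old single-segment index has something new to merge with
curl -XPUT 'localhost:9200/index-w-2015.30/dummy/1' -d '{"dummy": true}'
# then force a merge down to one segment, which drops the disabled norms
curl -XPOST 'localhost:9200/index-w-2015.30/_optimize?max_num_segments=1'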

Hey @polyfractal - Thank you for your tips, they seem to have helped :slight_smile:

I haven't seen OOM errors or really high heap since we updated the schema a week or 2 ago across the cluster!
We're still auditing the plugin and I am currently upgrading the cluster from 1.5 to 1.7 for good measure.

But thank you so much for your help here!

Woo, glad to hear it! Always nice to hear a happy resolution to these tricky cases :slight_smile: