Choosing the right heap size for warm data nodes


One of our production clusters has 40 hot data nodes (8 cores, 64 GB RAM, of which 30.5 GB is heap, and 1.5 TB of local SSD storage) and 30 warm data nodes (8 cores, 32 GB RAM, of which 16 GB is heap, and 5 TB of attached HDD storage). Recently, as the warm data nodes filled up with more data and crossed 1.6 TB of used storage, we began seeing an interesting pattern: Java heap usage hovers around 14-16 GB, CPU sits around 15%, and a lot of time is spent in JVM GC on collection-old (almost none on collection-young).

As an experiment, we allocated 24 GB of heap on one of the warm nodes (75% of its total memory) and let the cluster initialize and rebalance the unassigned shards. The results were remarkable: Java heap usage settled back at ~14 GB, CPU dropped to ~2%, GC time on collection-old dropped to zero, and collection-young is now roughly a third of what collection-old used to be before the heap increase.
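For reference, the heap change in this experiment is the standard jvm.options edit (the values shown are the ones from our test node):

```
# config/jvm.options on the warm node under test
-Xms24g
-Xmx24g
```

Note that 24 GB still keeps the heap below the ~32 GB threshold at which the JVM loses compressed object pointers, so each heap object stays small.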

The following graphs show the behavior described above; the settings were changed at 12:07:

As discussed on Cold data node search performance, it may make sense to allocate more than 50% of RAM to the JVM heap on warm nodes.
Does this setup make sense?
What are the penalties of allocating 75% of the total memory to the JVM heap?
What other concerns should we have with such a setup, e.g. query performance or cluster maintenance (deleting indices, moving shards to and from the warm nodes, changing index settings, etc.)?

Your input is appreciated :slight_smile:


I believe the best practice is still to not allocate more than 50% of available RAM to the heap, so that enough memory is left over for the operating system's page cache. I do not know what the effects of going beyond this are, as I have never tried it.

As you have noticed, optimising heap usage is important when trying to fit as much data as possible onto a data node. Are you following these best practices?

  • Make sure that your average shard size is as large as possible, typically between 20GB and 50GB. See this blog post for a discussion of why.
  • Make sure you forcemerge indices that are no longer written to, in order to reduce the number of segments to one per shard.
  • Do not send queries directly to these high-density data nodes, so less heap is required for query coordination (use hot or dedicated coordinating nodes instead).
  • Make sure your mappings are optimised so that you minimize the use of field data (used for text fields), as this uses heap.
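For the forcemerge point, a typical request (console syntax; the index name here is a placeholder) looks like:

```
POST /logs-2018.05/_forcemerge?max_num_segments=1
```

This rewrites the shard's segments down to a single segment, which reduces per-segment heap overhead; it should only be run against indices that are no longer being written to.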

Hi @Christian_Dahlqvist, thanks for your quick reply.

Doesn't Elasticsearch automatically run a merge operation on inactive indices? I'd expect this to behave similarly to how it runs a synced flush (i.e. …)

Two other points -
Queries: we have set up our queries to only run on dedicated client nodes without sniffing, so no queries should be sent to the warm nodes.
Field mapping: as we index our customers' data and cannot control which fields / types are sent, we rely heavily on dynamic mapping. We've made sure we do not use fielddata on unnecessary fields, but some of our indices do have large field mappings. Which optimizations are you referring to?


Elasticsearch automatically merges segments, but once you are done indexing into an index you can do a forcemerge to further consolidate segments, which can save heap.


If you rely on dynamic mapping for string fields, they will be indexed both as text (which uses field data and therefore heap) and as keyword. If you need to aggregate on fields but not perform free-text search against them, it is recommended to index them only as keyword.
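As a sketch of what that could look like (the template and index-pattern names are made up for illustration), a dynamic template that maps new string fields as keyword only, instead of the default text-plus-keyword multi-field:

```
PUT /_template/strings_as_keyword
{
  "index_patterns": ["customer-*"],
  "mappings": {
    "doc": {
      "dynamic_templates": [
        {
          "strings_as_keyword": {
            "match_mapping_type": "string",
            "mapping": { "type": "keyword" }
          }
        }
      ]
    }
  }
}
```

This assumes a single mapping type named `doc`, as is conventional on 6.x; fields that genuinely need full-text search would be carved out with more specific templates.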

Which version of Elasticsearch are you using? What is your average shard size?

You can also run the cluster stats API to get an overview of some of the things using heap.
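For reference, that is (console syntax; `?human` adds the human-readable size strings):

```
GET /_cluster/stats?human
```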


We tightly defined our dynamic mapping and analyzers to fit our needs; as we use both aggregations and full-text search, we do keep both text and keyword representations. We have, though, limited the keyword values to 70 characters (which I guess only affects disk size, since the values are truncated while the index's field-mapping size stays the same).

We're using (almost) the latest version, v6.2.4.

We have at least one shard per customer, and some of our customers send only very small amounts of data, so the shard size variance is pretty high: all the way from almost empty shards up to 40 GB shards for larger indices. For example, this is the shard size distribution across this cluster's warm nodes:

Here is the segments data reported through this API:

"segments": {
"count": 250739,
"memory": "99.3gb",
"memory_in_bytes": 106627431013,
"terms_memory": "82.7gb",
"terms_memory_in_bytes": 88805708430,
"stored_fields_memory": "12.6gb",
"stored_fields_memory_in_bytes": 13574176176,
"term_vectors_memory": "0b",
"term_vectors_memory_in_bytes": 0,
"norms_memory": "1gb",
"norms_memory_in_bytes": 1086384448,
"points_memory": "2.1gb",
"points_memory_in_bytes": 2299584355,
"doc_values_memory": "821.6mb",
"doc_values_memory_in_bytes": 861577604,
"index_writer_memory": "3.5gb",
"index_writer_memory_in_bytes": 3864972412,
"version_map_memory": "1.2kb",
"version_map_memory_in_bytes": 1248,
"fixed_bit_set": "0b",
"fixed_bit_set_memory_in_bytes": 0,
"max_unsafe_auto_id_timestamp": 1529361400851,
"file_sizes": {}

As the data represented here is cluster-wide, I am not sure how it should be interpreted in the context of a specific type of node (e.g. only warm nodes, and not client / master / hot nodes). Do you have any guidelines regarding some of these values?
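(One way we could perhaps scope this, if I understand the nodes stats API correctly, is to request segment stats per node; the node name below is a placeholder:)

```
GET /_nodes/warm-node-01/stats/indices/segments
```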


It seems like you have a reasonably large portion of your heap used for field data, and that the vast majority of your shards are quite small. If that cannot be changed, I cannot think of much left to optimise apart from ensuring that older indices are force merged.

We'll look into our dynamic_mapping settings and try to optimize them accordingly, and do the same with our shard sizes and rollover policy.

Do you know how the force_merge command is implemented?
I have 3 main concerns:

  1. Is it a cluster task? If so, what priority is the task assigned? Note: as discussed here, adding urgent cluster tasks can cause put_mapping on indexing hot nodes to fail. As this pull request was only merged into 6.3, we may experience replica inconsistencies and enlarged shards.

  2. How heavy is the task on the nodes holding the merging shards? If a node holds more than one shard of the index, does it perform both merges concurrently? (assuming we only have one merge thread per shard on indices residing on warm nodes)

  3. How is querying affected during the force_merge? Does query performance improve after the merge has finished?


Hey @Christian_Dahlqvist, any insights on these issues?

No, I am not familiar with the internals, so I will leave that for someone else. Forcemerging can be very I/O intensive, so it can easily impact other activity on the cluster.

@Christian_Dahlqvist thanks for the reply. Do you have any suggestions about who could help here, so we can tag them and proceed?


You seem to have handled some issues and pull requests regarding force_merge. I was wondering whether any of you has insight into the force_merge questions above.

Thanks and best regards,

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.