Elasticsearch heap memory not freed up

Hi,

We are doing some load testing on Elasticsearch. Our data is log data.

The test server has 1 TB of data and two Elasticsearch nodes with a 15 GB heap each.

When testing with randomly generated queries, I noticed that heap usage keeps increasing. And when I stop the queries, the memory (the old gen pool) is still not freed after about an hour.

I would like to know how Elasticsearch or Lucene uses heap memory, and when the old gen data would get freed.

Thanks,

Can anyone help?

It usually garbage collects once it gets to 75% used. Is there anything in the Elasticsearch logs about garbage collection?
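
If it helps, here is a minimal sketch of polling heap usage and old-gen GC activity through the nodes stats API with Python and requests (the localhost:9200 address and lack of authentication are assumptions, adjust for your cluster):

```python
# Minimal sketch: poll heap usage and old-gen GC activity via the nodes stats API.
# Assumes a node reachable at http://localhost:9200 with no authentication.
import requests

resp = requests.get("http://localhost:9200/_nodes/stats/jvm")
resp.raise_for_status()

for node in resp.json()["nodes"].values():
    mem = node["jvm"]["mem"]
    old_gc = node["jvm"]["gc"]["collectors"]["old"]
    print(f"{node['name']}: heap {mem['heap_used_percent']}% used "
          f"({mem['heap_used_in_bytes']}/{mem['heap_max_in_bytes']} bytes), "
          f"old GC count {old_gc['collection_count']}, "
          f"old GC time {old_gc['collection_time_in_millis']} ms")
```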

Thanks for replying. I can see GC is running constantly, but the problem seems to be that it is not able to free the old gen data. For example, the old gen pool stays full:

{[old] [10.7gb]->[10.9gb]/[10.9gb]}

[2017-11-01T02:19:30,680][INFO ][o.e.m.j.JvmGcMonitorService] [host1-1] [gc][586] overhead, spent [291ms] collecting in the last [1s]
[2017-11-01T02:20:45,799][INFO ][o.e.m.j.JvmGcMonitorService] [host1-1] [gc][661] overhead, spent [321ms] collecting in the last [1s]
[2017-11-01T02:23:15,016][INFO ][o.e.m.j.JvmGcMonitorService] [host1-1] [gc][810] overhead, spent [287ms] collecting in the last [1s]
[2017-11-01T02:35:51,645][INFO ][o.e.m.j.JvmGcMonitorService] [host1-1] [gc][1566] overhead, spent [342ms] collecting in the last [1s]
[2017-11-01T02:44:42,902][INFO ][o.e.m.j.JvmGcMonitorService] [host1-1] [gc][2097] overhead, spent [257ms] collecting in the last [1s]
[2017-11-01T02:46:53,986][INFO ][o.e.m.j.JvmGcMonitorService] [host1-1] [gc][2228] overhead, spent [257ms] collecting in the last [1s]
[2017-11-01T02:48:01,021][INFO ][o.e.m.j.JvmGcMonitorService] [host1-1] [gc][2295] overhead, spent [265ms] collecting in the last [1s]
[2017-11-01T02:48:02,034][INFO ][o.e.m.j.JvmGcMonitorService] [host1-1] [gc][2296] overhead, spent [262ms] collecting in the last [1s]
[2017-11-01T02:53:44,505][INFO ][o.e.m.j.JvmGcMonitorService] [host1-1] [gc][2637] overhead, spent [289ms] collecting in the last [1s]
[2017-11-01T02:55:08,319][INFO ][o.e.m.j.JvmGcMonitorService] [host1-1] [gc][old][2702][18] duration [18s], collections [2]/[18.9s], total [18s]/[19.8s], memory [11.5gb]->[10.9gb]/[11.8gb], all_pools {[young] [696.4mb]->[14.4mb]/[865.3mb]}{[survivor] [108.1mb]->[0b]/[108.1mb]}{[old] [10.7gb]->[10.9gb]/[10.9gb]}
[2017-11-01T02:55:08,319][WARN ][o.e.m.j.JvmGcMonitorService] [host1-1] [gc][2702] overhead, spent [18s] collecting in the last [18.9s]
[2017-11-01T02:55:32,287][WARN ][o.e.m.j.JvmGcMonitorService] [host1-1] [gc][old][2704][19] duration [21.9s], collections [1]/[22.9s], total [21.9s]/[41.8s], memory [11.3gb]->[11gb]/[11.8gb], all_pools {[young] [476.7mb]->[119.3mb]/[865.3mb]}{[survivor] [0b]->[0b]/[108.1mb]}{[old] [10.9gb]->[10.9gb]/[10.9gb]}
[2017-11-01T02:55:32,287][WARN ][o.e.m.j.JvmGcMonitorService] [host1-1] [gc][2704] overhead, spent [21.9s] collecting in the last [22.9s]
[2017-11-01T02:55:50,875][WARN ][o.e.m.j.JvmGcMonitorService] [host1-1] [gc][old][2705][20] duration [18.2s], collections [1]/[18.5s], total [18.2s]/[1m], memory [11gb]->[11.2gb]/[11.8gb], all_pools {[young] [119.3mb]->[278.7mb]/[865.3mb]}{[survivor] [0b]->[0b]/[108.1mb]}{[old] [10.9gb]->[10.9gb]/[10.9gb]}
[2017-11-01T02:55:50,888][WARN ][o.e.m.j.JvmGcMonitorService] [host1-1] [gc][2705] overhead, spent [18.2s] collecting in the last [18.5s]
[2017-11-01T02:56:12,513][WARN ][o.e.m.j.JvmGcMonitorService] [host1-1] [gc][old][2706][21] duration [21.3s], collections [1]/[21.6s], total [21.3s]/[1.3m], memory [11.2gb]->[11gb]/[11.8gb], all_pools {[young] [278.7mb]->[80.1mb]/[865.3mb]}{[survivor] [0b]->[0b]/[108.1mb]}{[old] [10.9gb]->[10.9gb]/[10.9gb]}
[2017-11-01T02:56:12,513][WARN ][o.e.m.j.JvmGcMonitorService] [host1-1] [gc][2706] overhead, spent [21.3s] collecting in the last [21.6s]
[2017-11-01T02:56:30,924][WARN ][o.e.m.j.JvmGcMonitorService] [host1-1] [gc][old][2707][22] duration [17.9s], collections [1]/[18.4s], total [17.9s]/[1.6m], memory [11gb]->[10.9gb]/[11.8gb], all_pools {[young] [80.1mb]->[40.6mb]/[865.3mb]}{[survivor] [0b]->[0b]/[108.1mb]}{[old] [10.9gb]->[10.9gb]/[10.9gb]}
[2017-11-01T02:56:30,924][WARN ][o.e.m.j.JvmGcMonitorService] [host1-1] [gc][2707] overhead, spent [17.9s] collecting in the last [18.4s]
[2017-11-01T02:56:52,476][WARN ][o.e.m.j.JvmGcMonitorService] [host1-1] [gc][old][2708][23] duration [21.2s], collections [1]/[21.5s], total [21.2s]/[2m], memory [10.9gb]->[11gb]/[11.8gb], all_pools {[young] [40.6mb]->[62.2mb]/[865.3mb]}{[survivor] [0b]->[0b]/[108.1mb]}{[old] [10.9gb]->[10.9gb]/[10.9gb]}
[2017-11-01T02:56:52,487][WARN ][o.e.m.j.JvmGcMonitorService] [host1-1] [gc][2708] overhead, spent [21.2s] collecting in the last [21.5s]
[2017-11-01T02:57:10,696][WARN ][o.e.m.j.JvmGcMonitorService] [host1-1] [gc][old][2709][24] duration [17.8s], collections [1]/[18.2s], total [17.8s]/[2.3m], memory [11gb]->[10.9gb]/[11.8gb], all_pools {[young] [62.2mb]->[28.6mb]/[865.3mb]}{[survivor] [0b]->[0b]/[108.1mb]}{[old] [10.9gb]->[10.9gb]/[10.9gb]}
[2017-11-01T02:57:10,696][WARN ][o.e.m.j.JvmGcMonitorService] [host1-1] [gc][2709] overhead, spent [17.8s] collecting in the last [18.2s]
[2017-11-01T02:57:31,420][WARN ][o.e.m.j.JvmGcMonitorService] [host1-1] [gc][old][2710][25] duration [20.2s], collections [1]/[20.7s], total [20.2s]/[2.6m], memory [10.9gb]->[10.9gb]/[11.8gb], all_pools {[young] [28.6mb]->[54.2mb]/[865.3mb]}{[survivor] [0b]->[0b]/[108.1mb]}{[old] [10.9gb]->[10.9gb]/[10.9gb]}
[2017-11-01T02:57:31,420][WARN ][o.e.m.j.JvmGcMonitorService] [host1-1] [gc][2710] overhead, spent [20.2s] collecting in the last [20.7s]

What is the full output of the cluster stats API?
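
In case you need it, here is a sketch of pulling it with Python and printing a few of the heap-related fields (assuming a node at localhost:9200 with no authentication):

```python
# Minimal sketch: fetch the cluster stats API and print a few heap-related fields.
# Assumes a node reachable at http://localhost:9200 with no authentication.
import requests

stats = requests.get("http://localhost:9200/_cluster/stats").json()

indices, jvm = stats["indices"], stats["nodes"]["jvm"]
print("total shards:", indices["shards"]["total"])
print("segments:", indices["segments"]["count"])
print("segments memory (bytes):", indices["segments"]["memory_in_bytes"])
print("heap used/max (bytes):",
      jvm["mem"]["heap_used_in_bytes"], "/", jvm["mem"]["heap_max_in_bytes"])
```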

Here is the node stats result I have:

"_shards": {
"total": 1157,
"successful": 1157,
"failed": 0
},
"_all": {
"primaries": {
"docs": {
"count": 1124663379,
"deleted": 0
},
"store": {
"size_in_bytes": 1222205430485,
"throttle_time_in_millis": 0
},
"indexing": {
"index_total": 0,
"index_time_in_millis": 0,
"index_current": 0,
"index_failed": 0,
"delete_total": 0,
"delete_time_in_millis": 0,
"delete_current": 0,
"noop_update_total": 0,
"is_throttled": false,
"throttle_time_in_millis": 0
},
"get": {
"total": 4,
"time_in_millis": 6,
"exists_total": 4,
"exists_time_in_millis": 6,
"missing_total": 0,
"missing_time_in_millis": 0,
"current": 0
},
"search": {
"open_contexts": 0,
"query_total": 27582,
"query_time_in_millis": 1973771,
"query_current": 0,
"fetch_total": 27582,
"fetch_time_in_millis": 3205556,
"fetch_current": 0,
"scroll_total": 0,
"scroll_time_in_millis": 0,
"scroll_current": 0,
"suggest_total": 0,
"suggest_time_in_millis": 0,
"suggest_current": 0
},
"merges": {
"current": 0,
"current_docs": 0,
"current_size_in_bytes": 0,
"total": 0,
"total_time_in_millis": 0,
"total_docs": 0,
"total_size_in_bytes": 0,
"total_stopped_time_in_millis": 0,
"total_throttled_time_in_millis": 0,
"total_auto_throttle_in_bytes": 24243077120
},
"refresh": {
"total": 1156,
"total_time_in_millis": 28470,
"listeners": 0
},
"flush": {
"total": 1156,
"total_time_in_millis": 70
},
"warmer": {
"current": 0,
"total": 2312,
"total_time_in_millis": 2866
},
"query_cache": {
"memory_size_in_bytes": 273355040,
"total_count": 449953,
"hit_count": 160211,
"miss_count": 289742,
"cache_size": 27168,
"cache_count": 27168,
"evictions": 0
},
"fielddata": {
"memory_size_in_bytes": 0,
"evictions": 0
},
"completion": {
"size_in_bytes": 0
},
"segments": {
"count": 22998,
"memory_in_bytes": 3395618190,
"terms_memory_in_bytes": 2422319263,
"stored_fields_memory_in_bytes": 109814856,
"term_vectors_memory_in_bytes": 0,
"norms_memory_in_bytes": 274366848,
"points_memory_in_bytes": 41837863,
"doc_values_memory_in_bytes": 547279360,
"index_writer_memory_in_bytes": 0,
"version_map_memory_in_bytes": 0,
"fixed_bit_set_memory_in_bytes": 0,
"max_unsafe_auto_id_timestamp": -1,
"file_sizes": {}
},
"translog": {
"operations": 0,
"size_in_bytes": 99416
},
"request_cache": {
"memory_size_in_bytes": 0,
"evictions": 0,
"hit_count": 0,
"miss_count": 0
},
"recovery": {
"current_as_source": 0,
"current_as_target": 0,
"throttle_time_in_millis": 0
}
},
"total": {
"docs": {
"count": 1124663405,
"deleted": 0
},
"store": {
"size_in_bytes": 1222205525166,
"throttle_time_in_millis": 0
},
"indexing": {
"index_total": 0,
"index_time_in_millis": 0,
"index_current": 0,
"index_failed": 0,
"delete_total": 0,
"delete_time_in_millis": 0,
"delete_current": 0,
"noop_update_total": 0,
"is_throttled": false,
"throttle_time_in_millis": 0
},
"get": {
"total": 10,
"time_in_millis": 11,
"exists_total": 10,
"exists_time_in_millis": 11,
"missing_total": 0,
"missing_time_in_millis": 0,
"current": 0
},
"search": {
"open_contexts": 0,
"query_total": 28084,
"query_time_in_millis": 1973872,
"query_current": 0,
"fetch_total": 28084,
"fetch_time_in_millis": 3205612,
"fetch_current": 0,
"scroll_total": 0,
"scroll_time_in_millis": 0,
"scroll_current": 0,
"suggest_total": 0,
"suggest_time_in_millis": 0,
"suggest_current": 0
},
"merges": {
"current": 0,
"current_docs": 0,
"current_size_in_bytes": 0,
"total": 0,
"total_time_in_millis": 0,
"total_docs": 0,
"total_size_in_bytes": 0,
"total_stopped_time_in_millis": 0,
"total_throttled_time_in_millis": 0,
"total_auto_throttle_in_bytes": 24264048640
},
"refresh": {
"total": 1157,
"total_time_in_millis": 28472,
"listeners": 0
},
"flush": {
"total": 1157,
"total_time_in_millis": 70
},
"warmer": {
"current": 0,
"total": 2314,
"total_time_in_millis": 2866
},
"query_cache": {
"memory_size_in_bytes": 273355040,
"total_count": 449953,
"hit_count": 160211,
"miss_count": 289742,
"cache_size": 27168,
"cache_count": 27168,
"evictions": 0
},
"fielddata": {
"memory_size_in_bytes": 0,
"evictions": 0
},
"completion": {
"size_in_bytes": 0
},
"segments": {
"count": 23004,
"memory_in_bytes": 3395645708,
"terms_memory_in_bytes": 2422340287,
"stored_fields_memory_in_bytes": 109816728,
"term_vectors_memory_in_bytes": 0,
"norms_memory_in_bytes": 274369280,
"points_memory_in_bytes": 41837877,
"doc_values_memory_in_bytes": 547281536,
"index_writer_memory_in_bytes": 0,
"version_map_memory_in_bytes": 0,
"fixed_bit_set_memory_in_bytes": 0,
"max_unsafe_auto_id_timestamp": -1,
"file_sizes": {}
},
"translog": {
"operations": 0,
"size_in_bytes": 99459
},
"request_cache": {
"memory_size_in_bytes": 0,
"evictions": 0,
"hit_count": 0,
"miss_count": 0
},
"recovery": {
"current_as_source": 0,
"current_as_target": 0,
"throttle_time_in_millis": 0
}
}
},

That does not seem to be the full output. Can you provide the full response in a Gist (if it does not fit here)?

From what I can see in the part above, it looks like you have a lot of quite small shards, resulting in a lot of small segments. You may want to read this blog post on shards and sharding for some guidance.
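
If you want to confirm that, the _cat/shards API will show how large each shard actually is; a rough sketch (assuming a node at localhost:9200 with no authentication):

```python
# Minimal sketch: list shards sorted by on-disk size via the _cat/shards API.
# Assumes a node reachable at http://localhost:9200 with no authentication.
import requests

resp = requests.get(
    "http://localhost:9200/_cat/shards",
    params={"format": "json", "bytes": "b", "h": "index,shard,prirep,docs,store"},
)
shards = resp.json()

# Sort smallest first to spot the tiny shards that add per-shard heap overhead.
for s in sorted(shards, key=lambda s: int(s["store"] or 0)):
    print(s["index"], s["shard"], s["prirep"], s["docs"], s["store"])
```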


Thanks, this is helpful.

But do you mean the shard overhead is what is taking up the old gen space and never getting freed?

I have the cluster health info here; hopefully it can reveal more.

As outlined in the blog post I linked to, each shard has some memory overhead that uses heap. By reducing the number of shards and segments you should be able to reduce heap usage. At the moment it seems like the heap is near the limit and not able to free much when it does GC.
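
As an illustration (a sketch, not something tested against your cluster), indices that are no longer being written to can be force-merged down to a single segment each, which reduces the per-segment heap overhead; the index pattern below is hypothetical:

```python
# Minimal sketch: force-merge read-only log indices down to one segment each
# to reduce per-segment heap overhead. Only do this for indices that are no
# longer being written to; the request can take a long time on large indices.
import requests

BASE = "http://localhost:9200"     # assumed node address, no authentication
PATTERN = "logstash-2017.10.*"     # hypothetical read-only index pattern

resp = requests.post(
    f"{BASE}/{PATTERN}/_forcemerge",
    params={"max_num_segments": 1},
)
print(resp.status_code, resp.json())
```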

Thanks Christian, we will try to improve it.

BTW, is there a way to get the detailed heap usage for Elasticsearch?
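
One thing we might try ourselves is the nodes stats API, which breaks out heap consumers such as segment memory and the caches per node; a rough sketch (assuming a node at localhost:9200 with no authentication):

```python
# Minimal sketch: break down per-node heap consumers (segment memory, caches)
# from the nodes stats API. Assumes http://localhost:9200, no authentication.
import requests

resp = requests.get(
    "http://localhost:9200/_nodes/stats/indices/"
    "segments,query_cache,request_cache,fielddata"
)
for node in resp.json()["nodes"].values():
    idx = node["indices"]
    print(node["name"])
    print("  segments memory (bytes):", idx["segments"]["memory_in_bytes"])
    print("  terms memory (bytes):   ", idx["segments"]["terms_memory_in_bytes"])
    print("  query cache (bytes):    ", idx["query_cache"]["memory_size_in_bytes"])
    print("  request cache (bytes):  ", idx["request_cache"]["memory_size_in_bytes"])
    print("  fielddata (bytes):      ", idx["fielddata"]["memory_size_in_bytes"])
```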

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.