Note: this is not about the AWS Elasticsearch managed service; the cluster is self-managed.
I have an ES 6.8 cluster on m4.2xlarge (32 GB RAM) CentOS 7 machines on AWS.
GET _cat/nodes?v&s=name
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.1x.x.x31 64 99 10 1.32 1.25 1.24 di - data-1
10.1x.x.x6 30 99 6 0.72 0.80 0.84 di - data-2
10.1x.x.x34 68 99 36 1.12 1.08 1.18 di - data-3
10.1x.x.x03 49 99 17 1.36 1.40 1.40 di - data-4
10.1x.x.x33 44 99 49 1.54 1.68 1.67 di - data-5
10.1x.x.x10 44 99 13 1.26 1.45 1.57 di - data-6
10.1x.x.x8 32 99 8 1.39 1.17 1.17 di - data-7
10.1x.x.x7 43 71 2 0.42 0.31 0.26 mi * master-3
GET _cat/allocation?v&s=node
shards disk.indices disk.used disk.avail disk.total disk.percent host ip node
347 58.5gb 65.5gb 958.4gb 1023.9gb 6 10.1x.x.x31 10.1x.x.x31 data-1
240 42.3gb 48.1gb 975.8gb 1023.9gb 4 10.1x.x.x6 10.1x.x.x6 data-2
304 55gb 61.6gb 962.2gb 1023.9gb 6 10.1x.x.x34 10.1x.x.x34 data-3
382 57gb 64.4gb 959.5gb 1023.9gb 6 10.1x.x.x03 10.1x.x.x03 data-4
391 60.1gb 67.1gb 956.8gb 1023.9gb 6 10.1x.x.x33 10.1x.x.x33 data-5
391 55.8gb 63.1gb 960.8gb 1023.9gb 6 10.1x.x.x10 10.1x.x.x10 data-6
287 49.3gb 59.4gb 964.5gb 1023.9gb 5 10.1x.x.x8 10.1x.x.x8 data-7
We do bulk inserts/updates on it every morning and run searches during the day.
It's pretty stable. I set it up 4 months ago using the official Ansible role, so only Elasticsearch and the Datadog agent are installed on top of the default CentOS 7 image.
node.ml: false
bootstrap.memory_lock: true
thread_pool.write.queue_size: 1000
thread_pool.search.queue_size: 10000
indices.queries.cache.size: 5%
es_heap_size: 20g <-- !!! m4.2xlarge has 32GB RAM
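For reference, a quick way I can double-check the heap each node actually got (standard _cat/nodes columns, nothing custom):

# heap.max should show ~20g per data node with the config above
GET _cat/nodes?v&h=name,heap.max,heap.current,heap.percent,ram.max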
No issues with ES 6.8 at all: no timeouts and no downtime since the cluster started.
As part of migrating to ES7, I started a new cluster with the same Ansible role, same configs, same hardware; everything is the same.
The initial load is usually pretty heavy, but as expected it went through without issues.
Then came the second process, which merges some documents using aggregation queries, and it failed with an error. The culprit is the terms aggregation size, which is set to 2000000 (why it's set to that number is a different topic). Only a small portion of the buckets is actually returned, because of the bucket_selector pipeline aggregation.
Here is an example with a nested aggregation, which is a bit more complex than the non-nested aggregations we also run:
GET index_id1/_search
{
  "size": 0,
  "aggs": {
    "byNested": {
      "nested": {
        "path": "nestedObjects"
      },
      "aggs": {
        "sameIds": {
          "terms": {
            "script": {
              "lang": "painless",
              "source": "return doc['nestedObjects.id'].value"
            },
            "size": 2000000 <-- too big
          },
          "aggs": {
            "byId": {
              "reverse_nested": {}
            },
            "byId_bucket_filter": {
              "bucket_selector": {
                "buckets_path": {
                  "totalCount": "byId._count"
                },
                "script": {
                  "source": "params.totalCount > 1"
                }
              }
            }
          }
        }
      }
    }
  }
}
Before adjusting the client and switching to a composite aggregation, I wanted to let the test finish the next day, so I changed "search.max_buckets" to 2000000. Everything else is the same as on ES 6.8.
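The change itself is just the dynamic cluster setting, roughly like this (persistent vs. transient is a side detail):

PUT _cluster/settings
{
  "persistent": {
    "search.max_buckets": 2000000
  }
}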
The next day I found out that, although the test ES7 cluster hadn't been used at all, the Elasticsearch service had died on 4 nodes. I checked RAM and it looked like this: the JVM had eaten up everything.
I restarted the service on all nodes with es_heap_size set to 50% of RAM (16g) and continued the process with the aggregations. Got this:
type: circuit_breaking_exception Reason: "[parent] Data too large, data for [<http_request>] would be [16518949320/15.3gb], which is larger than the limit of [16254631936/15.1gb], real usage: [16518948880/15.3gb], new bytes reserved: [440/440b], usages [request=16440/16kb, fielddata=317/317b, in_flight_requests=154603580/147.4mb, accounting=540721543/515.6mb]"
OK, I found out that the difference between ES 7.4 and 6.8 is that the parent circuit breaker now takes real memory usage into account. I set indices.breaker.total.use_real_memory: false, since ES 6.8 has no such setting.
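As far as I understand it's a static setting, so it goes into elasticsearch.yml on every node followed by a restart; the stats call below is just the stock endpoint I'd use to sanity-check the breaker limits afterwards:

# elasticsearch.yml (static setting, needs a node restart)
indices.breaker.total.use_real_memory: false

# check breaker limits and estimated usage per node afterwards
GET _nodes/stats/breaker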
Ran the process again and there were no more issues.
So the only remaining difference in cluster config is the heap size: ES 6.8 has es_heap_size = 20g (~63% of RAM) vs ES 7.4 with es_heap_size = 16g (50% of RAM).
For the next 10 hours nobody touched the cluster, yet I see this pattern:
Memory graph for ES7 over the past 10 hours, NOT being used:
Memory graph for ES6 over the whole week, being used:
I understand the issue with the aggregation and, as I mentioned, will switch to a composite aggregation to get the buckets; that's a different topic.
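For what it's worth, the composite version is only a sketch so far; the paging part would look roughly like this, with the field name purely hypothetical and the nested handling plus the "more than one" duplicate filter deliberately left out (as far as I can tell bucket_selector can't be used under composite, so that filter would move to the client):

GET index_id1/_search
{
  "size": 0,
  "aggs": {
    "sameIds": {
      "composite": {
        "size": 10000,
        "sources": [
          { "id": { "terms": { "field": "someIdField" } } }
        ]
      }
    }
  }
}

Each response returns an after_key, which goes into the "after" parameter of the next request to pull the following page of buckets.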
My question is: why is the RAM consumption pattern on ES7 so different from ES6.8 (same configs/env/hardware)? I see these zig-zags on ES7, but the baseline keeps growing: after every cleanup the consumed memory is higher than it was after the previous cleanup. Should I expect it to eat up all the RAM again and die? Is there a way to track down what exactly is causing it to grow while the cluster is idle? (The snapshot calls I can run are listed below.)
Is there anything else I could adjust before diving into the aggregation rework? Maybe GC settings, or...?
Also, why did switching off real memory usage for the parent breaker do the trick?
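For reference, these are the kinds of per-node snapshots I can pull with the stock APIs if that helps with diagnosing the growth (filter_path is just to trim the output):

# JVM heap/pool numbers per node
GET _nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem

# circuit breaker limits and estimated usage per node
GET _nodes/stats/breaker

# segment-related memory (terms, norms, points, doc values) per node
GET _nodes/stats/indices/segments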