Hi. I have been reading the docs to learn about Elasticsearch. I would like definitive help and guidance to quickly assess my cluster and get "actual" recommendations for optimization. I am really hoping no one says "it depends", as I have read doc after doc without getting anywhere.
I inherited a 4-node cluster which recently started to crash about once a week, from what I suspect was very high heap usage. I read that once heap usage goes into the 80s (percent) I can expect trouble. Three of the nodes would go into the high 80s, one in particular; I would need to restart that node, after which heap usage would drop back to the high 70s and then continue to climb. I checked my logs for errors and found none.
I know our load has been increasing this year, and last year the cluster was fine, so logically I assumed we need to scale out. Not being able to ascertain exactly how much to scale by, I arbitrarily picked two nodes to add to the cluster, and it seems healthier. But I really want numbers and facts, not wishful thinking that the issue will go away. 1) I don't know by how much the data load has increased, and I have no idea how to determine that.
I have read various tidbits from the official docs and forums and am no closer to knowing how to assess the current setup and plan performance tuning. I see many others have the same issue.
Previous setup:
4x 8 CPU, 32GB RAM VMs (storage can grow dynamically on an NFS mount)
1x Master/Data/Client node
3x Data nodes
Replicas: 1
Shards: 5 (per index)
discovery.zen.minimum_master_nodes: 1
Java memory min/max: 16GB
Here is some of the past output (apologies, I did not capture all the data well beforehand):
shards disk.indices disk.used node
??? ???tb 5.6tb hostl0104
??? ???tb 5.6tb hostl0025
??? ???tb 5.6tb hostl0103
??? ???tb 5.6tb hostl0026
heap.percent ram.percent cpu node.role master name
85 68 34 d - hostl0104
86 99 36 d - hostl0026
88 91 32 md * hostl0025
82 99 34 d - hostl0103
cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
es_myapp_test green 4 4 3202 1601 0 0 0 0 - 100.0%
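For reference, the outputs above (and the similar ones further down) come from roughly these _cat API calls, assuming the node is reachable on localhost:9200:
curl -s 'localhost:9200/_cat/allocation?v'   # shards / disk.indices / disk.used per node
curl -s 'localhost:9200/_cat/nodes?v'        # heap.percent / ram.percent / cpu / node.role
curl -s 'localhost:9200/_cat/health?v'       # cluster status / shard counts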
As mentioned, I took a guess based on what I had read so far and added 2 nodes.
New setup:
6x 8 CPU, 32GB RAM VMs (storage can grow dynamically on an NFS mount)
1x Master/Data/Client node
2x Master/Data
3x Data nodes
Replicas: 1
Shards: 5 (per index)
discovery.zen.minimum_master_nodes: 2
Java memory min/max: 16GB
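For context, the relevant non-default settings per node look roughly like this in elasticsearch.yml (a sketch from memory; the unicast host list in particular may differ; hostl0025, hostl0026 and hostl0103 are the master-eligible nodes):
# on the three master-eligible nodes
node.master: true
node.data: true
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.unicast.hosts: ["hostl0025", "hostl0026", "hostl0103"]
# the data-only nodes use node.master: false instead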
Syncing data is still in progress. I have observed that 1) heap usage went down to safer levels, and 2) disk usage on all the previous hosts is decreasing as data gets distributed across all six nodes. Hardware-wise I increased capacity by 50%; in terms of actual gains, I have no idea. I am a bit disappointed that heap usage is currently hovering in the mid-70s on the old nodes. I have also observed that heap usage is proportional to the amount of data on disk: when the sync first started and the data was insignificant, heap usage was only ~5%, and as the volume grows so does the heap usage.
shards disk.indices disk.used node
616 4.2tb 3.4tb hostl0104
616 4.2tb 3.5tb hostl0025
368 2.5tb 2.1tb hostl0110
617 4.2tb 3.4tb hostl0103
617 4.2tb 3.5tb hostl0026
368 2.2tb 1.8tb hostl0111
heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
72 68 3 2.10 1.92 1.93 d - hostl0104
70 99 3 1.98 1.80 1.74 md * hostl0026
61 75 3 1.00 1.00 0.98 d - hostl0110
74 91 2 1.82 1.83 1.78 md - hostl0025
71 99 4 1.50 1.59 1.71 md - hostl0103
56 99 3 1.49 1.30 1.21 d - hostl0111
cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
es_myapp_test green 6 6 3202 1601 2 0 0 0 - 100.0%
Once fully synced, I plan to move the client role either onto its own dedicated client-node install or onto one of the data nodes, moving it off the master/data node.
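If I go the dedicated client route, my understanding is that such a node would be configured roughly like this in elasticsearch.yml (node.ingest only applies on 5.x and later):
node.master: false
node.data: false
node.ingest: false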
I really would like guidance on how to proceed next with analyzing my cluster. I can get a ton of data from the ES APIs, but those numbers are meaningless to me; I have no idea what good looks like, what to look for, or what a healthy benchmark should be. I read somewhere that my shard count should match the number of nodes I have.
Also, not sure if this helps, but when accessing the client via Kibana it takes 2.7 minutes to load the first screen.
Aside from setting the master/data node roles and the zen discovery settings, my installs are generic.
/etc/security/limits.conf
user soft memlock unlimited
user hard memlock unlimited
user - nofile 65536
/etc/sysctl.conf
vm.max_map_count=262144
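For completeness, the memlock limits above only take effect if memory locking is enabled in elasticsearch.yml (bootstrap.memory_lock on 5.x+, bootstrap.mlockall on older versions), and the 16GB min/max heap is set along these lines (in jvm.options on 5.x+, or via the ES_HEAP_SIZE environment variable on older versions):
-Xms16g
-Xmx16g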
Below I am pasting my per-index output. From what I can tell, I have quite a few indexes that are HUGE, the largest being 1.5TB. I have no idea where to go from here; any solid guidance would be greatly appreciated.
If it helps any, I believe the data below covers approximately the last 2 years. The forum's post limit did not allow me to paste it all; we are now at roughly 3.6TB x 6 nodes.
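(The listing below comes from something like: curl -s 'localhost:9200/_cat/indices?v'.)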
health status index pri rep docs.count docs.deleted store.size pri.store.size
green open my_index_name_goes_here 5 1 1682791 0 2.9gb 1.4gb
green open my_index_name_goes_here 5 1 9374637 0 8.8gb 4.4gb
green open my_index_name_goes_here 5 1 8675140 0 14gb 7gb
green open my_index_name_goes_here 5 1 10521171 0 17.1gb 8.5gb
green open my_index_name_goes_here 5 1 893644249 0 1.4tb 748.2gb
green open my_index_name_goes_here 5 1 11890171 0 19.8gb 9.9gb
green open my_index_name_goes_here 5 1 13164680 0 20.6gb 10.3gb
green open my_index_name_goes_here 5 1 129384775 0 207.1gb 103.5gb
green open my_index_name_goes_here 5 1 13530013 0 21.9gb 10.9gb
green open my_index_name_goes_here 5 1 1317183222 0 1.5tb 779gb