We are having issues with Elasticsearch. We have written applications that read from and write to ES for standard purposes, but we keep running into unpredictable problems with it.
First, we are on Elasticsearch 2.3. I know that puts us at least 3 years behind.
Before we commit to moving to ES 6, we would like to clarify a few of the problems we are facing now and make sure they no longer exist in 6.X. Otherwise, we would like to see whether some other software can meet our requirements.
31 data nodes * (4T disk + 20 CPU + 125G RAM (max heap size is 32G))
3 master nodes * (2 CPU + 7G RAM + 30G disk)
5 query nodes * (41 CPU + 30G RAM + 32G disk)
- We often face GC issues. We never found a solid reason why old-generation GC kicks in, so we live with it and have implemented automatic restarts.
- When a node restarts after a GC episode, it often takes more than 20-30 minutes to rejoin the cluster. We are not sure why. All we get in the data node log is the following:
[2018-06-21 14:50:54,392][DEBUG][action.admin.cluster.health] [prd-es116] no known master node, scheduling a retry
- We sometimes see low disk watermark (lwm) messages in the master logs, and shards get stuck forever.
The master logs showed that es131 had hit the low watermark, which is 85%, so I raised it to 89% and restarted the data node.
Changed to -> cluster.routing.allocation.disk.watermark.low: 89%
The master still shows: low disk watermark [85%] exceeded on [prd-es131]
(Why can't ES allocate the shards to other nodes that are only 65% full?)
- We see some weird errors from applications while writing, and immediately afterwards old GC kicks in, followed by a restart.
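
For the low watermark item above, here is a minimal sketch of how the setting could be applied cluster-wide through the cluster update settings API (PUT /_cluster/settings) instead of in a single node's config file. I am assuming, without having confirmed it, that the master makes allocation decisions using its own view of this setting, which might explain why it still reports 85%. The function names and the base URL are hypothetical and only for illustration.

```python
import json
import urllib.request

# Setting name exactly as used in our config change above.
WATERMARK_LOW = "cluster.routing.allocation.disk.watermark.low"


def build_watermark_settings(value, persistent=False):
    """Build the JSON body for PUT /_cluster/settings.

    This sketch only accepts percentage values such as "89%".
    """
    if not value.endswith("%") or not value[:-1].isdigit():
        raise ValueError("expected a percentage such as '89%'")
    if not 0 < int(value[:-1]) <= 100:
        raise ValueError("percentage must be between 1 and 100")
    # "transient" settings are lost on a full cluster restart,
    # "persistent" settings survive it.
    scope = "persistent" if persistent else "transient"
    return {scope: {WATERMARK_LOW: value}}


def apply_settings(base_url, body):
    """Send the settings to the cluster.

    base_url is a hypothetical address, e.g. "http://localhost:9200".
    """
    request = urllib.request.Request(
        base_url.rstrip("/") + "/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Only prints the request body; nothing is sent to a cluster here.
    print(json.dumps(build_watermark_settings("89%"), indent=2))
```

Calling apply_settings with the address of any node in the cluster would send the change; whether that actually unblocks the stuck shards in our case is something we have not verified.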
Our analytics is based entirely on ES. We are in talks about moving to some other enterprise data analytics product, but we would like to give ES another chance.
If ES 6.X can solve these problems, we would definitely like to give it a shot.