I am building an Elasticsearch (6.6.1) cluster deployed on Azure Kubernetes. The workload is -
- Data volume: 25 GB per day (with 1 month of retention)
- Search scenarios: dashboards with only a few aggregation queries, max 50 users a day
Current Configuration -
- 3 master-eligible nodes (8 GB RAM, 2-core CPU, 100 GB hard disk each)
- 10 data nodes (8 GB RAM, 2-core CPU, 512 GB disk (total to retain data for 1 month))
- No ingest nodes
Other settings -
- The JVM heap (-Xmx) is set to 3500m (as 50% of RAM was recommended).
- Shards per index = 5 (default)
- Replication per shard = 2
- Master nodes are NOT data nodes
- Refresh interval for indices is set to 30s
- Each day's data is stored in a separate index.
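To put rough numbers on the settings above, here is a back-of-the-envelope sketch. It assumes that "1 month" means 30 daily indices, that "replication = 2" means `number_of_replicas: 2` (three copies of every shard), and that 25 GB per day is the primary on-disk size. None of these are confirmed, so treat the output as an estimate only.

```python
# Rough sizing from the numbers in this post (all inputs are assumptions
# as described above, not measured values).

def cluster_sizing(daily_gb, days, primaries, replicas, data_nodes):
    copies = 1 + replicas                      # primary + replicas
    total_shards = days * primaries * copies   # shards held across the cluster
    total_gb = daily_gb * days * copies        # disk needed incl. replicas
    return {
        "total_shards": total_shards,
        "shards_per_node": total_shards / data_nodes,
        "gb_per_primary_shard": daily_gb / primaries,
        "total_gb_with_replicas": total_gb,
        "gb_per_node": total_gb / data_nodes,
    }

if __name__ == "__main__":
    sizing = cluster_sizing(daily_gb=25, days=30, primaries=5,
                            replicas=2, data_nodes=10)
    for name, value in sizing.items():
        print(f"{name}: {value}")
```

Under these assumptions each data node holds about 45 shards of roughly 5 GB each, all sharing a 3500m heap.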
The data nodes sit behind a Kubernetes ingress service, which is the backend of an Azure Application Gateway.
Data is ingested into ES through the REST API provided by ES.
After running this cluster for 15 days, I suddenly started getting timeouts. The client calling ES receives a 504 from the Azure App Gateway.
The search query we usually run is -
Index = prefix-2019-08* (which matches all indices created in August), with a term query.
Is it supposed to take this long (more than 20s)?
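For reference, the query has roughly the shape sketched below. The endpoint `http://localhost:9200` and the field/value pair are placeholders, not my real ones. `timed_search` compares the `took` value reported by ES with the wall-clock time seen by the client, which should help tell whether the 20s is spent inside ES or in the ingress/gateway path.

```python
import json
import time
import urllib.request

ES_URL = "http://localhost:9200"  # hypothetical endpoint, replace with yours

def build_term_search(index_pattern, field, value, base_url=ES_URL):
    """Build a POST /<pattern>/_search request carrying a single term query."""
    body = {"query": {"term": {field: value}}}
    return urllib.request.Request(
        url=f"{base_url}/{index_pattern}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def timed_search(request, client_timeout=60):
    """Run the search and return ES-side vs client-side timings."""
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=client_timeout) as resp:
        result = json.load(resp)
    wall_ms = (time.perf_counter() - start) * 1000
    return {
        "took_ms": result["took"],            # time spent inside ES
        "wall_ms": round(wall_ms),            # time seen by the client
        "shards": result["_shards"]["total"], # shards the query fanned out to
        "timed_out": result["timed_out"],
    }

if __name__ == "__main__":
    # "status" / "active" are made-up field and value names.
    req = build_term_search("prefix-2019-08*", "status", "active")
    print(timed_search(req))
```

Running it directly against a data node (bypassing the App Gateway) and then through the gateway would show where the extra latency comes from.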
During the timeouts, a few observations -
- None of the data node / master node logs showed any errors.
- One or two bulk rejections
- Azure App gateway was showing 5xx errors
- RAM usage was close to 50% of the system (the ES Java process was fully using whatever Xmx it was given)
- CPU usage per pod was very low, around 0.1%
- Cluster health was green.
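The bulk rejections above can be checked per node with the `_cat/thread_pool` API. The sketch below assumes the 6.x pool names `search` and `write` (the bulk pool was renamed to `write` in 6.3, as far as I know) and a placeholder endpoint; `pools_with_pressure` is just a helper name I made up.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # hypothetical endpoint, replace with yours
THREAD_POOL_PATH = (
    "/_cat/thread_pool/search,write"
    "?format=json&h=node_name,name,active,queue,rejected"
)

def pools_with_pressure(rows):
    """Return (node, pool, queue, rejected) for pools that queue or reject.

    `rows` is the parsed JSON from the _cat call; _cat returns numbers as
    strings, so they are converted before comparing.
    """
    flagged = []
    for row in rows:
        queue = int(row["queue"])
        rejected = int(row["rejected"])
        if queue > 0 or rejected > 0:
            flagged.append((row["node_name"], row["name"], queue, rejected))
    # Most rejections first, then longest queue.
    return sorted(flagged, key=lambda r: (r[3], r[2]), reverse=True)

def fetch_thread_pools(base_url=ES_URL):
    with urllib.request.urlopen(base_url + THREAD_POOL_PATH, timeout=10) as resp:
        return json.load(resp)

if __name__ == "__main__":
    for node, pool, queue, rejected in pools_with_pressure(fetch_thread_pools()):
        print(f"{node} {pool}: queue={queue} rejected={rejected}")
```

Polling this during a timeout window would show whether the `search` queue is actually growing, which is what my guess below depends on.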
My guess is -
The queries were taking longer to run; as the number of queries increased, more requests queued up, so requests started timing out.
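A rough way to sanity-check this guess is to compare how many shard-level searches one wildcard query creates with how many search threads the cluster has. The sketch assumes the default `search` pool size is `int((processors * 3) / 2) + 1`, which is how I read the 6.x thread pool documentation, and that every daily index has 5 primaries.

```python
def search_pool_size(processors):
    # Default `search` thread pool size as I understand the 6.x docs
    # (assumption): int((processors * 3) / 2) + 1
    return int((processors * 3) / 2) + 1

def fan_out(indices, primaries_per_index, data_nodes, cores_per_node):
    """Return (shard-level searches per query, search threads in the cluster)."""
    shard_requests = indices * primaries_per_index
    search_threads = data_nodes * search_pool_size(cores_per_node)
    return shard_requests, search_threads

if __name__ == "__main__":
    # 15 daily indices today, up to 31 for a full month behind prefix-2019-08*
    for days in (15, 31):
        shard_requests, threads = fan_out(days, 5, 10, 2)
        print(f"{days} indices: {shard_requests} shard searches "
              f"per query vs {threads} search threads")
```

With these assumptions a single query over the month touches more shards than there are search threads, so a handful of concurrent dashboard queries could already be queuing. It would not by itself explain the very low CPU usage, though.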
How can we solve this? Do I need to increase number of data nodes?
Recommendations from blogs I read -
- Use rolling indices (shrink the number of shards to 1 once writes stop)
- Check for frozen indices (should not apply in my case, as I regularly search with * (all indices))
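As I understand the first recommendation, shrinking a finished daily index to 1 shard takes two calls to the shrink API: first block writes and move a copy of every shard onto one node, then shrink into a new index. The sketch below only builds the requests. The index names and the node name are hypothetical, and the shrink should only be sent after the relocation from the first call has finished.

```python
import json

def shrink_requests(source, target, node_name, replicas=2):
    """Return the (method, path, body) pairs for shrinking `source` to 1 shard."""
    prepare = ("PUT", f"/{source}/_settings", {
        "settings": {
            # Force a copy of every shard onto one node and stop writes.
            "index.routing.allocation.require._name": node_name,
            "index.blocks.write": True,
        }
    })
    shrink = ("POST", f"/{source}/_shrink/{target}", {
        "settings": {
            "index.number_of_shards": 1,   # must be a factor of the source's 5
            "index.number_of_replicas": replicas,
            # Clear the settings copied over from the source index.
            "index.routing.allocation.require._name": None,
            "index.blocks.write": None,
        }
    })
    return [prepare, shrink]

if __name__ == "__main__":
    # Hypothetical daily index and node names.
    for method, path, body in shrink_requests(
            "prefix-2019-08-01", "prefix-2019-08-01-shrunk", "data-node-0"):
        print(method, path)
        print(json.dumps(body, indent=2))
```

Since the shrunk index gets a new name, the naming would have to keep matching the `prefix-2019-08*` pattern, and the original index would need deleting afterwards to avoid searching the same data twice.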
Please suggest what I should check in my config, and what changes I can make.