Hi,
I was hoping to get some assistance with the health of a cluster I am running in a prod environment.
I am relatively new to the Elastic Stack but have done quite a lot of reading on the various aspects of its configuration (trying to follow best practice etc.). There are a lot of elements to consider, though, and it is a bit overwhelming, so any assistance or guidance from an expert would be much appreciated.
Here's our current setup:
- 1 Elasticsearch node running on a VM in Azure with 4 cores and 32 GB memory (8 GB heap). No other nodes, so no replicas.
- 2 Logstash servers running on VMs in Azure. Server 1: 2 cores, 8 GB memory (2 GB heap). Server 2: 2 cores, 4 GB memory (2 GB heap).
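For reference, the heap sizes above are set in each service's jvm.options file; on the Elasticsearch node it is roughly the following (paraphrasing from memory, so the surrounding flags may differ):
# jvm.options on the Elasticsearch node (illustrative)
-Xms8g
-Xmx8g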
Server 1 runs 5 pipelines; the biggest pipeline ingests a peak of just under 2k events per second, and the others ingest negligible amounts of data.
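In case the pipeline layout matters, pipelines.yml on Server 1 is structured roughly like this (the pipeline ids and paths here are placeholders, not our real names):
# pipelines.yml on Logstash Server 1 (ids/paths are placeholders)
- pipeline.id: big-pipeline
  path.config: "/etc/logstash/conf.d/big-pipeline.conf"
- pipeline.id: small-pipeline-1
  path.config: "/etc/logstash/conf.d/small-pipeline-1.conf"
# ...plus three more small pipelines defined the same way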
We currently have a total of 89 indices and 233 shards. Daily indices are created for the large pipeline (roughly 50-100 million documents per index; the largest index is 120 GB, but the average is around 50 GB). These indices have 5 primary shards each. The other indices have 1 shard each and are rotated monthly (none of them is bigger than 2 GB).
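If it helps, I can pull the full index list (with shard counts and sizes) using something like the following and post the output here:
GET _cat/indices?v&h=index,pri,rep,docs.count,store.size&s=store.size:desc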
The health of the cluster is reported as green in Kibana; however, the Elasticsearch node is averaging over 94% JVM heap usage, and there are frequent garbage collection warnings in the Elasticsearch logs, such as:
[2019-03-01T16:10:35,926][WARN ][o.e.m.j.JvmGcMonitorService] [9uxYeLa] [gc][21210] overhead, spent [10s] collecting in the last [10.3s]
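I can also grab the current heap and GC statistics for the node with something like the following and share that output if it would be useful:
GET _nodes/stats/jvm?human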
Another concern is that after restarting Elasticsearch, it takes around 30 minutes to fully recover all shards, with lots of errors like this:
[2019-03-01T10:37:00,890][DEBUG][o.e.a.s.TransportSearchAction] [9uxYeLa] All shards failed for phase: [query]
Eventually they all turn green, though. Moreover, certain queries in Kibana produce an error message saying "x amount of shards failed" even though everything reports green.
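While the shards are still recovering after a restart, I believe I can capture more detail on what is taking so long with something like the following, and I am happy to post that output as well:
GET _cat/recovery?v&active_only=true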
Here's the output of GET _cluster/health?level=shards:
{
"cluster_name" : "elasticsearch",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 233,
"active_shards" : 233,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 100.0
}
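I can also provide per-shard detail if that would help narrow things down, e.g. via:
GET _cat/shards?v&h=index,shard,prirep,state,docs,store,node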
Here are metric details for the Elasticsearch node for the past 4 hours:
Any indication of what else I can look at? Please let me know if any other information would be helpful.
Gen.