Hello @stephenb!
Thank you for the kind welcome. I am looking forward to being as much of a contributor as I am a consumer and to adding some value along the way.
Thank you for confirming my suspicion that the field is not available; I know and understand that about Elasticsearch. As you said, I am trying to debug and find the root cause of the issue described below:
So basically we have a high-end, mid-sized cluster (4 Hot/Warm/Cold + 2 Frozen + 5 Ingest (8 CPU / 10 GB RAM) + 3 Master + 3 Clients). Data is sent from various sources to Logstash, which in turn sends it to the ingest nodes on Elasticsearch to run the pipelines that build the final document for each source. The whole cluster is containerized (running as StatefulSets on AWS EKS). The ingest / client nodes are load-balanced through two different ALBs, and all consumers connect to those LBs. The current ingest rate is roughly 8K events/sec on average, with 12-15K events/sec at peak times.
The area of concern is the ingest nodes and pipelines. We frequently face an issue where ingestion stops completely, and when tracing it we found that a couple of ingest nodes run at 100% CPU and request an insane amount of extra CPU (we can see `xx t` and `xx b` numbers for throttled CPU in Kibana Stack Monitoring). What puzzles me, however, is that one or two other ingest nodes run at 60-71% while the rest are pegged at 100% and are sometimes not even able to send their monitoring data to Kibana (we see N/A at times). The issue is normally resolved by completely stopping and then restarting the Logstash instances.
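For reference, since the nodes run in containers, the same throttling can be read straight from the nodes stats API (the cgroup counters under `os.cgroup.cpu.stat`). A minimal sketch of what I mean, with the endpoint and credentials as placeholders:

```python
import requests

# Placeholders: adjust the endpoint and credentials for your cluster.
ES = "https://ingest-xxx.xxx.net:443"
AUTH = ("elastic", "changeme")

# Pull only node names and the cgroup CPU throttle counters.
resp = requests.get(
    f"{ES}/_nodes/stats/os",
    params={"filter_path": "nodes.*.name,nodes.*.os.cgroup.cpu.stat"},
    auth=AUTH,
    verify=False,  # only if the cluster uses a self-signed certificate
)
resp.raise_for_status()

for node_id, node in resp.json()["nodes"].items():
    stat = node.get("os", {}).get("cgroup", {}).get("cpu", {}).get("stat")
    if not stat:
        continue  # node is not reporting cgroup stats
    print(
        node["name"],
        "throttled", stat["number_of_times_throttled"], "times,",
        stat["time_throttled_nanos"], "ns throttled",
    )
```

The counters are cumulative, so what matters is how fast `time_throttled_nanos` grows on the stuck nodes compared to the healthy ones.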
My first suspect would be that the ingest nodes are undersized; however, when we restart the Logstash instances, utilization does not go above 25% for a good hour, then it gradually climbs again until the nodes are stuck at 100% and the issue repeats.
Here is a sample graph of this behavior on one of the ingest nodes:
Now, the logs do not show anything; they are so clean, with almost no errors. The only error you get is:
"message": "failed to download database [GeoLite2-City.mmdb]",
"stacktrace": ["org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];"
but even those do not correlate with the times the issue happens. From Logstash's perspective, it reports the following error, though I can still call the ingest node URL and access the Elasticsearch APIs at the time of the issue:
"message":"Encountered a retryable error (will retry with exponential backoff)","code":504,"url":"https://ingest-xxx.xxx.net:443/_bulk","content_length":4962342
So my guess is that there is a certain pipeline that, when invoked at scale, causes this chaos. However, to get to it among hundreds of pipelines, I need to correlate the pipeline runs with the node that actually executed them and look for matches when a node goes down.
I already tried the `GET _nodes/stats/ingest?filter_path=nodes.*.ingest` stats, but they are not very helpful; with these volumes of events and pipelines you cannot get much useful information out of them (and there is no visualization in Kibana for this). I also tried building dashboards that sum what each data source logs per time slot, but there is still no clear pattern, especially since it seems the data is not sent at all when this happens, just stuck in some pipeline.
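To make the correlation I am after more concrete, here is a minimal sketch (endpoint and credentials are placeholders): sample `_nodes/stats/ingest` twice and diff the cumulative per-node, per-pipeline counters, so the pipelines burning the most time on the node that is pegged at 100% float to the top:

```python
import time
import requests

# Placeholders: endpoint and credentials for the cluster.
ES = "https://ingest-xxx.xxx.net:443"
AUTH = ("elastic", "changeme")

def ingest_stats():
    """Per-node, per-pipeline ingest counters from _nodes/stats/ingest."""
    r = requests.get(
        f"{ES}/_nodes/stats/ingest",
        params={"filter_path": "nodes.*.name,nodes.*.ingest.pipelines"},
        auth=AUTH,
        verify=False,  # only for self-signed certificates
    )
    r.raise_for_status()
    return r.json()["nodes"]

INTERVAL = 60  # seconds between the two samples

before = ingest_stats()
time.sleep(INTERVAL)
after = ingest_stats()

rows = []
for node_id, node in after.items():
    name = node["name"]
    prev = before.get(node_id, {}).get("ingest", {}).get("pipelines", {})
    for pipeline, stats in node.get("ingest", {}).get("pipelines", {}).items():
        old = prev.get(pipeline, {})
        rows.append((
            name,
            pipeline,
            stats["count"] - old.get("count", 0),                    # docs processed in the interval
            stats["time_in_millis"] - old.get("time_in_millis", 0),  # time spent in the pipeline
            stats["failed"] - old.get("failed", 0),
        ))

# Pipelines burning the most time on each node float to the top.
rows.sort(key=lambda r: r[3], reverse=True)
for name, pipeline, count, millis, failed in rows[:20]:
    print(f"{name:20} {pipeline:40} docs={count:<10} ms={millis:<10} failed={failed}")
```

Since those counters are cumulative since node restart, only the deltas over a window mean anything; the idea would be to run this repeatedly and log the output so it can be lined up with the CPU graphs above.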
Any leads to follow or ideas from your experience are welcome!