We are currently running ELK version 8.13.4. For the last 8 months we have repeatedly run into an indexing lag issue. We could not find a real solution at the time, so the only thing we could do was restart the coordinating node and the Logstash instances. This temporary workaround has kept us going for eight to nine months.
Today at 1:30 pm we saw a 5-minute lag (at about 60k records/minute): the log count at 1:25 pm was 2 lakh (200,000), the counts for the minutes after 1:25 pm kept decreasing, and the count for 1:30 pm only caught up at 2:00 pm. Some hours later the lag grew from 5 minutes to 1 hour. Data was still being indexed into Elasticsearch, but it was not keeping up with real time; it would only catch up after about 45 minutes.
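To make the lag figure above reproducible, here is a minimal Python sketch of how it can be estimated from per-minute document counts. The function name `estimate_lag`, the 90% completeness threshold, the date, and the sample counts are all assumptions for illustration; they are not taken from our cluster.

```python
from datetime import datetime, timedelta


def estimate_lag(counts, now, expected_per_minute, complete_ratio=0.9):
    """Return how far the newest 'complete' minute bucket trails `now`.

    counts maps the start of each minute to the number of documents
    currently visible for that minute. A bucket is treated as complete
    once it holds at least complete_ratio * expected_per_minute documents
    (an assumed threshold, tune it to your own traffic).
    """
    threshold = complete_ratio * expected_per_minute
    complete = [minute for minute, n in counts.items() if n >= threshold]
    if not complete:
        return None  # no bucket has caught up yet
    return now - max(complete)


# Illustrative numbers only: counts fall off after 13:25, as in the incident.
now = datetime(2024, 1, 1, 13, 30)  # arbitrary date
counts = {
    datetime(2024, 1, 1, 13, 24): 60_000,
    datetime(2024, 1, 1, 13, 25): 59_000,
    datetime(2024, 1, 1, 13, 26): 31_000,
    datetime(2024, 1, 1, 13, 27): 12_000,
    datetime(2024, 1, 1, 13, 28): 4_000,
    datetime(2024, 1, 1, 13, 29): 500,
}
lag = estimate_lag(counts, now, expected_per_minute=60_000)
print(lag)  # 0:05:00
```

Running the same calculation every minute against a date histogram of the index would show whether the lag is steady at 5 minutes or growing toward the 1 hour we saw later.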
We have not observed any CPU saturation, slow disks, or high heap/memory consumption.
We have 3 master nodes, 6 data nodes (each with 6 TB SSD, 125 GB of memory, 32 GB of heap, and 8 CPU cores), warm nodes (each with 15 TB SSD, 125 GB of memory, 32 GB of heap, and 8 CPU cores), and 1 coordinating node (500 GB SSD, 125 GB of memory, 32 GB of heap, and 8 CPU cores).
On the hot nodes we keep 150 shards per node with a size of 70 GB, and on the warm nodes around 250 shards per node.
At the beginning of every month, from the 1st to the 7th, the indexing rate always increases. Because we receive about 1 TB of data per day in that period, we increase the number of shards from 6 to 9 one day beforehand, and after the 8th we decrease it again.
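As a rough check on what the shard change does, the sketch below computes the daily volume each shard receives. It assumes the 6 and 9 are primary shard counts for the write index, that documents are spread evenly across them, that replicas are not counted, and that 1 TB is 1000 GB. The helper name `per_shard_gb` is made up for this example.

```python
def per_shard_gb(daily_gb, primary_shards):
    """Daily volume per primary shard, assuming an even spread of documents."""
    return daily_gb / primary_shards


DAILY_GB = 1000  # about 1 TB per day during the first week of the month

for shards in (6, 9):
    print(shards, "shards ->", round(per_shard_gb(DAILY_GB, shards), 1), "GB per shard per day")
# 6 shards -> 166.7 GB per shard per day
# 9 shards -> 111.1 GB per shard per day
```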
Does anyone know where I should look and what the key things to check are? I have not seen any abnormality on the Logstash instances either; CPU there was at 44%.