Slow indexing rate

We are currently using ELK version 8.13.4. For the last 8 months we have repeatedly run into a lag issue and have never been able to find the root cause; the only thing that helps is restarting the coordinating node and the Logstash instances. This temporary workaround has carried us through those eight or nine months.

Today at 1:30 PM we saw a 5-minute lag (we ingest roughly 60k records/minute): at 1:25 PM the log count was about 2 lakh (200k), after 1:25 PM the count kept decreasing, and the counts for 1:30 PM only caught up at around 2:00 PM. A few hours later the lag grew from 5 minutes to 1 hour; data was being indexed into Elasticsearch, but it was not catching up properly and only synced about 45 minutes behind.

We have not observed CPU saturation, slow disks, or high heap/memory consumption.

We have 3 master nodes, 6 hot data nodes (each with 6 TB SSD, 125 GB RAM, 32 GB heap, 8 CPU cores), warm nodes (each with 15 TB SSD, 125 GB RAM, 32 GB heap, 8 CPU cores), and 1 coordinating node (500 GB SSD, 125 GB RAM, 32 GB heap, 8 CPU cores).

On the hot nodes we keep around 150 shards per node, with shard sizes of about 70 GB, and on the warm nodes around 250 shards per node.
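For reference, those per-node shard counts and the disk headroom behind them can be confirmed with the _cat APIs. A minimal sketch using Python's requests, assuming the cluster is reachable at http://localhost:9200 without authentication (adjust the URL and credentials to your setup):

```python
import requests

ES = "http://localhost:9200"  # assumption: local, unauthenticated cluster

# Shards and disk space per data node (verify the ~150 / ~250 shards-per-node figures)
alloc = requests.get(
    f"{ES}/_cat/allocation",
    params={"v": "true", "h": "node,shards,disk.used,disk.avail,disk.percent"},
)
print(alloc.text)

# Node roles, heap and CPU at a glance
nodes = requests.get(
    f"{ES}/_cat/nodes",
    params={"v": "true", "h": "name,node.role,heap.percent,cpu,load_1m"},
)
print(nodes.text)
```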

At the beginning of every month (days 1 to 7) the indexing rate always increases, so one day beforehand we raise the number of primary shards from 6 to 9 because we receive about 1 TB of data per day in that period, and after the 8th we decrease the shards again.
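If that monthly 6 -> 9 change is applied through an index template, one way to script it is via the composable index template API, so that indices created afterwards pick up the new primary count (existing indices are unaffected). A rough sketch with a hypothetical template name and pattern, not our actual configuration:

```python
import requests

ES = "http://localhost:9200"  # assumption: local, unauthenticated cluster

def set_primary_shards(template_name: str, patterns: list[str], shards: int) -> None:
    """Set `shards` primaries in an index template so newly created indices use it.
    Note: PUT replaces the whole template; in practice you would GET the existing
    template, change only this setting, and PUT the merged body back."""
    body = {
        "index_patterns": patterns,
        "template": {"settings": {"index.number_of_shards": shards}},
    }
    requests.put(f"{ES}/_index_template/{template_name}", json=body).raise_for_status()

# hypothetical names: 6 -> 9 before the month starts, back to 6 after the 8th
set_primary_shards("logs-template", ["logs-*"], 9)
```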

Does anyone know where I should look and what the key points are to check? I have not seen any abnormality on the Logstash instances either; their CPU usage was around 44%.
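One of the first things worth checking during the lag window is whether the hot nodes' write thread pools are queueing or rejecting bulk requests, and whether indexing or merging is being throttled; rejections turn into retries upstream and look exactly like data that arrives late but eventually syncs. A minimal sketch, again assuming an unauthenticated cluster at localhost:9200:

```python
import requests

ES = "http://localhost:9200"  # assumption: local, unauthenticated cluster

# Write thread pool per node: a growing queue or non-zero rejections during the
# lag window points at indexing back-pressure rather than Logstash/Kafka.
tp = requests.get(
    f"{ES}/_cat/thread_pool/write",
    params={"v": "true", "h": "node_name,name,active,queue,rejected,completed"},
)
print(tp.text)

# Per-node indexing and merge throttling counters
stats = requests.get(
    f"{ES}/_nodes/stats/indices",
    params={"filter_path": "nodes.*.name,nodes.*.indices.indexing,nodes.*.indices.merges"},
).json()
for node in stats["nodes"].values():
    idx = node["indices"]
    print(node["name"],
          "index_throttle_ms=", idx["indexing"]["throttle_time_in_millis"],
          "merge_throttle_ms=", idx["merges"]["total_throttled_time_in_millis"])
```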

How are you indexing data? What bulk size are you using? Are you specifying external document IDs or are you letting Elasticsearch automatically assign IDs? Do you perform any updates or deletes?

We get data from various applications, which push it to Kafka; Logstash then pulls from Kafka and sends it to Elasticsearch.

We are not specifying any _id and we are not performing any update or delete operations.
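On the bulk-size question: with a Kafka -> Logstash -> Elasticsearch pipeline, the effective bulk request size is largely driven by pipeline.batch.size and pipeline.workers in Logstash. A small sketch of reading those values and the pipeline throughput from Logstash's monitoring API, assuming it listens on its default port 9600:

```python
import requests

LS = "http://localhost:9600"  # assumption: Logstash monitoring API on its default port

# Batch size / workers per pipeline (drives the Elasticsearch bulk request size)
pipelines = requests.get(f"{LS}/_node/pipelines").json()
for name, p in pipelines["pipelines"].items():
    print(name, "workers=", p["workers"], "batch_size=", p["batch_size"],
          "batch_delay=", p["batch_delay"])

# Event throughput and queue push times per pipeline (back-pressure indicator)
stats = requests.get(f"{LS}/_node/stats/pipelines").json()
for name, p in stats["pipelines"].items():
    ev = p["events"]
    print(name, "in=", ev["in"], "out=", ev["out"],
          "queue_push_ms=", ev["queue_push_duration_in_millis"])
```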

What do CPU usage, disk I/O, and await look like on the hot nodes when you are seeing slow indexing?

Are these locally attached SSDs or networked SSD-based storage?
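For the disk questions above: await itself is best read with iostat -x on the hot nodes, since node stats only expose cumulative throughput counters, but those counters can still be sampled during the lag window to see how hard the disks are being driven. A rough sketch, same assumptions as before:

```python
import requests, time

ES = "http://localhost:9200"  # assumption: local, unauthenticated cluster

def sample():
    """CPU percent and cumulative disk counters per node (io_stats is Linux-only)."""
    data = requests.get(f"{ES}/_nodes/stats/fs,os").json()
    out = {}
    for node in data["nodes"].values():
        io = node["fs"].get("io_stats", {}).get("total", {})
        out[node["name"]] = {
            "cpu": node["os"]["cpu"]["percent"],
            "ops": io.get("operations", 0),
            "write_kb": io.get("write_kilobytes", 0),
        }
    return out

before = sample()
time.sleep(60)          # sample over one minute during the lag window
after = sample()
for name, a in after.items():
    b = before.get(name, a)
    print(name,
          "cpu%=", a["cpu"],
          "iops/min=", a["ops"] - b["ops"],
          "write_MB/min=", round((a["write_kb"] - b["write_kb"]) / 1024, 1))
```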

Is this describing a change in the indexing load the cluster needs to cope with, or a change in indexing performance? How does this correlate with the indexing slowness you are experiencing? When you decrease the shards, are you using the shrink index API?
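For reference on that last question, the shrink flow looks roughly like the sketch below: relocate a copy of every shard of the (now read-only) source index onto one node, then shrink to a primary count that divides the original (e.g. 9 -> 3). Index and node names here are hypothetical, not taken from the cluster described above:

```python
import requests

ES = "http://localhost:9200"        # assumption: local, unauthenticated cluster
SOURCE = "logs-2024-09"             # hypothetical source index with 9 primaries
TARGET = "logs-2024-09-shrunk"      # hypothetical shrink target

# 1. Put a copy of every shard on one node and block writes on the source index
requests.put(f"{ES}/{SOURCE}/_settings", json={
    "settings": {
        "index.routing.allocation.require._name": "hot-node-1",  # hypothetical node name
        "index.blocks.write": True,
    }
}).raise_for_status()

# 2. Shrink to a divisor of the source primary count (9 -> 3), clearing the
#    temporary allocation requirement and write block on the new index
requests.post(f"{ES}/{SOURCE}/_shrink/{TARGET}", json={
    "settings": {
        "index.number_of_shards": 3,
        "index.routing.allocation.require._name": None,
        "index.blocks.write": None,
    }
}).raise_for_status()
```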