Hi, we are using Elasticsearch to store business data for Sportsbook events, and it's a critical cluster in our stack. We have at least 4-5 Elasticsearch clusters across multiple projects in our company, but in only 2 of them we see that on random occasions, roughly once every month or two and not necessarily at high-volume times, the REST client we use to write data to Elasticsearch gets stuck for 15 minutes and then recovers on its own. We have looked through all the configuration that might cause this and found that reindexing, or moving an index from one machine to another, has a 15-minute timeout, but nothing of that kind was happening at the time of the issue.
This issue also happens in our logging cluster, where data is written into Elasticsearch by Filebeat and Logstash: we see 15 minutes of logs missing in the cluster at random times.
We are using Elasticsearch version 7.17, deployed on virtual machines running Oracle Linux, with vMotion disabled.
Can anyone suggest how to debug or fix this issue?
I can't think of any 15-minute-long timeouts in ES itself, but the default Linux TCP retransmission timeout (controlled by net.ipv4.tcp_retries2) is approximately 15 minutes, so that'd be my first guess. The docs recommend a much shorter timeout.
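To see where the "approximately 15 minutes" figure comes from, here is a rough sketch of the exponential backoff behind the retransmission timeout. It assumes the usual Linux constants (a 200 ms minimum and 120 s maximum retransmission timeout) and the default net.ipv4.tcp_retries2 of 15. The function name is made up for illustration, and the real kernel derives its timeout from the measured round-trip time, so treat the result as a lower-bound estimate rather than an exact value.

```python
# Simplified model of Linux TCP retransmission backoff (illustrative only).
# Assumed constants: TCP_RTO_MIN = 200 ms and TCP_RTO_MAX = 120 s.
TCP_RTO_MIN = 0.2
TCP_RTO_MAX = 120.0


def retransmission_timeout(retries: int,
                           rto_min: float = TCP_RTO_MIN,
                           rto_max: float = TCP_RTO_MAX) -> float:
    """Estimate, in seconds, how long a connection with unacknowledged data
    survives before the kernel gives up, for a given tcp_retries2 value.

    Hypothetical helper: the wait doubles after every attempt and is capped
    at rto_max; the original send plus each retransmission must time out,
    which gives retries + 1 waits in total.
    """
    total = 0.0
    rto = rto_min
    for _ in range(retries + 1):
        total += rto
        rto = min(rto * 2, rto_max)
    return total


if __name__ == "__main__":
    # Default tcp_retries2 = 15: roughly 924.6 s, i.e. a little over 15 minutes.
    print(f"tcp_retries2=15 -> {retransmission_timeout(15):.1f} s")
    # A lower value cuts the wait to a matter of seconds in this model.
    print(f"tcp_retries2=5  -> {retransmission_timeout(5):.1f} s")
```

With the default of 15 retries the model gives about 924.6 seconds, which matches the 15-minute stalls described above. The current value on a host can be read from /proc/sys/net/ipv4/tcp_retries2. As far as I know, the Elasticsearch documentation suggests lowering it to 5.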
Thank you, David, for pointing this out. We will make this config change in our logging cluster, observe it for a couple of months, and see whether the behaviour improves.