We are using an on-prem service to post data to our Elastic cluster in the cloud, hosted by Elastic. Until recently we had one instance running on AWS in Frankfurt to which we pushed data, and we never experienced timeouts or any other issues on the API calls. A couple of weeks ago we switched to a new Azure instance, and immediately after the switch we noticed quite bad API performance: we see frequent response times of 10 to 20 seconds. At peak we push about 8k small messages per minute.

The health of the cluster seems fine: no out-of-memory errors, no weird CPU spikes, but I still get these massive timeouts, which cause hundreds of messages to fail. I'm using a data stream with a template to roll over from hot to warm, but I'm still on my first index. Does anyone have an idea why the API calls are sometimes so slow?
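For context, the ingest path looks roughly like the sketch below. This is a minimal sketch, assuming the official Python client and bulk-style pushes; the data stream name, document shape, and timeout/retry settings are illustrative rather than our exact code.

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

# Connect to the Elastic Cloud deployment (cloud_id and api_key are placeholders).
es = Elasticsearch(
    cloud_id="<deployment-cloud-id>",
    api_key="<api-key>",
    request_timeout=30,  # fail faster than the 10-20 s responses we sometimes see
)

def actions(messages):
    # Each small message becomes one bulk action targeting the data stream.
    # Data streams only accept "create" as the op type.
    for msg in messages:
        yield {"_op_type": "create", "_index": "logs-myapp-default", "_source": msg}

def push(messages):
    ok_count, failed = 0, []
    # streaming_bulk retries rejected items with exponential backoff instead of
    # letting hundreds of messages fail in one go.
    for ok, item in streaming_bulk(
        es,
        actions(messages),
        chunk_size=500,
        max_retries=3,
        initial_backoff=2,
        raise_on_error=False,
    ):
        if ok:
            ok_count += 1
        else:
            failed.append(item)
    return ok_count, failed
```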
|Oct 31, 2020, 2:01:08 AM UTC|ERROR|i5@westeurope-3|[instance-0000000005] collector [job_stats] timed out when collecting data|
|Oct 31, 2020, 2:01:06 AM UTC|ERROR|i2@westeurope-2|[instance-0000000002] collector [node_stats] timed out when collecting data|
|Oct 31, 2020, 1:23:18 AM UTC|ERROR|i5@westeurope-3|[instance-0000000005] collector [cluster_stats] timed out when collecting data|
After I resized the warm tier of the cluster to something larger (from 4 GB to 15 GB), the response times dropped back to normal, and I didn't see the ERROR messages anymore. Could it be that the fresh provisioning discarded the erroneous instances?
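For reference, the ILM phase of the data stream's backing indices can be checked roughly like this, to see whether anything has actually moved to warm yet (a sketch, assuming the official Python client; the data stream name is a placeholder):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(cloud_id="<deployment-cloud-id>", api_key="<api-key>")

# Show which ILM phase each backing index of the data stream is in
# (hot vs. warm), and whether any step is stuck or has errored.
resp = es.ilm.explain_lifecycle(index="logs-myapp-default")
for name, info in resp["indices"].items():
    print(name, info.get("phase"), info.get("step"), info.get("step_info", {}))
```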
I browsed through all the monitoring stats but couldn't see anything that seemed concerning. Is there some kind of matrix that shows available threads/IOPS per node size? Right now I can only see the amount of memory in GB, which also translates to storage.
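In the meantime, the per-node write thread pool size and any rejections can be pulled from the cluster itself, which at least shows whether bulk requests are being queued or rejected (a sketch, assuming the official Python client; the columns are standard _cat/thread_pool headers):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(cloud_id="<deployment-cloud-id>", api_key="<api-key>")

# Per-node write thread pool: configured size, current queue length, and
# rejected requests. A growing "rejected" count would explain failed bulk
# requests even when CPU and memory look healthy.
pools = es.cat.thread_pool(
    thread_pool_patterns="write",
    h="node_name,name,size,queue,rejected,completed",
    format="json",
)
for p in pools:
    print(p)
```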