I am trying to understand some intermittent slowness in my stack. The symptom: I go into Kibana and run a query that filters only by time ("last 24 hours", for example), and I get results in 2 seconds. Ten minutes later I run the same query again, and it blows past the timeout, taking around 90 seconds to respond.
Looking at the Stack Monitoring graphs in Kibana, I see no spikes around the time this happened, except in the "Search Latency" graph, which I assume just reflects the problem itself. As far as I can tell, the Elasticsearch nodes have plenty of free resources (CPU, JVM heap, disk, etc.).
The query time seems to be proportional to the number of matching results. To check this, I ran several queries and plotted "number of results" vs. "response time", which gave me a roughly straight line: a query that matches half as many results takes about half as long to respond.
I also tested different sample sizes (the number of results actually returned to Kibana), and that did not significantly change the response times. Likewise, running with and without a scripted field I use made no real difference. In other words, neither of these seems to be the issue.
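For reference, stripped of Kibana specifics, the query I am timing boils down to a POST to /syslog-*/_search with a body like the one below. The timestamp field is Logstash's default @timestamp, and the script is just a placeholder standing in for my real scripted field:

```json
{
  "size": 500,
  "query": {
    "bool": {
      "filter": [
        { "range": { "@timestamp": { "gte": "now-24h", "lte": "now" } } }
      ]
    }
  },
  "script_fields": {
    "my_scripted_field": {
      "script": { "lang": "painless", "source": "1" }
    }
  }
}
```

Varying "size" and dropping the "script_fields" block correspond to the sample-size and scripted-field tests described above.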
Some more information about the stack setup:
The Elasticsearch nodes run in Docker containers on separate hosts. Each node has 16 GB of JVM heap, and heap utilization stays around 60% on all nodes. The Elasticsearch data folder is bind-mounted from the host. The cluster is running in "production" mode.
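For completeness, each node is launched roughly like this (the image version, hostnames, and paths here are illustrative rather than my exact values):

```yaml
# one Elasticsearch node per host, 16 GB heap, data dir bind-mounted
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.17.0
    environment:
      - ES_JAVA_OPTS=-Xms16g -Xmx16g
      - node.name=es-node1
      - discovery.seed_hosts=es-host2,es-host3
    volumes:
      - /data/elasticsearch:/usr/share/elasticsearch/data
    network_mode: host
```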
Logstash collects syslog from several different network devices. Each message runs through a series of grok filters that parse the various syslog formats, and the parsed events are then pushed into a "syslog-*" index in Elasticsearch. There is nothing beyond that; in other words, I have not explicitly configured anything on the Elasticsearch side.
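The pipeline is essentially of this shape. The real config chains many more grok patterns than the single stand-in shown here, and the hosts and daily index suffix are illustrative:

```
input {
  syslog { port => 514 }
}

filter {
  grok {
    # simplified stand-in for the real chain of per-device patterns
    match => { "message" => "%{SYSLOGLINE}" }
  }
}

output {
  elasticsearch {
    hosts => ["http://es-host1:9200"]
    index => "syslog-%{+YYYY.MM.dd}"
  }
}
```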
Any suggestions or pointers for tracking this down would be greatly appreciated.