Query efficiency (Kibana / time-based index patterns)

Can you share a specific example of two searches via the API that do the same thing but exhibit the performance difference we're discussing, cutting Kibana out of the equation? Can you then share the results for each (at least the took and _shards fields as above)? Then can you profile those searches to help us see where the time is actually being spent? There'll be more output from that than this forum can cope with, so use https://gist.github.com (or similar) to share it.
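As a sketch of what that could look like (using the index pattern and time range that appear later in this thread), you can add "profile": true to the search body to get a per-shard timing breakdown in the response:

```
GET /log_*/_search
{
  "profile": true,
  "query": {
    "range": {
      "@timestamp": {
        "gte": 1567627030000,
        "lte": 1567627040000,
        "format": "epoch_millis"
      }
    }
  }
}
```

The profile section of the response is verbose, which is why a Gist is the best place to share it.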

The time range for each shard is already (effectively) cached for quick access: this is how the pre-filter phase mentioned above works. In addition, Elasticsearch optimises a range-based search into a match_none search on shards that don't contain any docs that match the range, and match_none searches should run pretty quickly. The profiler output should help to show why these two mechanisms are not doing what we want.
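One detail worth knowing: in 6.x the pre-filter (can_match) phase only runs when a search targets more than pre_filter_shard_size shards (the default is 128), so a useful experiment is to force it on explicitly and compare timings. A sketch, using the same endpoint and pattern as elsewhere in this thread:

```
GET /log_*/_search?pre_filter_shard_size=1
{
  "query": {
    "range": {
      "@timestamp": { "gte": 1567627030000, "lte": 1567627040000, "format": "epoch_millis" }
    }
  }
}
```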

Here's the query:

"query": {
"bool": {
"must": [
"query_string": {
"analyze_wildcard": true,
"query": "+exists:"message""
"range": {
"@timestamp": {
"gte": 1567627030000,
"lte": 1567627040000,
"format": "epoch_millis"

curl "https:/loghost:9443/log_*/_search" | jq -S 'del(.hits)'

"_shards": {
"failed": 0,
"skipped": 3096,
"successful": 3174,
"total": 3174
"timed_out": false,
"took": 4861

curl "https:/loghost:9443/log_*-2019.08.04/_search" | jq -S 'del(.hits)'

"_shards": {
"failed": 0,
"skipped": 0,
"successful": 75,
"total": 75
"timed_out": false,
"took": 201

I took out the actual returned docs. I ran both several times beforehand, so anything that was going to be cached already was.

We only have about a month of indices searchable right now. It gets worse the more we keep open.

And no, I can't profile the queries. Apparently that doesn't work when DLS (document-level security) is active.

Ok, thanks for trying. I didn't know that about DLS.

Incidentally, as far as I can see skipped shards are also counted as successful ones:

So in your first search we skipped 3096 shards and only actually hit 3174-3096=78 of them with a search.
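That arithmetic as a one-liner:

```shell
# Shards that actually executed the search = successful - skipped
successful=3174
skipped=3096
echo $((successful - skipped))   # prints 78
```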

Also there are 75 shards from the single day 2019.08.04? That seems like a lot. How large is each shard?

Also did these two searches yield the same documents? The time range you used, 1567627030000 to 1567627040000, occurred on 2019-09-04 whereas the one-day search you did was for 2019-08-04.

A colleague pointed me to #40263 which was merged into 6.7.2 and which might speed up queries that hit a lot of indices. Can you upgrade and see if this helps?

Yes, that's about right. 25 indices, three shards each. Sizes vary a lot. We're working on converting to ILM, which will help with that.


escurlget cat/indices | grep log | grep 2019.09.04
green open log_db-2019.09.04 aZsyYzdeThGMzRpipcYDOw 3 1 39631293 0 49.9gb 25gb
green open log_httpd-2019.09.04 pWW0LDBGS4G1oiLS83M_oQ 3 1 83526892 0 67.6gb 33.7gb
green open log_winevent-2019.09.04 JX5X9lxLTIONTvyM9W1qAw 3 1 12599 0 23.2mb 11.8mb
green open log_winevent_system-2019.09.04 QIa3W_NQRQSiY5CmG9tdcw 3 1 5301323 0 4.9gb 2.5gb
green open log_packetfilter-2019.09.04 p8l7DS0iS4SNHj0DneKIbQ 3 1 12434733 0 23.7gb 11.8gb
green open log_netrecon-2019.09.04 T2X2W9QST3q8RUqa-bTHBQ 3 1 47868237 0 21.2gb 10.5gb
green open log_winevent_security-2019.09.04 FBzpcHdgQkOLtpyFkyXcIQ 3 1 62944427 0 111.8gb 55.8gb
green open log_winevent_application-2019.09.04 J1f-8Nd0QYKOvCcZjF1O5Q 3 1 1178026 0 1.7gb 890.7mb
green open log_aaa-2019.09.04 S5lZRxRCTzWNuRjW4n9VLA 3 1 45763897 0 66.6gb 33.1gb
green open log_filebeat-2019.09.04 ursirM9DRPiCVQooBPmWcQ 3 1 59344 0 51.8mb 25.9mb
green open log_mail-2019.09.04 68IX3FbDTm-1NnrZQTrQdg 3 1 16026509 0 33.1gb 16.5gb
green open log_docker-2019.09.04 -nWnxvS1Q5Kw5v_cLmQnRw 3 1 6677798 0 8.5gb 4.2gb
green open log_syslog_asa-2019.09.04 fQKkWZJyRASwttFAFJvhSg 3 1 283824 0 505.2mb 252.6mb
green open log_syslog_ldap-2019.09.04 Fs9x1osHQTi6wdf_zoq2Kw 3 1 101865433 0 102.6gb 51.2gb
green open log_dns-2019.09.04 rariAEJbQZGSqTwOCKYnVg 3 1 1827543 0 2.2gb 1.1gb
green open log_syslog_sshd-2019.09.04 SvNqdmYeSS60mio0RsAttg 3 1 954557 0 873mb 436.3mb
green open log_ids-2019.09.04 l7Lwp8lkQq2zv-ur07cxnA 3 1 2721 0 6.3mb 3.1mb
green open log_dhcp-2019.09.04 Sf_k74HmRHeZo2pWVAmX0Q 3 1 38135427 0 35.9gb 18gb
green open log_syslog_tmm-2019.09.04 gZINfDU0Scepx0FZO69m9g 3 1 8972058 0 7.2gb 3.6gb
green open log_winevent_powershell-2019.09.04 l37eeTDyTP61Vmb66Qmezg 3 1 226886 0 605.4mb 302.8mb
green open log_uc-2019.09.04 VHsMIq-yQqiBpjg8NfkBJQ 3 1 107085 0 109.5mb 54.8mb
green open log_sbc-2019.09.04 TqJsmixYTuqy-q0NBpZGKw 3 1 37998 0 391.2mb 195.7mb
green open log_tomcat-2019.09.04 lFVUbuJASmanCupCXzf2Gg 3 1 35727 0 29.8mb 14.9mb
green open log_wlan-2019.09.04 frdT2ZuHSrG8cBDy5ZGCOQ 3 1 12507402 0 13.5gb 6.7gb
green open log_syslog-2019.09.04 lTV6lI8XQgOQoMzqH7EwJQ 3 1 34669610 0 43.2gb 21.7gb

Yes, that was a typo, sorry.

Not quite the same results - I'm hitting the size limit, but it looks like they'd be the same if I narrowed things down a bit.

I would say that you have too many shards here, by about a factor of 4. None of these indices is large enough to warrant three primaries: only two of them exceed 40GB (which would suggest a need for ≥ 2 primaries) and none exceeds 80GB (which would suggest ≥ 3). Many of them are measured in MB rather than GB, and those could reasonably be monthly indices instead.

Yes, and now that we have access to ILM (we've recently updated to 6.x), we'll be able to clean that up. That's not the issue.
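For reference, a minimal sketch of what such an ILM cleanup policy could look like. The policy name, age, and retention here are hypothetical placeholders; the 40GB rollover threshold follows the per-shard sizing guidance above:

```
PUT /_ilm/policy/log_rollover_policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "40gb",
            "max_age": "30d"
          }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}
```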

The issue is that Elasticsearch spends far too much time determining that (for example) the only index that might have documents in the desired time range is log_syslog-2019.09.04. It spends 20 times as long checking all the other log_syslog-* indices, when the pre-search should short-circuit that.

If I open two months of logs instead of just one, that becomes 40 times longer. After three months, things just time out.

So.... How should we test/fix the pre-search?

Oh, when we started, the recommended minimum shard count was 5. We trimmed that down. Some of those indices only have 2.

Re-upping this comment:

There was an issue identified in 6.7.1 and fixed in 6.7.2. The first thing I'd suggest is upgrading.

Ahhh, thanks. We're already looking at upgrading (probably 6.8.3, that was current last I looked), in prep for moving to 7.x.

Question: Do we need 6.7.2 on all nodes, or just the coordinating node?

(just the coordinating doesn't seem to help. It will take me a few days to get the entire cluster upgraded.)

I think it'd be best to upgrade everything before drawing any conclusions.
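A quick way to confirm that every node is actually running the new version is the cat nodes API, which supports selecting the name and version columns:

```
GET /_cat/nodes?v&h=name,version
```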

Sorry for the delay. We finally got everything updated. (it's production, there were issues.)

We're seeing a minor improvement, but doubling our open indices still doubles our search time.

Any ideas?

Thanks for getting back to us.

My best idea for a next step is to start looking at the timings of the individual messages that make up a search using the transport tracer. This will help to clarify exactly where the time is going, although it is sometimes tricky to get a clean trace from a busy production system, and trace logging can slow things down a bit too so proceed with caution. Ideally you want to use a coordinating node that's not being used by anything else. I'm not going to be available for the next few days to give more detailed guidance, but will be back later next week if nobody else has stepped in.

Thanks. I'll take a look at that.

We do have a coordinating node (two, actually), which makes things a bit more consistent.

Hi @darkmoon,

I tried the following settings on a 6.8.3 cluster:

PUT /_cluster/settings
{
  "transient": {
    "transport.tracer.include": "indices:data/read/search*",
    "logger.org.elasticsearch.transport.TransportService.tracer": "TRACE"
  }
}

This enables logging for the transmission and receipt of every message involved in a search, like this:

[2019-10-09T08:39:31,242][TRACE][o.e.t.T.tracer           ] [node-0] [12012][indices:data/read/search[phase/query]] sent to [{node-3}{unhUmLKIRMS55s6dBS2qPQ}{e09vwfZhTuWicQyWzeuk-Q}{}{}{ml.machine_memory=17179869184, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}] (timeout: [null])
[2019-10-09T08:39:31,243][TRACE][o.e.t.T.tracer           ] [node-3] [12012][indices:data/read/search[phase/query]] received request
[2019-10-09T08:39:31,244][TRACE][o.e.t.T.tracer           ] [node-3] [12012][indices:data/read/search[phase/query]] sent response
[2019-10-09T08:39:31,244][TRACE][o.e.t.T.tracer           ] [node-0] [12012][indices:data/read/search[phase/query]] received response from [{node-3}{unhUmLKIRMS55s6dBS2qPQ}{e09vwfZhTuWicQyWzeuk-Q}{}{}{ml.machine_memory=17179869184, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}]

Here node-0 is the coordinating node and node-3 is one of the data nodes. I'd like to look at the timings of these messages for a single problematic search. The best way to isolate a single search is to use a coordinating node that isn't being used for any other searches at the time. Ideally do this when there aren't too many other searches happening elsewhere either, because the data nodes will be logging messages about all ongoing searches. We can pick out the bits we need, but the less noise there is the better. Once the search has completed you can disable the tracer again:

PUT /_cluster/settings
{
  "transient": {
    "transport.tracer.include": null,
    "logger.org.elasticsearch.transport.TransportService.tracer": null
  }
}

Then could you share the log lines matching o.e.t.T.tracer from that search, from the coordinating node and every one of the data nodes?
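A self-contained sketch of how those lines can be filtered and correlated: the request ID in square brackets (the [12012] field in the example output above) ties together the sent/received messages for one search across the coordinating and data nodes. The sample line here is abbreviated from the tracer output shown earlier; the log path on a real node depends on your install.

```shell
# Filter a node log down to tracer lines, then extract the request ID so
# messages belonging to a single search can be grouped across nodes.
line='[2019-10-09T08:39:31,242][TRACE][o.e.t.T.tracer] [node-0] [12012][indices:data/read/search[phase/query]] sent to [{node-3}]'
printf '%s\n' "$line" \
  | grep 'o.e.t.T.tracer' \
  | sed -E 's/.*\[([0-9]+)\]\[indices.*/\1/'   # prints 12012
```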

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.