Why are multi-index queries so slow? And how to troubleshoot them?

Elasticsearch 5.6.4 on CentOS

We recently noticed that multi-index queries against our production Logstash database became very slow (5-10+ seconds). After some investigation, the problem looks pretty weird.

OBSERVATIONS:

  • The multi-index query is slow no matter which field you are matching. It will be slow with any single "keyword"-type field and even with an "_id" query (example below)
  • The query is slow only when touching multiple indices. If a date range is specified in the query OR the query is converted to search one index only, then it becomes fast
  • The weirdest part is that the query is only slow when it returns data. For example, the "_id" query below will be slow if a record with that "_id" exists (10+ sec), and will be fast (300ms) if such a record does not exist.
  • At the same time, we could not find a single Elasticsearch performance metric that would show that anything at all is "slow": average query latency shows in the 40ms range, and memory usage is normal.

QUESTIONS:

  • What performance counters should we use to see that we do have queries that run for 10+ seconds?
  • Why does a query that returns 0 documents run in 300ms, while a similar query that returns 1 document runs for 10+ seconds?
  • Can the performance of the multi-index query be improved? Is it expected that a multi-index query on a single "keyword" field is extremely slow?
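
A side note on the first question: Elasticsearch has a per-index search slow log that can record individual queries exceeding a threshold, which may be easier to catch than averaged latency metrics. A minimal sketch, assuming the logstash index pattern used below; the threshold values are illustrative:

```
PUT /logstash-2018.03.*/_settings
{
	"index.search.slowlog.threshold.query.warn": "5s",
	"index.search.slowlog.threshold.query.info": "1s"
}
```

Queries crossing a threshold are then written to the slow log on the node that executed them.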

QUERY EXAMPLE:

logstash-2018.03.*/_search
{
	"from": 0,
	"size": 10,
	"query": {
		"match": {
			"_id": "mdupTmIBax2WfuBpAVv4"
		}
	}
}

Thank you!
Alex

How much data do you have? How many indices and shards are you querying when it gets slow? What does disk I/O and iowait look like when a slow query is run compared to a fast one?

Thank you for looking into this one!

How much data do you have?

~500GB. The machine has 30GB of memory, of which ~10GB is reported as MemAvailable by "cat /proc/meminfo"

How many indices and shards are you querying when it gets slow?

Single shard with ~25 indices.

What does disk I/O and iowait look like when a slow query is run compared to a fast?

iostat shows that reads jump to ~300-350 MB/sec during both "slow" and "fast" queries, but they stay there much longer for "slow" queries (for the whole duration of the slow query).

Does that help to shed some light? Any ideas on what our next step could be?

Thank you!

Sounds like you may be suffering from slow storage. What type of storage are you using? What does "iostat -x" show during slow queries?

The storage is a "High-Performance (SSD)" block storage on Rackspace Cloud.

Here is an "iostat -x" response during a slow query:

[root@logstash1b-rscloud ~]# iostat -x 1 5
Linux 3.10.0-693.17.1.el7.x86_64 (logstash1b-rscloud)   03/27/2018      _x86_64_        (4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.39    2.37    0.77    0.86    0.07   92.54

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              0.00     0.52    0.37    0.83     6.45     7.36    22.95     0.00    3.24    2.58    3.54   0.68   0.08
xvdb              0.02    46.03   28.86   62.01   782.00  1672.02    54.01     1.22   13.40   10.27   14.85   0.32   2.89

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          13.85   14.87    5.38   57.69    0.77    7.44

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
xvdb              2.00   209.00 9080.00  221.00 300508.00  4468.00    65.58   145.41   15.69   15.87    8.25   0.10  96.40

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.35    0.26    4.35   74.42    1.02   15.60

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
xvdb              1.00    65.00 9976.00   58.00 375148.00   740.00    74.92   147.80   14.78   14.82    8.05   0.10 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.28    0.25    3.80   53.42    1.01   39.24

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              0.00     0.00    1.00    0.00    28.00     0.00    56.00     0.00    2.00    2.00    0.00   2.00   0.20
xvdb              1.00     0.00 8700.00    0.00 291568.00     0.00    67.03   138.27   15.83   15.83    0.00   0.11  99.90

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.57    0.00    5.10   50.00    0.77   40.56

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              0.00     0.00    0.00    9.00     0.00    36.00     8.00     0.01    1.00    0.00    1.00   0.11   0.10
xvdb              1.00     0.00 9652.00    8.00 388356.00   216.00    80.45   138.02   14.34   14.35    6.88   0.10 100.00

Any ideas/suggestions?

I am new to Elasticsearch, so maybe the numbers I am seeing are "normal"? But from my point of view, querying by a single unique value on a 500GB database on a hot machine should never take 20 seconds. That makes Elasticsearch unusable for all practical applications.

And if there are no built-in tools to troubleshoot this, then I am not even sure how ES got this far at all :frowning:

Alex

It looks like you have a lot of iowait, so the storage (or the connection to it if it is not local) could be the cause of the slowness.

It looks like you have a lot of iowait, so the storage could be the cause of the slowness.

This does not make sense to me. Hardware slowness should affect all queries equally.

As I mentioned in the original post, the query is always fast if I search for a value that does not exist in the database. The query is only slow when it returns one or a few documents (which are small, a few KB each).

The "slow connection/slow storage" theory would explain it if all queries were slow, OR if the query that returns NO results were the slow one (because it needs to search ALL indices in the database before giving up). But the fact that a query returning no documents is 10-50x faster than a query returning 1 document (same query, same DB, same connection) cannot be explained by general hardware slowness.

My common sense tells me this has to be an ES-related issue. There is no other reasonable explanation for a 50x+ performance difference (300ms vs 20 seconds).

Alex

Elasticsearch can efficiently keep track of the range of values a specific field covers in each index. It can therefore efficiently determine which indices may hold relevant data without actually having to search through the index.

If you only query a small number of indices, or include a time range which most indices do not cover, only a very small number of indices need to be searched in either case. This results in a lot less disk I/O compared to the case where a majority of indices may contain data within the queried time range and have to be searched.
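
For example, adding a time filter to the kind of query from the original post lets most of the daily indices be skipped. A sketch, assuming the Logstash default @timestamp field; "some_keyword_field" is a placeholder for whatever keyword field is being matched:

```
logstash-2018.03.*/_search
{
	"query": {
		"bool": {
			"filter": [
				{ "term": { "some_keyword_field": "some_value" } },
				{ "range": { "@timestamp": { "gte": "2018-03-26", "lt": "2018-03-28" } } }
			]
		}
	}
}
```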

Querying also results in a lot of random access I/O rather than large sequential reads, so I would not rule out latency to storage playing a part in the slowness.

If you are using time-based indices, have you run a force merge on indices that are no longer written to, in order to reduce the number of segments Elasticsearch needs to search (set the max_num_segments parameter to e.g. 1)? This is a very I/O-intensive operation, but it can give improved query performance.
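
The force merge call would look something like this (the index name is illustrative; only run it against indices that are no longer being written to):

```
POST /logstash-2018.03.01/_forcemerge?max_num_segments=1
```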

I have run "_forcemerge" and see no measurable improvement in search time.

Christian, could you help me understand what my next step is? Right now I have unusable query performance on a relatively small database. My options are either to abandon ES and go back to MS SQL Server, or to try to troubleshoot the slow ES query.

  • What is the expected ES performance for a query on a single unique keyword field on a 500GB database with 1.5B documents on an SSD-backed machine with 30GB of RAM? I hope the answer is not "10 seconds".

  • If what I see is not the expected ES performance, then where can I find documentation, or a person who can help solve this?

Alex

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.