Hello,
Our team recently upgraded from ES 1.1.2 to 1.3.2 and we are happy with the
improvements ... except for one perplexing situation.
We are running on Azure worker roles with Oracle Java 1.8u11 and using the
G1 GC. It is possible this is due to G1, but please consider all of the
data below before you pull out a pat response on G1.
Our cluster has 18 nodes, 3 of which are dedicated masters. We have three
indexes, 5 shards and one replica each. The primary index is about 30gb
total (5.9gb per shard and the shards are the same size). We have five
types in the main index, with about 10 fields each: a mix of strings,
dates, bools, longs. None of the strings are analyzed.
All of the 18 nodes are client nodes and Azure is set up to round robin
requests. We have considered creating dedicated client nodes, but haven't
done so yet.
The query I have been using is a combination of a non-trivial filter, a
terms aggregation and two sum aggregations nested beneath the terms
aggregation:
{
  "query": { "filtered": { "filter": { "bool": { … } } } },
  "aggs": {
    "name1": {
      "terms": { "field": "stringfield1" },
      "aggs": {
        "sum1": { "sum": { "field": "longfield1" } },
        "sum2": { "sum": { "field": "longfield2" } }
      }
    }
  }
}
I have run the tests on the cluster when it was lightly loaded (some
indexing plus lightweight metrics queries) and run the tests when there was
no load. I’ll be the first to admit I can be even more systematic, but the
results I have are consistent enough, and hard enough to explain, that I
wanted to write to this community.
The primary test uses a filter which always results in an empty set. The
filter contains two must terms, one must range and three must_not terms. Since
I only care about the aggregation results, this is a search_type=count
query.
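
For anyone who wants to try to reproduce this, here is a rough sketch of the
request body in Python. It only mirrors the shape described above (two must
terms, one must range, three must_not terms, plus the aggregations). The
filter field names and values (placeholder_*) are made up for illustration
and are not our real ones; only stringfield1, longfield1 and longfield2 come
from the query above.

```python
import json


def build_body(with_aggs):
    """Build the search body; it is meant to be sent with ?search_type=count."""
    # All placeholder_* names and values below are hypothetical.
    bool_filter = {
        "must": [
            {"term": {"placeholder_field_a": "a"}},
            {"term": {"placeholder_field_b": "b"}},
            {"range": {"placeholder_date": {"gte": "2014-01-01", "lt": "2014-02-01"}}},
        ],
        "must_not": [
            {"term": {"placeholder_field_c": "c"}},
            {"term": {"placeholder_field_d": "d"}},
            {"term": {"placeholder_field_e": "e"}},
        ],
    }
    body = {"query": {"filtered": {"filter": {"bool": bool_filter}}}}
    if with_aggs:
        # Terms aggregation with two sum aggregations nested beneath it.
        body["aggs"] = {
            "name1": {
                "terms": {"field": "stringfield1"},
                "aggs": {
                    "sum1": {"sum": {"field": "longfield1"}},
                    "sum2": {"sum": {"field": "longfield2"}},
                },
            }
        }
    return body


if __name__ == "__main__":
    print(json.dumps(build_body(with_aggs=True), indent=2))
```

Calling build_body(False) gives the filter-only variant, so the two cases
differ in nothing but the "aggs" section.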
If I run the query/filter without the aggregations, the time taken
(results.took from ES) is ~0 (sometimes as high as 15ms). That makes
sense.
The case that doesn’t make sense is when I run the same filter on the same
cluster under the same conditions, this time WITH the aggregations, and I get
anywhere from 200ms to 40000ms. Yes, a factor of 200x. I could believe
200ms to account for some overhead of the aggregations machinery, but
40000ms? And there is no pattern that I can tell as to when 200ms is
returned vs. 40000ms.
Given that Azure round robins the queries, I can imagine that depending on
which nodes are involved, the query might take more or less of the time. In
fact, I would expect some variations.
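
One way to take the round robin out of the picture would be to send the same
request to each node directly and compare the "took" values per node. The
sketch below is only an outline of that idea, not something we have run in
this form: the node URLs and index name are hypothetical, and it assumes each
node's HTTP port is reachable directly rather than through the Azure load
balancer.

```python
import json
import statistics
import urllib.request

# Hypothetical values; substitute the real node addresses and index name.
NODES = ["http://node-01:9200", "http://node-02:9200"]
INDEX = "main_index"


def took_ms(node_url, index, body):
    """POST a count search to a single node and return the reported 'took' (ms)."""
    url = "%s/%s/_search?search_type=count" % (node_url, index)
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read().decode("utf-8"))["took"]


def summarize(samples):
    """Reduce a list of 'took' values to min, median and max."""
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


def profile_nodes(nodes, index, body, runs=10):
    """Run the same body several times against each node and summarize per node."""
    return {
        node: summarize([took_ms(node, index, body) for _ in range(runs)])
        for node in nodes
    }
```

If one or two nodes consistently show the multi-second times while the rest
stay near 200ms, that would point at those nodes rather than at the query.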
The other piece of data is that in trying to debug this I restarted ES on
some of the nodes. By the time I had restarted the third node the
query/filter + all aggregations case now returned 200ms consistently.
My question is how it is possible for an empty filter + aggregations to
result in 40000ms time. I tried the same filter and only the terms
aggregation (not the sums); the result was in the 3500-4000ms range – in
case that matters.
Hopefully this makes sense to someone. I’m pulling my hair out and my
colleagues on our internal ES alias are stumped as well.
Thanks for any help,
Craig.