Highly variable query performance with ES 1.3.2 (filter + aggregations)

Hello,

Our team recently upgraded from ES 1.1.2 to 1.3.2 and we are happy with the
improvements ... except for one perplexing situation.

We are running on Azure worker roles with Oracle Java 1.8u11 and the G1 GC.
It is possible this is due to G1, but please consider all of the data below
before reaching for a pat response about G1.

Our cluster has 18 nodes, 3 of which are dedicated masters. We have three
indexes, each with 5 shards and one replica. The primary index is about 30GB
total (5.9GB per shard, and the shards are all the same size). We have five
types in the main index, each with about 10 fields: a mix of strings,
dates, bools, and longs. None of the strings are analyzed.
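For concreteness, a mapping along these lines would look roughly as below. The field names are my own invention; only the shape (not-analyzed strings plus longs, dates, and bools) is taken from the description above.

```python
import json

# Hypothetical sketch of one of the five types' mappings: ~10 fields,
# strings not analyzed, plus longs/dates/bools. Names are placeholders.
mapping = {
    "type1": {
        "properties": {
            "stringfield1": {"type": "string", "index": "not_analyzed"},
            "longfield1":   {"type": "long"},
            "longfield2":   {"type": "long"},
            "created":      {"type": "date"},
            "active":       {"type": "boolean"},
        }
    }
}

print(json.dumps(mapping, indent=2, sort_keys=True))
```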

All of the 18 nodes are client nodes and Azure is set up to round robin
requests. We have considered creating dedicated client nodes, but haven't
done so yet.

The query I have been using is a combination of a non-trivial filter, a
terms aggregation and two sum aggregations nested beneath the terms
aggregation:

{ "query": { "filtered": { "filter": { "bool": { … } } } },
  "aggs": { "name1": { "terms": { "field": "stringfield1" },
      "aggs": { "sum1": { "sum": { "field": "longfield1" } },
                "sum2": { "sum": { "field": "longfield2" } } } } } }

I have run the tests on the cluster when it was lightly loaded (some
indexing plus lightweight metrics queries) and when there was no load. I’ll
be the first to admit I could be more systematic, but the results I have are
consistent enough, and hard enough to explain, that I wanted to write to
this community.

The primary test uses a filter which always results in an empty set. The
filter contains two must terms, one must range, and three must_not terms.
Since I only care about the aggregation results, this is a search_type=count
query.
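As a concrete sketch, this is roughly the request body involved. The actual bool clauses are elided above, so the field names and values here are placeholders I made up; only the two-term/one-range must, three-term must_not shape and the search_type=count parameter come from the description.

```python
import json

# Placeholder reconstruction of the test query. Real field names and
# values are not in the post; the clause shape matches the description.
body = {
    "query": {"filtered": {"filter": {"bool": {
        "must": [
            {"term": {"stringfield2": "somevalue"}},
            {"term": {"stringfield3": "othervalue"}},
            {"range": {"created": {"gte": "2014-09-01"}}},
        ],
        "must_not": [
            {"term": {"stringfield4": "a"}},
            {"term": {"stringfield4": "b"}},
            {"term": {"stringfield4": "c"}},
        ],
    }}}},
    "aggs": {"name1": {
        "terms": {"field": "stringfield1"},
        "aggs": {
            "sum1": {"sum": {"field": "longfield1"}},
            "sum2": {"sum": {"field": "longfield2"}},
        },
    }},
}

# POSTed to http://<host>:9200/<index>/_search?search_type=count
# so only aggregation results (no hits) come back.
print(json.dumps(body))
```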

If I run the query/filter without the aggregations, the time taken
(results.took from ES) is ~0 (sometimes as high as 15ms). That makes
sense.

The case that doesn’t make sense is this: I run the same filter on the same
cluster under the same conditions, this time WITH the aggregations, and I
get anywhere from 200ms to 40000ms. Yes, a factor of 200x. I could believe
200ms to account for some overhead of the aggregations machinery, but
40000ms? And there is no pattern that I can tell as to when 200ms is
returned vs. 40000ms.

Given that Azure round-robins the queries, I can imagine that depending on
which nodes are involved, the query might take more or less time. In fact,
I would expect some variation.

The other piece of data is that in trying to debug this I restarted ES on
some of the nodes. By the time I had restarted the third node, the
query/filter + all aggregations case returned 200ms consistently.

My question is how it is possible for an empty filter + aggregations to
result in a 40000ms time. I tried the same filter with only the terms
aggregation (not the sums); the result was in the 3500-4000ms range, in
case that matters.

Hopefully this makes sense to someone. I’m pulling my hair out and my
colleagues on our internal ES alias are stumped as well.

Thanks for any help,

Craig.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/7f16f9e6-52e7-4d5c-854a-a7bd409e2040%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

You said you restarted a node, and since then the effect is gone; is that
true? Such high spikes, like a 40-second response, might indicate a problem
in the hardware (disk failure, network latency or timeout, DNS). Is this
possible?

Jörg



Hardware problems in Azure? Hmm. I suppose so :-)

In this case I restarted just ES (i.e., shut down ES using
http://localhost:9200/_cluster/nodes/_local/_shutdown and then ran
elasticsearch.bat again), and thus didn't reboot the VM or anything so
dramatic. All of the ES node files remained in place and the node
reconnected to the cluster with no issue. My RDP session remained up. I
suspect that if this were a hardware issue, a) we would have seen it long
ago, and b) it would have persisted after I restarted ES.

Related: we actually have two clusters of the same size with the same data
that have exhibited the same behavior. The cluster in which I restarted
three ES nodes remains responsive. The other cluster, in which I have not
(yet) restarted nodes, still shows the highly variable response times
(200-40000ms, with 34415ms being a recent one).

Thanks for replying.

Craig.


Is it possible that the JVM is doing large GCs?

Peter


Hi Peter,

No, old-generation or long GCs are not showing up here. I use bigdesk among
other tools and also inspect the logs. Bigdesk shows a very regular
sawtooth pattern, nearly always young GCs of less than 1s. The logs show
the same, of course. It is very rare that we ever get a GC of the old
generation, much less a long one.

As a side note, that's one of the reasons why I switched to G1. When we
were on ES 1.1.2 using G1, we would get an old-generation GC of over a
minute when we had lots of incremental snapshots, and occasionally after a
lot of merges. With ES 1.3.2 I don't have enough experience yet with those
cases, but they have been rare to date.

Craig.


If you are sure the spikes are caused by the JVM, I recommend attaching a
profiler to the JVM; then you can monitor the code.

At the JVM level it is hard to trace queries, so maybe you want to test out
the bleeding edge? Here is a query profiler:
https://github.com/elasticsearch/elasticsearch/pull/6699

Jörg


Just saw that the query profiler cannot show what the shard execution times
are, so maybe it is not a big help.

Jörg


Some new data. When I said that we upgraded to 1.3.2 recently: the cluster
which exhibited this behavior had recently been loaded with a snapshot from
the old 1.1.2 cluster. More importantly, I loaded that same 1.1.2 snapshot
into a small 1.3.2 cluster (just two nodes) and have been trying to
reproduce the "restart a node and it worked just fine" situation. That
hasn't worked out. I also tried segment merging and clearing the caches,
to no avail.

I then installed JProfiler and captured an instrumented trace of the slow
query. In this small two-node cluster, the whole query appears to execute
on one node.

Here is the interesting part: according to the trace, 87% of the time was
spent in 13 calls to ValuesSource$Metadata.load. There are three fields
involved and five shards and so one would expect this to be 15 calls, but
the cache in AggregationContext.numericField appears to be working a little
bit.

However, the issue appears to be that the global ordinals for the three
fields were recreated many times; the core issue here seems to be that the
ordinals are not saved and reused. Could this be true? Are the ordinals in
1.1.2 different from those in 1.3.2? If so, when is the new version created
and saved?
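If the cost really is rebuilding global ordinals, one mitigation worth trying is ES's eager fielddata loading (available since 1.2, to my knowledge), which builds global ordinals at refresh time rather than lazily on the first aggregation. A sketch of the mapping change, reusing the field name from the example query:

```python
import json

# Hypothetical mapping tweak: ask ES (1.2+) to build global ordinals for
# this field eagerly on refresh, so the first aggregation after a refresh
# doesn't pay the rebuild cost inline.
update = {
    "properties": {
        "stringfield1": {
            "type": "string",
            "index": "not_analyzed",
            "fielddata": {"loading": "eager_global_ordinals"},
        }
    }
}

# PUT http://<host>:9200/<index>/_mapping/<type> with this body
print(json.dumps(update))
```

This trades refresh-time work for predictable query latency; it does not explain why the ordinals are rebuilt so often in the first place.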

Does this ring a bell for anyone? Who would have some time to look at the
trace? Send me an email directly (craigwi@microsoft.com) if you want the
trace file.

Craig.


Problem solved. "indices.fielddata.cache.size: 2GB" doesn't mean "use 2GB
for caching"; it means "use up to 134MB (2GB / 16) for each of the 16
segments within the cache". Given the particular combination of queries,
shard/node placement, and the eviction pattern, I can explain all of the
perf results I have documented.

As a short-term solution I made the "size" 4.5GB, but that still means only
about 300MB per segment. The right solution is to fix Guava to allow
segments to be overcommitted and then do an async eviction across segments
to ensure the overall size remains close to the target.

Craig.


If this is the real explanation, can this be better explained in the
documentation? It seems I ran into the same problem:
https://groups.google.com/forum/#!topic/elasticsearch/2wJG5L9A8cs
This is very unintuitive behavior.


Yes, it looks like you ran into the same problem. I proposed that the
documentation be fixed in
https://groups.google.com/forum/#!topic/elasticsearch/0sSYSFfmmXM, and that
should be done for old versions as well. I have coded a fix in Guava as
described earlier in this thread and on another thread.

I'm hopeful ES 1.3.3 could include this fix. I'm open to advice on how best
to proceed.

Craig.
