Highly variable query performance with ES 1.3.2 (filter + aggregations)

Hello,

Our team recently upgraded from ES 1.1.2 to 1.3.2 and we are happy with the
improvements ... except for one perplexing situation.

We are running on Azure worker roles with Oracle Java 1.8u11 and the G1 GC.
It is possible this is due to G1, but please consider all of the data below
before reaching for a pat response about G1.

Our cluster has 18 nodes, 3 of which are dedicated masters. We have three
indexes, each with 5 shards and one replica. The primary index is about 30GB
total (5.9GB per shard, and the shards are all the same size). We have five
types in the main index, each with about 10 fields: a mix of strings,
dates, bools, and longs. None of the strings are analyzed.
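For concreteness, a mapping along these lines would look roughly as below. The field names are my own invention; only the shape (not-analyzed strings plus longs, dates, and bools) is taken from the description above.

```python
import json

# Hypothetical sketch of one of the five types' mappings: ~10 fields,
# strings not analyzed, plus longs/dates/bools. Names are placeholders.
mapping = {
    "type1": {
        "properties": {
            "stringfield1": {"type": "string", "index": "not_analyzed"},
            "longfield1":   {"type": "long"},
            "longfield2":   {"type": "long"},
            "created":      {"type": "date"},
            "active":       {"type": "boolean"},
        }
    }
}

print(json.dumps(mapping, indent=2, sort_keys=True))
```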

All of the 18 nodes are client nodes and Azure is set up to round robin
requests. We have considered creating dedicated client nodes, but haven't
done so yet.

The query I have been using is a combination of a non-trivial filter, a
terms aggregation and two sum aggregations nested beneath the terms
aggregation:

{ "query": { "filtered": { "filter": { "bool": { … } } } },
  "aggs": { "name1": { "terms": { "field": "stringfield1" },
      "aggs": { "sum1": { "sum": { "field": "longfield1" } },
                "sum2": { "sum": { "field": "longfield2" } } } } } }

I have run the tests on the cluster when it was lightly loaded (some
indexing plus lightweight metrics queries) and when there was no load. I’ll
be the first to admit I could be more systematic, but the results I have are
consistent enough, and hard enough to explain, that I wanted to write to
this community.

The primary test uses a filter which always results in an empty set. The
filter contains two must terms, one must range, and three must_not terms.
Since I only care about the aggregation results, this is a search_type=count
query.
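As a concrete sketch, this is roughly the request body involved. The actual bool clauses are elided above, so the field names and values here are placeholders I made up; only the two-term/one-range must, three-term must_not shape and the search_type=count parameter come from the description.

```python
import json

# Placeholder reconstruction of the test query. Real field names and
# values are not in the post; the clause shape matches the description.
body = {
    "query": {"filtered": {"filter": {"bool": {
        "must": [
            {"term": {"stringfield2": "somevalue"}},
            {"term": {"stringfield3": "othervalue"}},
            {"range": {"created": {"gte": "2014-09-01"}}},
        ],
        "must_not": [
            {"term": {"stringfield4": "a"}},
            {"term": {"stringfield4": "b"}},
            {"term": {"stringfield4": "c"}},
        ],
    }}}},
    "aggs": {"name1": {
        "terms": {"field": "stringfield1"},
        "aggs": {
            "sum1": {"sum": {"field": "longfield1"}},
            "sum2": {"sum": {"field": "longfield2"}},
        },
    }},
}

# POSTed to http://<host>:9200/<index>/_search?search_type=count
# so only aggregation results (no hits) come back.
print(json.dumps(body))
```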

If I run the query/filter without the aggregations, the time taken
(results.took from ES) is ~0 (sometimes as high as 15ms). That makes
sense.

The case that doesn’t make sense is this: I run the same filter on the same
cluster under the same conditions, this time WITH the aggregations, and I
get anywhere from 200ms to 40000ms. Yes, a factor of 200x. I could believe
200ms to account for some overhead of the aggregations machinery, but
40000ms? And there is no pattern that I can tell as to when 200ms is
returned vs. 40000ms.

Given that Azure round-robins the queries, I can imagine that depending on
which nodes are involved, the query might take more or less time. In fact,
I would expect some variation.

The other piece of data is that in trying to debug this I restarted ES on
some of the nodes. By the time I had restarted the third node, the
query/filter + all aggregations case returned 200ms consistently.

My question is how it is possible for an empty filter + aggregations to
result in a 40000ms time. I tried the same filter with only the terms
aggregation (not the sums); the result was in the 3500-4000ms range, in
case that matters.

Hopefully this makes sense to someone. I’m pulling my hair out and my
colleagues on our internal ES alias are stumped as well.

Thanks for any help,

Craig.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/7f16f9e6-52e7-4d5c-854a-a7bd409e2040%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

You said you restarted a node, and since then the effect is gone; is that
true? Such high spikes, like a 40-second response, might indicate a problem
in the hardware (disk failure, network latency or timeout, DNS). Is this
possible?

Jörg



Hardware problems in Azure? Hmm. I suppose so :-)

In this case I restarted just ES (i.e., shut down ES using
http://localhost:9200/_cluster/nodes/_local/_shutdown and then ran
elasticsearch.bat again), and thus didn't reboot the VM or anything so
dramatic. All of the ES node files remained in place and the node
reconnected to the cluster with no issue. My RDP session remained up. I
suspect that if this were a hardware issue, a) we would have seen it long
ago, and b) it would have persisted after I restarted ES.

Related: we actually have two clusters of the same size with the same data
that have exhibited the same behavior. The cluster in which I restarted
three ES nodes remains responsive. The other cluster, in which I have not
(yet) restarted nodes, still shows the highly variable response times
(200-40000ms, with 34415ms being a recent one).

Thanks for replying.

Craig.


Is it possible that the JVM is doing large GCs?

Peter


Hi Peter,

No, old-generation or long GCs are not showing up here. I use bigdesk among
other tools and also inspect the logs. Bigdesk shows a very regular
sawtooth pattern, nearly always young GCs of less than 1s. The logs show
the same, of course. It is very rare that we ever get a GC of the old
generation, much less a long one.

As a side note, that's one of the reasons why I switched to G1. When we
were on ES 1.1.2 using G1, we would get an old-generation GC of over a
minute when we had lots of incremental snapshots, and occasionally after a
lot of merges. With ES 1.3.2 I don't have enough experience yet with those
cases, but they have been rare to date.

Craig.


If you are sure the spikes are caused by the JVM, I recommend attaching a
profiler to the JVM; then you can monitor the code.

At the JVM level it is hard to trace queries, so maybe you want to test out
the bleeding edge? Here is a query profiler:
https://github.com/elasticsearch/elasticsearch/pull/6699

Jörg


Just saw that the query profiler cannot show what the shard execution times
are, so maybe it is not a big help.

Jörg


Some new data. When I said that we upgraded to 1.3.2 recently: the cluster
which exhibited this behavior had recently been loaded with a snapshot from
the old 1.1.2 cluster. More importantly, I loaded that same 1.1.2 snapshot
into a small 1.3.2 cluster (just two nodes) and have been trying to
reproduce the "restart a node and it worked just fine" situation. That
hasn't worked out. I also tried segment merging and clearing the caches,
to no avail.

I then installed JProfiler and captured an instrumented trace of the slow
query. In this small two-node cluster, the whole query appears to execute
on one node.

Here is the interesting part: according to the trace, 87% of the time was
spent in 13 calls to ValuesSource$Metadata.load. There are three fields
involved and five shards and so one would expect this to be 15 calls, but
the cache in AggregationContext.numericField appears to be working a little
bit.

However, the issue appears to be that the global ordinals for the three
fields were recreated many times; the core issue here seems to be that the
ordinals are not saved and reused. Could this be true? Are the ordinals in
1.1.2 different from those in 1.3.2? If so, when is the new version created
and saved?
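If the cost really is rebuilding global ordinals, one mitigation worth trying is ES's eager fielddata loading (available since 1.2, to my knowledge), which builds global ordinals at refresh time rather than lazily on the first aggregation. A sketch of the mapping change, reusing the field name from the example query:

```python
import json

# Hypothetical mapping tweak: ask ES (1.2+) to build global ordinals for
# this field eagerly on refresh, so the first aggregation after a refresh
# doesn't pay the rebuild cost inline.
update = {
    "properties": {
        "stringfield1": {
            "type": "string",
            "index": "not_analyzed",
            "fielddata": {"loading": "eager_global_ordinals"},
        }
    }
}

# PUT http://<host>:9200/<index>/_mapping/<type> with this body
print(json.dumps(update))
```

This trades refresh-time work for predictable query latency; it does not explain why the ordinals are rebuilt so often in the first place.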

Does this ring a bell for anyone? Who would have some time to look at the
trace? Send me an email directly (craigwi@microsoft.com) if you want the
trace file.

Craig.


Problem solved. "indices.fielddata.cache.size: 2GB" doesn't mean "use 2GB
for caching"; it means "use up to 134MB (2GB / 16) for each of the 16
segments within the cache". Given the particular combination of queries,
shard/node placement, and the eviction pattern, I can explain all of the
perf results I have documented.

As a short-term solution I made the "size" 4.5GB, but that still means only
about 300MB per segment. The right solution is to fix Guava to allow
segments to be overcommitted and then do an async eviction across segments
to ensure the overall size remains close to the target.

Craig.


If this is the real explanation, can this be better explained in the
documentation? It seems I ran into the same problem:
https://groups.google.com/forum/#!topic/elasticsearch/2wJG5L9A8cs
This is very unintuitive behavior.


Yes, it looks like you ran into the same problem. I proposed that the
documentation be fixed in
https://groups.google.com/forum/#!topic/elasticsearch/0sSYSFfmmXM, and that
should be done for old versions as well. I have coded a fix in Guava as
described earlier in this thread and on another thread.

I'm hopeful ES 1.3.3 could include this fix. I'm open to advice on how best
to proceed.

Craig.
