Significant terms - avoiding out of memory errors

The significant terms aggregation is a great feature that allows for some
really interesting data analysis. However, we quite often run into out of
memory errors: "CircuitBreakingException: Data too large, data would be
larger than limit".
That is not hard to understand, given the amount of data and the speed
requirements.

I think it would be interesting if it were possible to trade off speed for
deeper analysis: let significant terms, and possibly other aggregations,
run for as long as needed, so long as they eventually return some
(presumably correct) results.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/43e654ad-76c0-40a0-b718-0c99ec6de872%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Christoffer,

How much JVM heap are you giving ES, and how large are the sets?
According to
this Elasticsearch Platform — Find real-time answers at scale | Elastic
it looks like 1.4 will let you control the circuit breakers more through
configuration. However, depending on the size of your data set, I am
guessing you will still have to worry about how much you can allocate to
the ES heap, since that page seems to indicate that the circuit breakers
default to reasonably high percentages of the heap.
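
To make the heap arithmetic concrete, here is a minimal sketch in Python. It assumes the 1.4 setting names `indices.breaker.fielddata.limit` and `indices.breaker.request.limit` and the cluster update settings endpoint (`PUT /_cluster/settings`); the default percentages in the comment are from memory and should be checked against the documentation for your version. The helper names and the 8 GiB heap are hypothetical.

```python
import json

# Assumed 1.4 defaults (verify for your version):
# fielddata breaker 60% of heap, request breaker 40% of heap.
ASSUMED_DEFAULT_PCT = {"fielddata": 60, "request": 40}

def breaker_limit_bytes(heap_bytes, percent):
    """Bytes a breaker allows when its limit is a percentage of the JVM heap."""
    return heap_bytes * percent // 100

def breaker_settings(fielddata_pct, request_pct):
    """Body for PUT /_cluster/settings that adjusts both breaker limits."""
    return {
        "persistent": {
            "indices.breaker.fielddata.limit": "%d%%" % fielddata_pct,
            "indices.breaker.request.limit": "%d%%" % request_pct,
        }
    }

heap = 8 * 1024 ** 3  # hypothetical 8 GiB heap
for name, pct in sorted(ASSUMED_DEFAULT_PCT.items()):
    print(name, breaker_limit_bytes(heap, pct), "bytes")

settings_body = breaker_settings(fielddata_pct=50, request_pct=30)
print(json.dumps(settings_body, indent=2))
```

Raising a limit only moves the point at which the request is rejected; the data still has to fit in the heap, so the heap size remains the real bound.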

I am trying to look into the scalability characteristics of this feature
myself because it is interesting for some goals I have, but I don't see any
information about how it scales or what it is bound by. In my case I would
like to be able to analyse foreground sets of tens to hundreds of thousands
of documents against a background set of millions. Since I haven't found
anything documented, your numbers might give me an idea of whether my use
case is crazy or reasonable before I get some testing done with it.
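
For reference, a request of the kind discussed here might look like the sketch below, built as a plain Python dict. The query selects the foreground set and the index as a whole serves as the background. The field names `category` and `description`, the match value, and the helper name are all hypothetical.

```python
import json

def significant_terms_request(foreground_query, field, size=10):
    """Build a search body: the query selects the foreground set, and
    significant_terms compares term frequencies in it against the index."""
    return {
        "size": 0,  # ask for aggregation results only, no hits
        "query": foreground_query,
        "aggs": {
            "significant": {  # arbitrary aggregation name
                "significant_terms": {"field": field, "size": size}
            }
        },
    }

body = significant_terms_request(
    {"match": {"category": "burglary"}},  # hypothetical foreground query
    "description",                         # hypothetical field to analyse
)
print(json.dumps(body, indent=2))
```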

Kevin

(In reply to Christoffer Vig's post of Friday, September 5, 2014 3:19:13 AM UTC-5.)

There is a short-term and a longer-term option for this:

  1. Short-term - use "doc values" in your index mappings so that lookups
    hit disk instead of the ES fielddata cache, which is what causes the
    CircuitBreakingException (you are then more reliant on the OS
    file-system cache for speed)
  2. Longer-term - we're working on a sample-based option for significant
    terms [1]

[1] "significant_terms agg new sampling option", by markharwood, Pull Request #6796, elastic/elasticsearch on GitHub
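
As a rough illustration of the short-term option, the sketch below builds an index-creation body that enables doc values on one field. It assumes the 1.x mapping syntax, where `doc_values: true` is set per field and, as far as I know, is only available for not_analyzed string fields (and numeric/date fields), not for analyzed text. The type name `doc`, the field name `tags`, and the helper name are hypothetical.

```python
import json

def doc_values_mapping(type_name, field):
    """Index-creation body that stores `field` as on-disk doc values
    instead of loading it into the in-heap fielddata cache."""
    return {
        "mappings": {
            type_name: {
                "properties": {
                    field: {
                        "type": "string",
                        "index": "not_analyzed",  # assumed requirement for string doc values
                        "doc_values": True,
                    }
                }
            }
        }
    }

mapping_body = doc_values_mapping("doc", "tags")  # hypothetical names
print(json.dumps(mapping_body, indent=2))
```

Existing documents would need to be reindexed for a mapping change like this to take effect.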
