Circuit breaker while using sampler?

Hey there,

I've got an Elasticsearch cluster with three nodes (each 28 GB of RAM with ~14 GB for the heap), running version 2.3.4.

I'm trying to run a significant_terms aggregation against a full-text field with a significant amount of data (more than Tweets, less than Wikipedia).

Unsurprisingly, this consistently fails when run bare, giving a circuit-breaker exception.

To work on resolving this, I tried wrapping my significant_terms aggregation in a sampler, but it's been giving me the very same error.

Copied and pasted from the docs (although I removed the field parameter), I ran the following query:

{
    "query": {
        "match": {
            "text": "iphone"
        }
    },
    "aggs": {
        "sample": {
            "sampler": {
                "shard_size": 200
            },
            "aggs": {
                "keywords": {
                    "significant_terms": {
                        "field": "text"
                    }
                }
            }
        }
    }
}

This fails consistently with exactly the same error as it did without the sampler. I tried varying shard_size (down to 1 or 2, even), but it always complains:

[fielddata] Data too large, data for [text] would be larger than limit of [8718817689/8.1gb]

Am I misunderstanding how the sampler aggregation works, or is there something additional I need to do?

Thanks,
Matthew

Sampling is about focusing on only the highest-quality matched docs; when used with significant_terms, it reduces the number of background frequency checks done on terms, to save time.
The memory issue you are facing is likely the cost of loading all text into fielddata before running significant_terms or sampler. This is just how aggregations work in general. I think significant_terms + sampler could usefully have an option to load the text of only the top matching docs on the fly for each request, rather than relying on fielddata loading all values up front for all queries.
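To make the distinction concrete, here is a toy Python sketch of the two loading strategies described above: fielddata builds entries for every value of every doc in the shard up front, while the hypothetical per-request alternative would touch only the top sampled docs. The corpus and functions are invented for illustration; this is not how Elasticsearch stores data internally.

```python
# Toy corpus standing in for a shard: doc id -> analyzed tokens.
# (Illustration only; not Elasticsearch's actual internal layout.)
docs = {
    1: ["iphone", "battery", "case"],
    2: ["iphone", "screen"],
    3: ["android", "battery"],
    4: ["iphone", "charger", "cable"],
}

def fielddata_entries(docs):
    """Fielddata-style loading: every value for every doc in the shard,
    built up front regardless of which docs the query matched."""
    return sum(len(tokens) for tokens in docs.values())

def sampled_entries(docs, matched_ids, shard_size):
    """Hypothetical alternative: fetch tokens only for the top
    shard_size matched docs, on the fly, per request."""
    return sum(len(docs[i]) for i in list(matched_ids)[:shard_size])

print(fielddata_entries(docs))               # all 10 values loaded
print(sampled_entries(docs, [1, 2, 4], 2))   # only 5 values, for the 2 sampled docs
```

This is why shrinking shard_size doesn't help here: the sampler limits which docs feed the aggregation, but the fielddata cache for the whole field is still built first.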

Ah, interesting. Thanks for the response! Okay, that makes sense. I guess I just never noticed it outside of those aggregations, because few of the others make sense to run against full-text. Fair enough.

Is there a good work-around for this sort of thing? Is there another sort of aggregation that would help avoid this, or does fielddata have to be computed at the start, regardless of the aggs hierarchy? I'm just thinking out loud--and I haven't tried this--but I'm wondering, for example, whether a filter aggregation with even something as rudimentary as an ids query would let me run it.

My business case here is just to find the most commonly used words in this field, for documents that match a given query. I'd like to throw significant_terms at it, since I think that would actually be more useful than a plain terms aggregation, but either way, I need a result that tells me "'bicycle' shows up in 20% of these documents, 'air' shows up in 15%, 'dirt' shows up in 2%, 'the' is in 99%," etc. Accuracy is important as always, but it's not overwhelmingly critical; just being able to communicate "a lot of these documents talk about 'bicycle'" would be a good start.

Can you think of another way to get that data out?

I could just scroll through all the query's results and manually tokenize them application-side, but clearly that's not an awesome solution, since Elasticsearch has already done that processing. That brings me to the other question I posted, which would help me do this, but again, it's not an awesome solution, since I'd be performing an aggregation application-side--lots of wasted bandwidth and time there.
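For what it's worth, the application-side fallback described above can be sketched in a few lines of Python. The regex tokenizer here is a naive stand-in for whatever analyzer the field actually uses (one could feed each doc through Elasticsearch's _analyze API instead to match the real tokenization); doc_frequencies and the sample texts are made up for illustration.

```python
import re
from collections import Counter

def doc_frequencies(texts):
    """For each token, the fraction of documents containing it.
    Naive lowercase regex tokenization stands in for the real analyzer."""
    df = Counter()
    for text in texts:
        tokens = set(re.findall(r"[a-z']+", text.lower()))  # dedupe per doc
        df.update(tokens)
    total = len(texts)
    return {term: count / total for term, count in df.items()}

# Pretend these are the _source texts scrolled back for a matching query.
matched = [
    "the bicycle hit the dirt",
    "the bicycle needs air",
    "the air was cold",
    "the shop fixed the bicycle",
]
freqs = doc_frequencies(matched)
print(round(freqs["bicycle"], 2))  # 0.75 -> "75% of these docs mention bicycle"
print(round(freqs["the"], 2))      # 1.0
```

As noted, this means shipping every matched doc over the wire, so it only scales as far as the result set does.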

I did have a PR for this some time ago [1], some of which got reworked into the sampler agg, but the idea of loading individual doc content rather than accessing fielddata (which covers all docs) was not picked up.
We may yet return to this issue.

[1] significant_terms agg new sampling option. by markharwood · Pull Request #6796 · elastic/elasticsearch · GitHub