I have a couple of questions about running significant terms aggregations on an analyzed text field.
- I'm aware of all the warnings on not running significant terms on analyzed fields, especially on larger indices.
- I'm aware I can adjust the memory/circuit breaker limits.
- I'm also aware I can set field data frequency filters.
I have an index in ES 2.x where I was running significant terms on filtered subsets of the index, over an analyzed text field, with the default memory settings. It ran okay (not super fast, but it ran). I upgraded to ES 5.0 and reindexed the same dataset on the same hardware. In the mapping, I set fielddata=true and eager_global_ordinals=true on the analyzed text field. While reindexing, I was getting the "Field is too large" error, and I got it again when I tried to run significant terms after reindexing finished. Why would I be getting this error in ES 5 at both index time and query time when I wasn't getting it in ES 2.x? Have some of the default memory management settings changed in a way that would affect this? Did setting eager_global_ordinals=true trigger the error?
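Roughly, the mapping I'm using looks like this (index, type, and field names are placeholders):

```
PUT my_index
{
  "mappings": {
    "doc": {
      "properties": {
        "body": {
          "type": "text",
          "fielddata": true,
          "eager_global_ordinals": true
        }
      }
    }
  }
}
```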
Is this type of memory error related to the size of the analyzed field per document, or to the overall size? That is, suppose I (hypothetically) get the error with 10,000 documents that average about 300 analyzed terms each in the text field. Am I just as likely to get it with 100,000 documents that average 30 analyzed terms each, i.e. the same total number of terms? Assume I would run the aggregation over the whole index.
Suppose I somehow manage to get the data indexed. Would something like the sampler aggregation paired with the significant terms aggregation reduce the likelihood of a memory error? Or would that not matter, given that I'm already getting the error at index time? A sketch of what I have in mind is below.
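Something along these lines (the query, the shard_size value, and the names are just illustrative):

```
POST my_index/_search
{
  "size": 0,
  "query": {
    "match": { "body": "my filter query" }
  },
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 200
      },
      "aggs": {
        "keywords": {
          "significant_terms": { "field": "body" }
        }
      }
    }
  }
}
```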
Suppose that instead of running the significant terms aggregation on the analyzed text field, I first use the termvectors API to get the analyzed terms, and then store those terms for each document in a separate tokens field: an array of tokens mapped, presumably, as a keyword field. Would running a significant terms aggregation on this tokens field be more memory friendly? The index would get larger, but would the memory footprint at query time be significantly smaller?
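As a sketch of what I mean (the field names and the reindexing step are hypothetical):

```
# 1. Pull the analyzed tokens for a document out of the text field
GET my_index/doc/1/_termvectors
{
  "fields": ["body"]
}

# 2. Reindex each document with those tokens copied into a separate field,
#    e.g. "body_tokens": ["quick", "brown", "fox"], mapped as "keyword",
#    then aggregate on that field instead of the analyzed text field:
POST my_index/_search
{
  "size": 0,
  "query": {
    "match": { "body": "my filter query" }
  },
  "aggs": {
    "keywords": {
      "significant_terms": { "field": "body_tokens" }
    }
  }
}
```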
Is there any development in the pipeline to make significant terms more memory friendly?
Just wanted to plug that the significant terms aggregation on an analyzed text field is super useful. I know the warning is always not to use it that way, but the potential use cases are compelling. For example, if a user wants to know how a set of documents differs from the rest of the index, they can use it to get distinctive words that help them understand the context or craft better queries.