I have a few questions about running significant terms aggregations on an analyzed text field.
Caveats:
I'm aware of all the warnings on not running significant terms on analyzed fields, especially on larger indices.
I'm aware I can adjust the memory/circuit breaker limits.
I'm also aware I can set field data frequency filters.
Questions:
1) I have an index in ES 2.x where I was running significant terms on filtered subsets of the index on an analyzed text field with the default memory settings. It ran okay (not super fast, but it ran). I upgraded to ES 5.0 and reindexed the same dataset on the same hardware configuration. For the mapping, I set fielddata=true and eager_global_ordinals=true on the analyzed text field (a minimal mapping sketch appears after these questions). While reindexing, I was getting the "Field is too large" error, and again when I tried to run significant terms after reindexing finished. Why would I be getting this error in ES 5 at both index time and query time when I wasn't getting it in ES 2.x? Have some of the default memory-management settings changed in a way that would affect this? Did setting eager_global_ordinals=true prompt this error?
2) Is this type of memory error related to the size of the analyzed field per document, or to the overall size? That is, suppose I (hypothetically) get an error with 10,000 documents that average about 300 analyzed terms in the text field. Am I just as likely to get the error with 100,000 documents that average 30 analyzed terms? Assume I would try to run it on the whole index.
3) Suppose I managed to index the data successfully. Would something like the sampler aggregation paired with the significant terms aggregation reduce the likelihood of a memory error? Or would it not even matter, given that I'm getting the error at index time?
4) Suppose that instead of running the significant terms aggregation on the analyzed text field, I first used the termvectors API to get the analyzed terms, then stored the analyzed terms for each document in a separate tokens field, i.e. an array of analyzed tokens mapped presumably as a keyword field. Would running a significant terms aggregation on this tokens field instead be more memory friendly? The index would get larger, but would the memory footprint at query time be significantly smaller?
5) Is there development in the pipeline to make significant terms more memory friendly in some way?
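(For reference, here is a minimal sketch of the kind of mapping described in 1). The index name, mapping type and field name are hypothetical, and it talks to a local ES 5.x node over plain HTTP.)

```python
import requests

ES = "http://localhost:9200"   # hypothetical local ES 5.x node
INDEX = "articles"             # hypothetical index name

# Text field with fielddata kept on the heap and global ordinals built eagerly
# at refresh time, i.e. the setup described in question 1).
mapping = {
    "mappings": {
        "doc": {                           # hypothetical mapping type
            "properties": {
                "body": {
                    "type": "text",
                    "fielddata": True,
                    "eager_global_ordinals": True
                }
            }
        }
    }
}

print(requests.put("{}/{}".format(ES, INDEX), json=mapping).json())
```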
Just wanted to plug that the significant terms aggregation on an analyzed text field is super useful. I know the warning is always not to use it that way, but the potential use cases are compelling. For example, if a user wants to know how a set of documents differs from the rest of the index, they can use it to get distinctive words that help them understand context or craft better queries.
Significant terms actually started life as a free-text search-refinement feature before elasticsearch. Ported to work inside elasticsearch aggregations, the central significance algorithm has proven useful on structured data but can be costly on free text.
Currently the memory costs in significant terms come in two types:
a) Fixed cost - the fielddata required by the aggs framework for all docs
b) Variable cost - the set of terms produced by the docs that match a query
The sampler aggregation only helps reduce cost b) by narrowing the set of docs from all docs to just the top-matching docs. However, for relevance-ranked, sloppy queries such as the typical free-text query, sampling search results is not only much more efficient but produces measurably higher-quality suggestions [1]. It is good practice.
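For illustration, here is a minimal sketch of that pattern: significant_terms wrapped in a sampler agg so that only the top-matching docs per shard are considered. The index name, field name, query and shard_size are hypothetical choices, not recommendations.

```python
import requests

ES = "http://localhost:9200"   # hypothetical local node
INDEX = "articles"             # hypothetical index name

# The sampler narrows cost b) from "all matching docs" to the top-ranked docs
# per shard; significant_terms then runs only over that sample.
query = {
    "size": 0,
    "query": {"match": {"body": "memory pressure"}},   # example free-text query
    "aggs": {
        "sample": {
            "sampler": {"shard_size": 200},            # arbitrary sample size
            "aggs": {
                "keywords": {"significant_terms": {"field": "body"}}
            }
        }
    }
}

resp = requests.post("{}/{}/_search".format(ES, INDEX), json=query).json()
for bucket in resp["aggregations"]["sample"]["keywords"]["buckets"]:
    print(bucket["key"], bucket["score"])
```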
Memory cost a) arguably exists to support the sort of structured analytics the aggregations framework was designed for, where all documents are assumed to be potentially interesting, e.g. "show me all website traffic broken down by day and status code..." etc. If we are doing "unstructured analytics" and sampling, then it may make more sense to load the small sample of top-ranking texts on the fly for each query rather than relying on a heavy data structure holding all text for all docs.
This is certainly how significant_terms was implemented in its life before elasticsearch. Further, it had some additional free-text features that helped turn low-level index terms into more useful/readable outputs for humans. This involved:
Near-duplicate/boilerplate text removal from search results using a dedup filter
De-stemming/case detection (what is the most common form of representation of an indexed significant term in the results - e.g. "us" vs "US", "elect" vs "election"?)
Phrase detection - significant terms ["clinton", "never", "election", "trump", "us"] can be summarised more usefully as ["US election", "never Trump"]
Some of this stuff is in an old PR ("significant_terms agg new sampling option" by markharwood, elastic/elasticsearch#6796), but it was too bulky to bring in as a single change, and the sampler agg was one simplified output of that work. We probably need to re-think the above free-text analysis steps as processor components in a new pipeline ("free-text result analysis"?), but designed perhaps more like Lucene Analyzer assemblies than the mostly-structured Aggregations framework where sig_terms sits today. These are just notions at this stage.
Thanks! That is very helpful! Just wanted to follow up on the idea I mentioned in 4) about storing an array of tokens in addition to the free text and then running the significant terms on that field instead.
So if the free text is "The quick brown fox jumps." and we store an additional field of ["quick", "brown", "fox", "jump"] and run the aggregation on that field instead, my understanding is that the tokens field would be a keyword field, so the fixed cost you mention in a) would come from doc_values instead of fielddata. And my understanding is that doc_values cost less in terms of memory since they're stored on disk.
Would this approach potentially save on memory compared to aggregating on the free text field given the current framework?
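(To make the idea concrete, here is a rough sketch of that setup, with hypothetical index, type and field names: the original text field keeps its analyzer, the tokens field is a keyword field backed by doc_values, and significant_terms points at the tokens field instead.)

```python
import requests

ES = "http://localhost:9200"       # hypothetical local node
INDEX = "articles_tokens"          # hypothetical index name

mapping = {
    "mappings": {
        "doc": {
            "properties": {
                "body":   {"type": "text"},     # original free text, analyzed as usual
                "tokens": {"type": "keyword"}   # pre-analyzed unique tokens, doc_values on disk
            }
        }
    }
}
requests.put("{}/{}".format(ES, INDEX), json=mapping)

# e.g. body = "The quick brown fox jumps." with its analyzed tokens stored alongside it
doc = {"body": "The quick brown fox jumps.",
       "tokens": ["quick", "brown", "fox", "jump"]}
requests.put("{}/{}/doc/1?refresh=true".format(ES, INDEX), json=doc)

# significant_terms now aggregates on the keyword field rather than the text field
query = {
    "size": 0,
    "query": {"match": {"body": "fox"}},
    "aggs": {"keywords": {"significant_terms": {"field": "tokens"}}}
}
print(requests.post("{}/{}/_search".format(ES, INDEX), json=query).json())
```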
I've not tried it, but it looks like it would be a way of working around the current mapping restrictions to get disk-backed access to tokens rather than on-heap storage, albeit with some extra work in your client, extra disk space, etc.
Would be interested to hear how you get on with that approach!
Yes, at least for the time being this approach does avoid the "fielddata is too large" memory problem without needing to adjust any memory settings, albeit with more disk space and upfront work at ingest. We did some initial testing and it seems to run 2-4x "faster" (just runtime, not memory) using significant terms on the keyword field rather than on the analyzed text field, although it's still rather slow once the index gets bigger. We created the keyword field by hitting the mtermvectors API and then storing only the unique tokens for each document.
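For anyone who wants to try the same thing, here is a rough sketch of the ingest step described above, assuming hypothetical index/type/field names and the ES 5.x REST endpoints: mtermvectors pulls the analyzed terms for a batch of documents, and a partial update writes the unique tokens back into a keyword-mapped tokens field.

```python
import requests

ES = "http://localhost:9200"        # hypothetical local node
INDEX, TYPE = "articles", "doc"     # hypothetical index and mapping type
FIELD = "body"                      # the analyzed text field
doc_ids = ["1", "2", "3"]           # ids of docs to enrich (normally fed from a scroll)

# Pull term vectors for a batch of documents in one mtermvectors call.
body = {"ids": doc_ids, "parameters": {"fields": [FIELD], "term_statistics": False}}
resp = requests.post("{}/{}/{}/_mtermvectors".format(ES, INDEX, TYPE), json=body).json()

for doc in resp["docs"]:
    tv = doc.get("term_vectors", {}).get(FIELD, {})
    tokens = sorted(tv.get("terms", {}).keys())   # unique analyzed tokens for this doc
    # Write them back as the keyword-mapped "tokens" field via a partial update.
    requests.post("{}/{}/{}/{}/_update".format(ES, INDEX, TYPE, doc["_id"]),
                  json={"doc": {"tokens": tokens}})
```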