I have billions of documents and am trying to do a Terms Aggregation on a string field. The field is not analyzed. However, when I try to perform the aggregation, it uses a significant amount of heap on my "aggregator" node, far more than should be required. I was able to perform similar aggregations on numeric fields with no issues.
Since it's not stored, it's my understanding that every time it looks at the field in question, it has to load the entire _source back into memory so that it can read through the values for that given field. Is that true here as well? Or will it use the global ordinals once, and only look up the value once per bucket? Likewise, if I mark the field as "store:true", should I see an increase in performance (and a decrease in heap usage)?
Unfortunately, this behavior is expected for string fields + field data. String fields use a lot of memory in pre-2.0 due to field data: strings themselves are largish (relative to numerics), and are often multi-valued, which requires a different data structure that is a little less compressed in memory.
Post 2.0 we've switched to doc_values by default. This is an on-disk, column store that requires drastically less memory. Doc_values run almost entirely "off-heap" (relying on OS cache), so you can drastically reduce your heap too.
You can use doc_values in <2.0 as well, but you have to enable it yourself.
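For example (index, type, and field names here are made up for illustration), a pre-2.0 mapping that enables doc_values on a not_analyzed string might look like:

```json
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "status": {
          "type": "string",
          "index": "not_analyzed",
          "doc_values": true
        }
      }
    }
  }
}
```

Note that doc_values only work on not_analyzed strings and numerics, and the setting has to be in place before documents are indexed (it changes the on-disk format).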
Finally, not_analyzed strings tend to behave better in field data than analyzed strings, since they usually have lower cardinality than things like shingles, ngrams, etc. But sometimes, as with UUIDs, user names, product codes, or IPv6 addresses, not_analyzed strings can have huge cardinality. Higher cardinality increases memory usage in field data.
When you set a field to not_analyzed, it is still indexed into the inverted index...it just doesn't go through the analysis phase. So when you execute an aggregation, the data is loaded from the inverted index, not the original source, so you don't have to worry about re-parsing the original JSON, touching disk n times for each doc, etc.
You're correct about the global ordinals, though: once the terms are loaded into memory from the inverted index, ordinals are generated for each segment, along with the global ordinal map, and those are used instead of the terms themselves. As you update/add documents, though, parts or all of the global ordinal map will need to be rebuilt.
Nope! But don't worry, it's a very common misconception.
Stored fields simply take the original _source value for that field and stick it in a new field. It doesn't affect the inverted index or aggregations at all; it simply saves the pre-analysis version of the field separately for retrieval.
Its main use case is when you have a 5 MB PDF and want to display just the title field to the user: you'd prefer not to parse the 5 MB JSON just to get the title. Storing it separately can save some disk I/O and parsing time.
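As a sketch (field names are made up), you'd mark the field as stored in the mapping and then request it via the `fields` parameter at search time, which avoids loading _source at all:

```json
PUT docs
{
  "mappings": {
    "doc": {
      "properties": {
        "title": { "type": "string", "store": true },
        "body":  { "type": "string" }
      }
    }
  }
}

GET docs/_search
{
  "fields": ["title"],
  "query": { "match_all": {} }
}
```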
I actually already switched to doc_values some time ago, because the buckets were so large they were overloading the heap on my data nodes and causing the entire server to come to a grinding halt! But based on your explanation of how "store" works, it doesn't sound like there would be much of an impact from my original question.
I've also specifically made all my string fields not_analyzed (by design; I genuinely don't want them analyzed). Fortunately, there aren't a lot of new ordinals being added, especially not in bulk.
In this case, I'm using a string because performing an aggregation on multiple numeric values is expensive: the composite key either has to be built on the fly (i.e., via a script) or copied into an intermediate string field (copy_to). Right now I have a single "long:long" string, when I could have multiple numeric keys instead. Is there any way to treat multiple numeric values as a "single" ordinal for a bucket aggregation?
Instead of a combined key, you could nest one terms aggregation inside another, one per numeric field. The advantage here is that you can leverage breadth_first collection to prune the aggregation tree aggressively and avoid building tuples on the fly. It'll require a bit more processing in your app (since the data is now split over two aggs instead of a single agg), but it could be faster and use less memory.
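A minimal sketch of the nested approach (field names `first_long` and `second_long` are hypothetical):

```json
GET my_index/_search
{
  "size": 0,
  "aggs": {
    "first_value": {
      "terms": {
        "field": "first_long",
        "collect_mode": "breadth_first"
      },
      "aggs": {
        "second_value": {
          "terms": { "field": "second_long" }
        }
      }
    }
  }
}
```

Each bucket of the outer agg then holds the inner buckets, so a ("first", "second") tuple is reconstructed in your app by walking the two levels.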
Alternatively, copy_to might be a legitimate strategy if this is a common operation. Adding that data straight into the index will help performance tremendously, for relatively little impact on storage requirements.
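Since the goal is a single combined key in the index, one hedged sketch (all names invented) is to precompute the "long:long" key in your app at index time and run a plain terms agg on it; note that copy_to itself would give you a multi-valued target field rather than a concatenated tuple, so app-side concatenation is the more direct route:

```json
PUT my_index/my_type/1
{
  "first_long": 42,
  "second_long": 7,
  "combined_key": "42:7"
}

GET my_index/_search
{
  "size": 0,
  "aggs": {
    "tuples": { "terms": { "field": "combined_key" } }
  }
}
```

With `combined_key` mapped as a not_analyzed string, each bucket key is the exact tuple, with no scripting at query time.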
Or, if you don't need to know the actual tuple itself (and can tolerate a potential collision), you could hash the tuple and index the hash, which will be faster because it can be stored as a long instead of a string.
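A sketch of that idea in app-side Python (the function name and hashing scheme are my own invention, not an Elasticsearch API): pack the two longs into bytes, hash them, and keep the first 8 bytes as a signed 64-bit long to index.

```python
import hashlib
import struct


def hash_tuple(a: int, b: int) -> int:
    """Hash a (long, long) tuple down to a single signed 64-bit long.

    Collisions are possible (though rare), so only use this when you
    don't need to recover the original tuple from the bucket key.
    """
    # Pack both values as big-endian signed 64-bit integers (16 bytes total).
    raw = struct.pack(">qq", a, b)
    # Hash the bytes; any stable hash works, SHA-1 shown here.
    digest = hashlib.sha1(raw).digest()
    # Reinterpret the first 8 bytes of the digest as a signed 64-bit long.
    return struct.unpack(">q", digest[:8])[0]
```

The resulting value fits Elasticsearch's `long` type, so buckets are built over cheap numeric ordinals instead of strings.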