Difference between aggregating on an analyzed text field (using field data) and aggregating on a high-cardinality non-analyzed field

Hi,

We are seeing performance problems with our Elasticsearch cluster.
Our setup is running ES 6.6 with 6 nodes in total, of which 2 are master nodes and 4 are data nodes. Each data node has 32GB of memory, with 16GB allocated to the heap. We have one index spread over these 4 data nodes with a total of 10 primary shards and 10 replicas.
Total doc count is just below 1 billion at a size of 186GB (not counting replicas).
A large share of the heap is taken up by field data, and the CPU is constantly busy rebuilding global ordinals, which can be seen in the hot_threads API.

We have a product in which we display a wordcloud. This wordcloud is currently built using a terms aggregation on an analyzed text field, meaning we are using fielddata:true on this field.
This text field contains people's free-text answers to different questions given to them, so the field contains sentences that vary in both length and content.

Now, every guide, article or forum post I've read strongly discourages this, due to the large amount of field data hogging the heap.
My initial plan was therefore to simply turn this field into a keyword field and split the text into words on our side before indexing into Elasticsearch, basically doing what Elasticsearch does when analyzing a field. That way we would no longer need fielddata:true.
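
A minimal sketch of what that client-side splitting could look like, assuming a simple lowercase-and-split-on-word-characters rule (only a rough approximation of what an analyzer does, e.g. "don't" becomes "don" and "t"). The field names answer_text and answer_terms are hypothetical:

```python
import re


def tokenize_answer(text):
    """Lowercase the text and return its unique word tokens, in first-seen order."""
    tokens = re.findall(r"\w+", text.lower())
    # A terms aggregation counts documents, not occurrences, so repeating
    # a word within one document adds nothing; de-duplicate to save space.
    return list(dict.fromkeys(tokens))


def build_doc(answer_text):
    """Build a document body with the raw text plus an array of keyword terms.

    "answer_text" and "answer_terms" are hypothetical field names;
    "answer_terms" is assumed to be mapped as type keyword.
    """
    return {
        "answer_text": answer_text,
        "answer_terms": tokenize_answer(answer_text),
    }
```

The resulting dict would be sent as the document body when indexing; the wordcloud aggregation would then target the keyword array instead of the analyzed text field.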

However, the more I read about this, the more I get the feeling that this might not solve the underlying issues we are seeing: a huge heap and refreshes taking up to a minute (eager global ordinal rebuilds).
I'm getting the feeling the issue is more related to the high cardinality regardless of the field being declared as fielddata:true.

Am I correct in assuming that changing to a keyword field won't solve the actual issue, and if so, what are my options here?

Worth mentioning is that while the field is high cardinality, we never run this aggregation on the whole index, always on a subset of documents. Because of this I've thought about disabling eager global ordinals, on the assumption that global ordinals would then only be built for the relevant matching documents. Is that right?

I'd recommend taking a look at the significant_text aggregation, used in conjunction with the sampler agg.
It improves on terms aggs in several ways:

  • It doesn't rely on field data.
  • It surfaces terms that are unusually frequent in the matching documents; the most popular terms are often not the most interesting ones ("the" is always popular).
  • Near-duplicate noisy text can be filtered out.

Indexing with 2-word shingles is good for finding phrases too.
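
As a rough sketch, the request body for that combination could be built like this. The function name, the aggregation names "sample" and "keywords", and the default sizes are placeholders I picked for illustration; tune them for your data:

```python
def build_significant_text_request(field, filter_query, shard_size=200, size=50):
    """Build a search body that runs significant_text inside a sampler agg.

    field        -- name of the analyzed text field (no fielddata needed)
    filter_query -- query dict selecting the subset of documents of interest
    shard_size   -- max number of top-scoring docs sampled per shard
    size         -- number of significant terms to return
    """
    return {
        "size": 0,  # only the aggregation result is needed, no hits
        "query": filter_query,
        "aggs": {
            "sample": {
                "sampler": {"shard_size": shard_size},
                "aggs": {
                    "keywords": {
                        "significant_text": {
                            "field": field,
                            "size": size,
                            # skip near-duplicate passages of text
                            "filter_duplicate_text": True,
                        }
                    }
                },
            }
        },
    }
```

The sampler keeps the cost bounded because significant_text re-analyzes the text of the sampled documents on the fly rather than reading field data.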

Thanks for the reply!

I've taken a look at the significant_text agg, but I'm not sure I can cover our requirements with it as it stands today.

There are a few points I haven't fully understood while reading about field data and global ordinals, so if I may?

  • Is it correct that global ordinals are part of the field data that is reported back from _stats?

  • If I were to change the field with fielddata:true into a keyword field, would this lower the amount of field data occupying the heap? Or is it reasonable to assume that the global ordinals built from this field will occupy a similar amount of the heap anyway if I were to do the analysis of the field myself before adding it to the index (adding keyword terms)? Currently field data seems to occupy about 2/3 of our heap.

  • Where do doc_values, which are stored on disk, come into play here? Do they affect global ordinals in any way?

  • If I were to disable eager_global_ordinals, when they are later built during query time, will they be built for all documents or just for the ones matching my filter? My thinking here is that most docs are irrelevant when eagerly building ordinals because only a very small amount of docs will match the filter in the agg.

I realize that common words such as "the" (basically stop words) are not relevant. However, we have other means of dealing with those before presenting the result to the user: users can customize the list of blacklisted words.

Really appreciate the help, I'm at a loss here.

I'd be interested to know why significant_text doesn't cover your requirements, as it potentially solves the issue of having to blacklist words manually.

If I were to change the field with fielddata:true into a keyword field, would this lower the amount of field data occupying the heap?

If you pre-tokenized the data into arrays of values indexed as keyword fields, that would shift the storage from on-heap (field data) to disk (doc values). Given that your queries are very selective, set the execution_hint to map when you use the terms aggregation, to avoid global ordinals.
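
A minimal sketch of such a request body, assuming the pre-tokenized words live in a keyword field; the function name and the aggregation name "wordcloud" are placeholders:

```python
def build_terms_request(field, filter_query, size=100):
    """Build a search body for a terms agg that avoids global ordinals.

    field        -- keyword field holding the pre-tokenized words
    filter_query -- query dict selecting the (small) subset of documents
    size         -- number of top terms to return
    """
    return {
        "size": 0,  # only the aggregation result is needed, no hits
        "query": {"bool": {"filter": [filter_query]}},
        "aggs": {
            "wordcloud": {
                "terms": {
                    "field": field,
                    "size": size,
                    # aggregate on the field values directly instead of
                    # going through global ordinals
                    "execution_hint": "map",
                }
            }
        },
    }
```

With map, memory use grows with the number of distinct terms in the matching documents, so it tends to pay off only when the filter matches a small share of the index.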