About the doc values feature and out-of-memory errors during terms aggregation

Hello Elastic Experts,

I have been reading articles about out-of-memory errors during aggregations (for example, the articles below), and I do not quite understand why an out-of-memory issue can occur when we do a filter plus a terms aggregation. My feeling is that after the filter the remaining data set is small, so a terms aggregation on it should not use much memory. It looks like the out-of-memory error is caused by Elasticsearch loading unnecessary data into memory during the terms aggregation. If so, what is the unnecessary data, and how do doc values help here? Thanks.

thanks in advance,
Lin

Even with a filter you can still try to pull too much data into memory.

Are you having issues?

Hi Mark,

Suppose I index all students in one index and filter by a single student ID (student IDs are unique). Do you mean that even with such a simple filter on a single student ID, Elasticsearch needs to load much more data into memory? If so, what is that extra data? :smile:

regards,
Lin

No.

I mean that if you have millions or billions of records, even after filtering those down you can still have a lot of data, which can overwhelm the heap.

Thanks Mark,

But how do doc values help here, for example if we still have millions or billions of records after the filter? In my use case there are not many records left after the filter, but I am doing a terms aggregation on the results, and the term vocabulary size may be around 100k. Your advice is highly appreciated.

regards,
Lin

Doc values are fielddata, but stored on disk instead of in the heap, which leaves you more heap space to do things like this.

Thanks Mark,

If we want to do a terms aggregation, do we need to store the field we aggregate on (the term field, as fielddata) on disk as doc values? Or do we need to put other fielddata on disk as doc values?

regards,
Lin

Take a look at https://www.elastic.co/guide/en/elasticsearch/guide/current/doc-values.html and https://www.elastic.co/guide/en/elasticsearch/guide/current/doc-values.html, it might explain it better.

Hi Mark,

Actually my question comes from the links. :smile:

For the terms aggregation scenario, what is the difference between storing the term field as doc values and not doing so?

BTW, you shared the same document twice. Did you mean to share two different ones?

regards,
Lin

Doc values live on disk.
Fielddata lives in the heap.

The former is more memory-efficient, since it keeps that data off the heap.

Thanks Mark,

But for the terms aggregation scenario, you mean storing the term field's data on disk?

regards,
Lin

Yes that is correct.
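
For illustration, here is a rough sketch of what that could look like, written as Python dicts that mirror the JSON request bodies. The type name `student`, the field names `student_id` and `course`, and the 2.x-era syntax (`string` with `not_analyzed`, `bool` with `filter`) are all assumptions made up for this example, so check them against the version you actually run.

```python
import json

# Hypothetical mapping: the "course" field is the one we aggregate on,
# so it is the field that gets doc values. Names and syntax are assumptions.
index_body = {
    "mappings": {
        "student": {
            "properties": {
                "student_id": {"type": "long"},
                "course": {
                    "type": "string",
                    "index": "not_analyzed",
                    "doc_values": True,
                },
            }
        }
    }
}

# Hypothetical search: filter on one student ID, then run a terms
# aggregation on the doc-values-backed "course" field.
search_body = {
    "size": 0,
    "query": {"bool": {"filter": {"term": {"student_id": 42}}}},
    "aggs": {"courses": {"terms": {"field": "course", "size": 10}}},
}

# Print the JSON that would be sent as the request bodies.
print(json.dumps(index_body, indent=2))
print(json.dumps(search_body, indent=2))
```

The point is only that `doc_values` is set on the field named in the `terms` aggregation, not on the field used in the filter.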

Thanks Mark,

The document you referred to says, "While in-memory fielddata has to be built on the fly at search time by uninverting the inverted index". What does uninverting the inverted index mean? Could you show an example of what it means in the context of a terms aggregation?

regards,
Lin
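
A toy illustration of "uninverting" may help here. This is a minimal sketch in plain Python, not Elasticsearch's actual implementation; the course names, document IDs, and the helper names `uninvert` and `terms_aggregation` are made up for the example. The inverted index maps each term to the documents containing it, which is good for search. A terms aggregation needs the opposite lookup, from a document to its terms, so the structure has to be flipped.

```python
from collections import Counter, defaultdict


def uninvert(inverted_index):
    """Flip term -> doc IDs into doc ID -> terms (a toy form of fielddata)."""
    doc_to_terms = defaultdict(list)
    # Every term and every document is visited here, whether or not
    # the document will later match the filter.
    for term, doc_ids in inverted_index.items():
        for doc_id in doc_ids:
            doc_to_terms[doc_id].append(term)
    return dict(doc_to_terms)


def terms_aggregation(doc_to_terms, matching_doc_ids):
    """Count terms over only the documents that passed the filter."""
    counts = Counter()
    for doc_id in matching_doc_ids:
        counts.update(doc_to_terms.get(doc_id, []))
    return counts


# Hypothetical inverted index for a "course" field.
inverted_index = {
    "math": [1, 2, 4],
    "physics": [2, 3],
    "history": [1, 3, 4],
}

doc_to_terms = uninvert(inverted_index)       # built for ALL documents
filtered_docs = [2, 4]                        # documents left after the filter
result = terms_aggregation(doc_to_terms, filtered_docs)
print(result)
```

Note that `uninvert` walks the whole field even though the filter keeps only two documents. That is the "unnecessary data" from the first post: in-heap fielddata is built for all documents, not just the ones matching the query. With doc values, the document-to-terms structure is written to disk at index time, so it does not have to be built on the heap at search time.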