I am reading an article about out-of-memory errors during aggregation (for example, the articles below), and I do not quite understand why there is an out-of-memory issue if we do a filter + terms aggregation. My feeling is that after the filter, the subset of data is small, and a terms aggregation on that data set should not use too much memory. It looks like the out-of-memory error is caused by Elasticsearch loading unnecessary data into memory during the terms aggregation. If so, what is the unnecessary data, and how do doc values help here? Thanks.
Suppose I index all students in an index and filter by a single student ID (student IDs are unique). Do you mean that even in this case, where the single student ID is a simple filter, Elasticsearch needs to load much more data into memory? If so, what is that extra data?
I mean that if you have millions or billions of records, even after filtering them down you can still have a lot of data, which can overwhelm the heap.
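To make the question concrete, here is a minimal sketch (not Elasticsearch code; the student/class data and helper names are made up for illustration) of what a filter + terms aggregation conceptually computes. The counting step itself only touches the matching documents, which is why the filtered aggregation *feels* like it should be cheap:

```python
from collections import Counter

# Hypothetical toy documents: each has a "class" field we aggregate on.
docs = {
    0: {"student_id": "s1", "class": "math"},
    1: {"student_id": "s2", "class": "math"},
    2: {"student_id": "s3", "class": "physics"},
    3: {"student_id": "s4", "class": "chemistry"},
}

def filtered_terms_agg(docs, field, predicate):
    """Terms aggregation over only the docs matching the filter.
    The counting itself is cheap: memory is proportional to the
    number of distinct terms among the *matching* docs."""
    counts = Counter()
    for doc in docs.values():
        if predicate(doc):
            counts[doc[field]] += 1
    return dict(counts)

# Filter down to one student: the counting touches a single doc.
print(filtered_terms_agg(docs, "class", lambda d: d["student_id"] == "s1"))
```

The heap pressure with in-memory fielddata comes not from this counting step, but from first building a per-document view of the field for the whole index, regardless of the filter.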
But how do doc values help here? Do you mean the case where, even after the filter, we still have millions or billions of records? In my use case there are not many records after the filter, but I am doing a terms aggregation on the results, and the term vocabulary size may be around 100k. Your advice is highly appreciated.
If we want to do a terms aggregation, do we need to store the term field on disk as doc values? Or do we need to store other fielddata on disk as doc values as well?
In the document you referred to, it is mentioned that "in-memory fielddata has to be built on the fly at search time by uninverting the inverted index". What does "uninverting the inverted index" mean? Could you show an example of what it means in the context of a terms aggregation?
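My understanding of "uninverting", sketched with toy data (the field name and values are invented for illustration): the inverted index maps term → documents, but an aggregation needs the opposite lookup, document → term. Building that reverse mapping at search time means walking every posting list, i.e. every document in the segment, no matter how selective the filter is, and holding the result on the heap. On-disk doc values are essentially this same document → term column, precomputed at index time.

```python
# Toy inverted index for a "class" field: term -> sorted doc IDs.
inverted_index = {
    "chemistry": [3],
    "math": [0, 1],
    "physics": [2],
}

def uninvert(inverted_index):
    """Rebuild the doc -> term mapping ("fielddata") from the
    term -> docs mapping. Note we must walk EVERY posting list,
    covering every document in the segment, regardless of any
    filter. This is the work (and heap memory) that in-memory
    fielddata requires and that doc values avoid by storing the
    per-document column on disk at index time."""
    doc_to_term = {}
    for term, doc_ids in inverted_index.items():
        for doc_id in doc_ids:
            doc_to_term[doc_id] = term
    return doc_to_term

fielddata = uninvert(inverted_index)
# A terms aggregation over the filtered doc IDs can now do a
# column-style lookup, e.g. fielddata[0] for doc 0.
print(fielddata)
```

So even when the filter matches one student, the fielddata build touches the whole segment; that whole-index column is the "unnecessary data" asked about above.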