I could not find this information in either Elasticsearch: The Definitive Guide or the Elasticsearch Reference, so I am asking the community.
I would like to learn how Elasticsearch performs aggregations (in particular, calculating the top N most frequently occurring terms), given that its segments are Lucene indexes. Does it use the Lucene Facets API when creating Lucene documents? Which Lucene queries does Elasticsearch run when calculating aggregations?
I have a local Elasticsearch node running with a debugger attached in IntelliJ, so any hints on where to look in the source code would be helpful. Any kind of explanation is also welcome.
Lucene has had join queries since version 4.x [1], and Elasticsearch
abstracts the details of using those queries behind its own joining
queries, such as Nested and Parent/Child.
Aggregations are implemented with custom code within Elasticsearch and do
not use the Facets API. I believe that Solr, like Elasticsearch, also does
not use the Facets API, but my knowledge of Solr is several versions old. I
have not explored the aggregations code in detail, but I assume that
Elasticsearch leverages the Lucene doc_values/fielddata APIs and does not
use queries to calculate the aggregations.
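To make that assumption concrete, here is a minimal sketch of how a top-N terms count could work over doc-values-style data. It is plain Java and not Elasticsearch or Lucene code: the class TopTermsSketch, the method topTerms, and the array-based inputs are all hypothetical stand-ins. The idea is that each distinct term in a segment has an integer ordinal, each matching document maps to an ordinal, and counting is just incrementing an array slot per document, with no query per term.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only; not taken from the Elasticsearch source.
public class TopTermsSketch {

    /**
     * Returns the n most frequent terms, most frequent first.
     *
     * @param dictionary  sorted unique terms of one segment (ordinal -> term)
     * @param docOrdinals for each matching document, the ordinal of its term,
     *                    or -1 if the document has no value for the field
     */
    public static List<String> topTerms(String[] dictionary, int[] docOrdinals, int n) {
        // One counter per ordinal; a single pass over the matching documents.
        long[] counts = new long[dictionary.length];
        for (int ord : docOrdinals) {
            if (ord >= 0) {
                counts[ord]++;
            }
        }

        // Bounded min-heap: the head is always the weakest candidate.
        // On equal counts the higher ordinal is treated as weaker, so ties
        // are resolved in favour of the term that sorts first.
        PriorityQueue<Integer> heap = new PriorityQueue<>((a, b) ->
                counts[a] != counts[b]
                        ? Long.compare(counts[a], counts[b])
                        : Integer.compare(b, a));
        for (int ord = 0; ord < counts.length; ord++) {
            if (counts[ord] == 0) {
                continue; // term never occurred in the matching documents
            }
            heap.offer(ord);
            if (heap.size() > n) {
                heap.poll(); // evict the weakest candidate
            }
        }

        // Drain weakest-first, then reverse to get most frequent first.
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add(dictionary[heap.poll()]);
        }
        Collections.reverse(result);
        return result;
    }
}
```

For example, with the dictionary {"apple", "banana", "cherry"} and document ordinals {1, 0, 1, 2, 1, 0, -1}, calling topTerms with n = 2 returns banana (3 documents) followed by apple (2 documents). In a real cluster the per-shard results would additionally have to be merged on the coordinating node, which this sketch leaves out.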
I had not heard of the Lucene join queries before; that is an interesting topic for me to investigate on its own.
From what I have found recently, no taxonomy directories are created for the Lucene indexes, which suggests that the Facets API is not used by Elasticsearch.