Why do we need field-data?

An inverted index maps terms to documents, whereas field-data (doc values) maps documents to terms. In the inverted index for a field, the keys are the unique values of that field in the index, whereas in field-data the keys are document IDs. During aggregations we create buckets based on field values that we don't know beforehand.

My question: isn't there a function like getKeys() that returns all the keys of the inverted index, i.e. the unique values of a field in that index? We could then use each value to look up its entry in the inverted index, traverse the list of documents mapped to that key, and update the buckets.

I know this is not how it happens today, and I'm guessing there is a good reason for that. I'm interested in knowing that reason.


That's pretty much it. To use a book analogy - if you want to know what's on pages 1, 5 and 7 of a book you turn to pages 1, 5 and 7 directly and read them. You don't go to the index at the back of the book, scan the alphabetic list of all words and for each word scan their list of page-mentions to see if they include 1, 5 or 7.

Mark, this makes sense for multi-level aggregations: the first level divides documents into buckets, and within the second-level buckets an index keyed by document ID is much cheaper. But how does field-data make a difference for single-level aggregations? Let's say we had 3 docs with a single field X that can take two unique values, A and B. The inverted index would look like,

A-> 1,3
B-> 2

Field-data will look like,

1-> A
2-> B
3-> A

Now if I do an aggregation, I have to traverse all 3 documents to retrieve the values from field-data, and it similarly takes 3 operations to get the same result from the inverted index.
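
The 3-document example can be written out as a small Python sketch. This is only a toy model using plain dictionaries, not how Lucene or Elasticsearch store either structure, and the function names are made up for illustration. It counts how many entries each approach visits when the aggregation covers every document in the index.

```python
from collections import Counter

# Inverted index for field X from the example: term -> doc IDs.
inverted_index = {"A": [1, 3], "B": [2]}

# Field-data is the same information keyed the other way: doc ID -> term.
field_data = {}
for term, postings in inverted_index.items():
    for doc_id in postings:
        field_data[doc_id] = term


def agg_from_inverted_index(index):
    """Count docs per term by walking every postings list (hypothetical helper)."""
    counts = Counter()
    visited = 0
    for term, postings in index.items():
        for _doc_id in postings:
            counts[term] += 1
            visited += 1
    return counts, visited


def agg_from_field_data(values_by_doc):
    """Count docs per term by reading each doc's value (hypothetical helper)."""
    counts = Counter()
    visited = 0
    for doc_id in sorted(values_by_doc):
        counts[values_by_doc[doc_id]] += 1
        visited += 1
    return counts, visited


inverted_result = agg_from_inverted_index(inverted_index)
field_data_result = agg_from_field_data(field_data)
```

In this match-everything case both helpers return the same buckets after visiting 3 entries, which is the situation described in the question.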

The most common use case is that a user searches for "smartphone" in the field text and then sorts the matching docs by the doc values in the field popularity, or groups them by ranges of doc values in the field price or by the values in the field manufacturer.

We use the inverted index to quickly find the list of doc IDs that contain "smartphone" but then use doc values (previously using fielddata) to quickly retrieve those price and popularity values for the thousands of documents that match.
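
The search-then-aggregate flow can be sketched in Python as follows. The documents, manufacturer names and helper functions are all hypothetical, and plain dictionaries stand in for the real on-disk structures. The point is only to compare how much each approach has to read once a query has narrowed the index down to a subset of docs.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: doc ID -> fields.
docs = {
    1: {"text": "smartphone", "manufacturer": "Acme"},
    2: {"text": "laptop", "manufacturer": "Globex"},
    3: {"text": "smartphone", "manufacturer": "Acme"},
    4: {"text": "smartphone", "manufacturer": "Initech"},
    5: {"text": "tablet", "manufacturer": "Globex"},
}


def build_inverted_index(field):
    """term -> sorted list of doc IDs containing that term in `field`."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        index[docs[doc_id][field]].append(doc_id)
    return dict(index)


text_index = build_inverted_index("text")
manufacturer_index = build_inverted_index("manufacturer")

# Doc values for manufacturer: doc ID -> value.
manufacturer_doc_values = {d: f["manufacturer"] for d, f in docs.items()}

# Step 1: the inverted index answers the search quickly.
matching = text_index["smartphone"]


def agg_with_doc_values(matching_ids):
    """One direct lookup per matching doc."""
    counts = Counter(manufacturer_doc_values[d] for d in matching_ids)
    return counts, len(matching_ids)


def agg_with_inverted_index(matching_ids):
    """Scan every term's postings and keep only the matching docs."""
    wanted = set(matching_ids)
    counts = Counter()
    scanned = 0
    for term, postings in manufacturer_index.items():
        for doc_id in postings:
            scanned += 1
            if doc_id in wanted:
                counts[term] += 1
    return counts, scanned


by_doc_values, lookups = agg_with_doc_values(matching)
by_inverted, scanned = agg_with_inverted_index(matching)
```

Both helpers produce the same buckets, but the doc-values version does work proportional to the number of matching docs, while the inverted-index version in this sketch reads every posting of the manufacturer field regardless of how few docs matched.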
