Why inverted index is not good at Aggregation?

https://www.elastic.co/guide/en/elasticsearch/reference/current/fielddata.html#fielddata
this section of docs say sort and aggregation is another pattern to search:

search with inverted index:
Which documents contain this term it is easy to understand to search term -> documents contains this term

sort and aggregate:
What is the value of this field for this document? so, the inverted index is not good at this.

My question is why sort and aggregate need to answer this question?

Thinking about the case below:

Inverted index:

Terms | doc_1 | doc_2 | doc_3
abc | x | |
bcd | | x |
cde | | | x

sort :Because Terms have been sorted, just scan the inverted index with rows and list the doc:
sort Terms result: doc_1, doc_2, doc_3

Inverted index

Terms | doc_1 | doc_2 | doc_3 |
abc | x | | x |
bcd | | x | |
cde | x | | x |

aggregate: scan the inverted index:
aggregate count Terms:
abc bucket: doc_1, doc_3 -> 2
bcd bucket: doc_2 -> 1
cde bucket: doc_1, doc_3 -> 2

In these cases , why sort and aggregate need to know the document id -> terms mapping ?

You typically search on one field and sort/aggregate on a different one.

Eg search text field for “iPad” but sort or aggregate on price field.

@Mark_Harwood

Thanks for your reply.

How about search this field and sort/aggregate this field?
if not search any field, just sort this text field or aggregate this text field , is it to say that sort/aggregate do not need to answer the doc id -> terms question, just scan the text field inverted index to implement sort and aggregate?

The inverted index is designed for searching rather than sorting.Even not search any field, Es will fetch all documents id from inverted index and sort documents by id->terms mapping(column store).

Technically, yes, the information you need is potentially already in the inverted index - however it does not give you the O(1) lookup of the doc values structure

@wangqinghuan

In this special case, I think the "fetch all documents id from inverted index and sort documents by id->terms mapping" is not necessary?

@Mark_Harwood

Overall, in this special case, the sort and aggregation is O(n)

sort process:
scan per terms get doc id, because terms is sorted, so sort the docs by terms can easily loop the terms and print the associated doc_id

aggregation:
same as sort, loop the terms and count of associated doc_id, we can get count of docs for aggregation of this filed term

I think , in this case, it is better than inverted index -> doc id -> access doc values to get terms -> bucket agg etc.

PRs with benchmarks always welcome :slight_smile:

1 Like

Each index contains multiple segments,where a segment is an inverted index. In your special case, you can sort documents by loop sorted term within one inverted index, though, how do you sort documents across mutiple segments?

@wangqinghuan

My context is in single one inverted index. about across multiple segments sort or aggregate, what is the elasticsearch's solution to this?
In my special case, just merge the two sorted results into fixed one result

There are other search engines which aggregate terms through inverted index see slide 42 of https://www.slideshare.net/lucenerevolution/seeley-solr-facetseurocon2011 Although it gains only when number of terms and segments is small.
Still not getting for to sort with inv index.

1 Like

@Mikhail_Khludnev

Fantastic! Thanks your sharing resource.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.