Why inverted index is not good at Aggregation?

hackerwin7 · August 22, 2019, 10:40am

https://www.elastic.co/guide/en/elasticsearch/reference/current/fielddata.html#fielddata
this section of docs say sort and aggregation is another pattern to search:

search with inverted index:
Which documents contain this term it is easy to understand to search term -> documents contains this term

sort and aggregate:
What is the value of this field for this document? so, the inverted index is not good at this.

My question is why sort and aggregate need to answer this question?

Thinking about the case below:

Inverted index:

Terms | doc_1 | doc_2 | doc_3
abc | x | |
bcd | | x |
cde | | | x

sort :Because Terms have been sorted, just scan the inverted index with rows and list the doc:
sort Terms result: doc_1, doc_2, doc_3

Inverted index

Terms | doc_1 | doc_2 | doc_3 |
abc | x | | x |
bcd | | x | |
cde | x | | x |

aggregate: scan the inverted index:
aggregate count Terms:
abc bucket: doc_1, doc_3 -> 2
bcd bucket: doc_2 -> 1
cde bucket: doc_1, doc_3 -> 2

In these cases , why sort and aggregate need to know the document id -> terms mapping ?

Mark_Harwood · August 22, 2019, 12:15pm

You typically search on one field and sort/aggregate on a different one.

Eg search text field for “iPad” but sort or aggregate on price field.

hackerwin7 · August 23, 2019, 7:46am

@Mark_Harwood

Thanks for your reply.

How about search this field and sort/aggregate this field?
if not search any field, just sort this text field or aggregate this text field , is it to say that sort/aggregate do not need to answer the doc id -> terms question, just scan the text field inverted index to implement sort and aggregate?

wangqinghuan · August 23, 2019, 8:14am

The inverted index is designed for searching rather than sorting.Even not search any field, Es will fetch all documents id from inverted index and sort documents by id->terms mapping(column store).

Mark_Harwood · August 23, 2019, 8:20am

Technically, yes, the information you need is potentially already in the inverted index - however it does not give you the O(1) lookup of the doc values structure

hackerwin7 · August 23, 2019, 8:21am

@wangqinghuan

In this special case, I think the "fetch all documents id from inverted index and sort documents by id->terms mapping" is not necessary?

hackerwin7 · August 23, 2019, 8:31am

@Mark_Harwood

Overall, in this special case, the sort and aggregation is O(n)

sort process:
scan per terms get doc id, because terms is sorted, so sort the docs by terms can easily loop the terms and print the associated doc_id

aggregation:
same as sort, loop the terms and count of associated doc_id, we can get count of docs for aggregation of this filed term

I think , in this case, it is better than inverted index -> doc id -> access doc values to get terms -> bucket agg etc.

Mark_Harwood · August 23, 2019, 8:34am

PRs with benchmarks always welcome

wangqinghuan · August 23, 2019, 8:47am

Each index contains multiple segments,where a segment is an inverted index. In your special case, you can sort documents by loop sorted term within one inverted index, though, how do you sort documents across mutiple segments?

hackerwin7 · August 23, 2019, 8:59am

@wangqinghuan

My context is in single one inverted index. about across multiple segments sort or aggregate, what is the elasticsearch's solution to this?
In my special case, just merge the two sorted results into fixed one result

Mikhail_Khludnev · August 23, 2019, 11:01am

There are other search engines which aggregate terms through inverted index see slide 42 of https://www.slideshare.net/lucenerevolution/seeley-solr-facetseurocon2011 Although it gains only when number of terms and segments is small.
Still not getting for to sort with inv index.

hackerwin7 · August 23, 2019, 11:21am

@Mikhail_Khludnev

Fantastic! Thanks your sharing resource.

system · September 20, 2019, 11:21am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Why do we need field-data? Elasticsearch	4	585	May 22, 2017
Iterating through large datasets in various orders efficiently Elasticsearch	4	141	July 9, 2024
Grouping search results in Elasticsearch Elasticsearch	4	2180	July 6, 2017
Solving aggregating/sorting on numerical fields Elasticsearch	2	376	February 3, 2021
Need inverted nested mapping Elasticsearch	1	211	December 8, 2021

Why inverted index is not good at Aggregation?

Related topics