Accessing the inverted index

ftr · October 17, 2019, 8:43am

I am currently researching the contents of an index on an old 1.7.1 ES cluster in connection with GDPR. In other words, I need to know exactly what can be retrieved from the cluster.

I have come up against a text field that is analyzed with the default tokenization and so on, but is excluded from the _source field. However, one can still search this text field via query_string queries.

My understanding (which may be incorrect or incomplete) is that in the above setup the inverted index is created, but the field value itself is thrown away rather than stored explicitly with the document, i.e. in the _source field. So, for example, if I have two inputs:
{'mytext': 'hello world'}
{'mytext': 'hello mars'}
I would (given the above configuration for "mytext") end up with an inverted index containing:
{'hello': [doc1, doc2], 'world': [doc1], 'mars': [doc2]}
but neither document would actually contain the text "hello world" or "hello mars" respectively.
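
The example above can be sketched in a few lines of Python. This is only an illustration of the data structure, not of how Lucene stores it: the helper name build_inverted_index is made up for this sketch, and lowercasing plus whitespace splitting is a rough stand-in for the default analyzer.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each token to the sorted ids of the documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Rough stand-in for the default analyzer: lowercase, split on whitespace.
        for token in text.lower().split():
            postings[token].add(doc_id)
    return {token: sorted(ids) for token, ids in postings.items()}


docs = {"doc1": "hello world", "doc2": "hello mars"}
index = build_inverted_index(docs)
print(index)
```

Note that the index keeps only which documents contain which tokens; the original strings are not part of it.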

My question, then, is this: Can I retrieve the inverted index directly from Elasticsearch, and thus rebuild the documents (at least to some degree)? That is, can I learn that "hello" and "world" were part of the original first document, and "hello" and "mars" were part of the second, even if I cannot necessarily guess the order of the tokens?
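
Assuming the postings were available as a plain mapping from term to document ids, the rebuild asked about here amounts to inverting that mapping. A minimal sketch (terms_per_document is a hypothetical helper, and the input is the toy index from the example, not real cluster output):

```python
def terms_per_document(inverted_index):
    """Invert term -> doc ids into doc id -> set of terms."""
    per_doc = {}
    for term, doc_ids in inverted_index.items():
        for doc_id in doc_ids:
            per_doc.setdefault(doc_id, set()).add(term)
    return per_doc


inverted_index = {"hello": ["doc1", "doc2"], "world": ["doc1"], "mars": ["doc2"]}
recovered = terms_per_document(inverted_index)
print(recovered)
```

The result is a set of terms per document, so token order and repetition are not recovered by this approach.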

DavidTurner · October 17, 2019, 10:32am

Can I retrieve the inverted index directly

Yes, if you are sufficiently motivated you can indeed do this.

ftr · October 17, 2019, 1:56pm

How exactly? I tried using the termvector API on the relevant documents, but the responses did not seem to contain anything, and of course the inverted index only makes sense at the per-index level rather than the per-document level.

Is there an Elasticsearch API that can be used to get the inverted index? Or am I left with looking directly at the Lucene data?

DavidTurner · October 17, 2019, 2:00pm

In the example you gave, if you searched for the conjunction of the terms hello and world you would only get the first document as a result, whereas hello and mars would only yield the second document.
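
As a rough sketch of that conjunction: the first helper builds a query_string request body joining the terms with AND, and the second simulates the same lookup against the toy postings by intersecting the document id lists. Both helper names are made up for illustration, and the field name mytext is taken from the example above.

```python
def conjunction_query(field, terms):
    """Request body for a query_string search requiring every term."""
    return {
        "query": {
            "query_string": {
                "default_field": field,
                "query": " AND ".join(terms),
            }
        }
    }


def match_all_terms(inverted_index, terms):
    """Local simulation: ids of documents that contain every given term."""
    id_sets = [set(inverted_index.get(term, [])) for term in terms]
    return sorted(set.intersection(*id_sets)) if id_sets else []


inverted_index = {"hello": ["doc1", "doc2"], "world": ["doc1"], "mars": ["doc2"]}
body = conjunction_query("mytext", ["hello", "world"])
first = match_all_terms(inverted_index, ["hello", "world"])
second = match_all_terms(inverted_index, ["hello", "mars"])
print(body, first, second)
```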

ftr · October 17, 2019, 2:08pm

Indeed, but that requires prior knowledge of the contents of the documents. Suppose (as in my real-world case) that at least some of the original documents are no longer available. My original question is: can I recreate the documents from the inverted index alone? I guess I can, but I would need a way of accessing the inverted index directly, e.g.
http://localhost:9200/myindex/_FANCY_API_TO_GET_REVERSE_INDEX.

Is there such a way?

DavidTurner · October 17, 2019, 3:09pm

If you run a terms aggregation on the field (having enabled field data) then this will enumerate all the terms that were indexed. Then you can run a query for each term to find out which documents hold each term. From this it's not too hard to work out which terms are in each document. Is this what you mean?
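
The two steps described here could be sketched roughly as follows. Everything below is an assumption layered on the description: the helper names, the aggregation name all_terms, and the search callable (a stand-in for whatever sends a request body to the index's _search endpoint and returns the parsed JSON) are all hypothetical. The inner "size": 0 is assumed to mean "return every term" on 1.x clusters, and a real run would need to page through hits instead of relying on a fixed size.

```python
def terms_aggregation_body(field):
    """Step 1: enumerate every indexed term of the field."""
    return {
        "size": 0,  # no hits wanted, only the aggregation
        # Assumption: on 1.x, "size": 0 inside the terms aggregation is unbounded.
        "aggs": {"all_terms": {"terms": {"field": field, "size": 0}}},
    }


def term_query_body(field, term, size=1000):
    """Step 2: find the documents holding one term (size is an arbitrary cap)."""
    return {"query": {"term": {field: term}}, "_source": False, "size": size}


def rebuild_documents(search, field):
    """Combine both steps into doc id -> set of terms."""
    response = search(terms_aggregation_body(field))
    buckets = response["aggregations"]["all_terms"]["buckets"]
    per_doc = {}
    for bucket in buckets:
        hits = search(term_query_body(field, bucket["key"]))["hits"]["hits"]
        for hit in hits:
            per_doc.setdefault(hit["_id"], set()).add(bucket["key"])
    return per_doc


def make_stub_search(inverted_index, field):
    """In-memory stand-in for the cluster, used only to exercise the sketch."""
    def search(body):
        if "aggs" in body:
            buckets = [{"key": term, "doc_count": len(ids)}
                       for term, ids in inverted_index.items()]
            return {"aggregations": {"all_terms": {"buckets": buckets}}}
        term = body["query"]["term"][field]
        return {"hits": {"hits": [{"_id": d} for d in inverted_index.get(term, [])]}}
    return search


stub = make_stub_search(
    {"hello": ["doc1", "doc2"], "world": ["doc1"], "mars": ["doc2"]}, "mytext")
rebuilt = rebuild_documents(stub, "mytext")
print(rebuilt)
```

Passing the search function in as a parameter keeps the reconstruction logic independent of the HTTP client, so it can be tried against a stub before being pointed at a real cluster.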
