I am currently auditing the contents of an index on an old Elasticsearch 1.7.1 cluster for GDPR-related reasons. In other words, I need to know exactly what can be retrieved from the cluster.
I have come across a text field that is analyzed with the default tokenization and so on, but excluded from the _source field. However, one can still search this text field via query_string queries.
My understanding (which may be incorrect or incomplete) is that in the above setup the inverted index is created, but the original field value is thrown away rather than stored explicitly with the document, i.e. in the _source field. So, for example, if I have two inputs:
{'mytext': 'hello world'}
{'mytext': 'hello mars'}
I would (given the above configuration for "mytext") end up with an inverted index containing:
{'hello': [doc1, doc2], 'world': [doc1], 'mars': [doc2]}
but neither document would actually contain the text "hello world" or "hello mars" respectively.
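To make that mental model concrete, here is a minimal Python sketch of how I picture the index being built. It is only an illustration and does not talk to Elasticsearch at all: the function name build_inverted_index and the doc1/doc2 identifiers are made up, and lowercasing plus whitespace splitting is just my rough stand-in for the default analyzer.

```python
from collections import defaultdict


def build_inverted_index(docs, field="mytext"):
    """Map each token to the list of document ids that contain it.

    Hypothetical helper: lowercasing and whitespace splitting are only
    a rough approximation of the default analyzer.
    """
    index = defaultdict(list)
    for doc_id, doc in docs.items():
        # Record each distinct token once per document.
        for token in sorted(set(doc[field].lower().split())):
            index[token].append(doc_id)
    return dict(index)


docs = {
    "doc1": {"mytext": "hello world"},
    "doc2": {"mytext": "hello mars"},
}
inverted = build_inverted_index(docs)
print(inverted)
```

Note that the original strings are not kept anywhere in the result, only the tokens and the ids of the documents they came from.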
My question, then, is this: Can I retrieve the inverted index directly from Elasticsearch, and thus rebuild the documents (at least to some degree)? That is, could I learn that "hello" and "world" were part of the original first document, and "hello" and "mars" were part of the second, even if I cannot necessarily determine the order of the tokens?
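To show the kind of reconstruction I have in mind, here is a small Python sketch that assumes I had somehow already obtained the token-to-documents mapping as a plain dict. Whether and how that mapping can be pulled out of the cluster is exactly what I am asking; the function name tokens_per_document and the document ids are hypothetical.

```python
def tokens_per_document(inverted_index):
    """Turn a token -> [doc ids] mapping into doc id -> set of tokens.

    Hypothetical helper: token order and repetition counts are lost,
    so this recovers only an unordered bag of distinct tokens.
    """
    recovered = {}
    for token, doc_ids in inverted_index.items():
        for doc_id in doc_ids:
            recovered.setdefault(doc_id, set()).add(token)
    return recovered


inverted = {"hello": ["doc1", "doc2"], "world": ["doc1"], "mars": ["doc2"]}
recovered = tokens_per_document(inverted)
print(recovered)
```

If something like this is possible, the field would effectively still be retrievable in tokenized form, which matters for the GDPR assessment.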