I am currently researching the contents of an index on an old ES 1.7.1 cluster for GDPR-related reasons. In other words, I need to know exactly what can be retrieved from the cluster.
I have come up against a text field which is analyzed with the default tokenization and so on, but excluded from the _source field. However, one can still search this text field via query_string queries.
My understanding (which may be incorrect or incomplete) is that in the above setup the inverted index is created, but the original field value is thrown away rather than included in the document explicitly, i.e. in the _source field. So, for example, if I have two inputs:
{'mytext': 'hello world'}
{'mytext': 'hello mars'}
I would (given the above configuration for "mytext") end up with an inverted index containing:
{'hello': [doc1, doc2], 'world': [doc1], 'mars': [doc2]}
but neither document would actually contain the text "hello world" or "hello mars" respectively.
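To make that mental model concrete, here is a minimal Python sketch of how such an inverted index could be built. It is only an illustration: it assumes plain lowercasing and whitespace splitting, which is a simplification of what the default analyzer does, and the name build_inverted_index is made up for this example, not part of any Elasticsearch API.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each token to the sorted ids of the documents containing it.

    Assumption: tokens are lowercased, whitespace-separated words. The real
    analyzer is more involved; this only mirrors the toy example above.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    # Only the term -> document ids mapping is kept; the original text is not.
    return {term: sorted(ids) for term, ids in index.items()}


docs = {"doc1": "hello world", "doc2": "hello mars"}
inverted = build_inverted_index(docs)
print(inverted)
# {'hello': ['doc1', 'doc2'], 'world': ['doc1'], 'mars': ['doc2']}
```

Note that the returned structure holds no trace of token order, which is why the original phrases cannot be read back from it directly.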
My question, then, is this: Can I retrieve the inverted index directly from Elasticsearch, and thus rebuild (at least to some degree) the documents, i.e. learn that "hello" and "world" were part of the original first document, and "hello" and "mars" were part of the second, even if I cannot necessarily guess the order of the tokens?
How exactly? I tried using the termvector API on the relevant documents, but the responses did not seem to contain anything, and of course the inverted index only makes sense at the index level, rather than at the document level.
Is there an Elasticsearch API that can be used to get the inverted index? Or am I left with looking directly at the Lucene data?
In the example you gave, if you searched for the conjunction of the terms hello and world you would only get the first document as a result, whereas hello and mars would only yield the second document.
Indeed, but that requires previous knowledge of the contents of the documents. Suppose (as in my real-world case) that at least some of the documents are no longer available. My original question is: can I, from the inverted index alone, recreate the documents? I guess I can, but I would need a way of accessing the inverted index directly, e.g. http://localhost:9200/myindex/_FANCY_API_TO_GET_REVERSE_INDEX.
If you run a terms aggregation on the field (having enabled field data) then this will enumerate all the terms that were indexed. Then you can run a query for each term to find out which documents hold each term. From this it's not too hard to work out which terms are in each document. Is this what you mean?
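The last step, turning "which documents hold each term" into "which terms are in each document", could be sketched in Python roughly as follows. This is a hedged illustration rather than a finished tool: rebuild_term_sets and docs_for_term are hypothetical names, and docs_for_term stands in for whatever per-term query you run against the cluster. Here it is faked with an in-memory dict so the sketch runs without a cluster.

```python
from collections import defaultdict


def rebuild_term_sets(terms, docs_for_term):
    """Invert a term -> document ids lookup into document id -> sorted terms.

    terms:         iterable of terms, e.g. the buckets of a terms aggregation.
    docs_for_term: hypothetical callable returning the ids of the documents
                   matching one term, e.g. a wrapper around a per-term query.
    """
    per_doc = defaultdict(set)
    for term in terms:
        for doc_id in docs_for_term(term):
            per_doc[doc_id].add(term)
    # Token order and repetition are lost; only the set of terms is recovered.
    return {doc_id: sorted(found) for doc_id, found in per_doc.items()}


# Stand-in for the cluster: the toy inverted index from the question.
fake_index = {"hello": ["doc1", "doc2"], "world": ["doc1"], "mars": ["doc2"]}
rebuilt = rebuild_term_sets(fake_index.keys(), lambda term: fake_index[term])
print(rebuilt)
# {'doc1': ['hello', 'world'], 'doc2': ['hello', 'mars']}
```

What comes back is the set of analyzed tokens per document, so anything the analyzer dropped or normalized (case, punctuation) and the original word order would not be recoverable this way.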