Following up on a StackExchange conversation, I was wondering how, and how successfully, one could rebuild data from an index that does not store the _source data.
The ES documentation on disabling the _source field warns about using this technique. However, it gives no information - or warning - about the fact that data could be restored from the index alone.
So I was wondering, what would it require for an attacker to get some usable data out of the index? And what exactly would they get a hold of?
If you're talking about an attacker then it sounds like you're asking about this from a security perspective. The short answer is that you should treat disabling _source as having effectively zero security benefits. For starters, Elasticsearch still stores the source temporarily, but even after it's gone you can still determine quite a lot about the contents of each document simply by running searches and seeing which documents match which queries.
The precise details of what can be reconstructed from a source-free index depend on your mappings, but these things are super-subtle and may vary silently between versions. I don't think it is wise to build a security policy in this way.
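To make the "run searches and see which documents match" point concrete, here is a minimal sketch in plain Python. It is not Elasticsearch code: `build_index`, `search` and `probe` are hypothetical names, and the dictionary is only a toy stand-in for an index that keeps terms but no source. It assumes the attacker can issue term queries and see matching document IDs, nothing more.

```python
from collections import defaultdict

def build_index(docs):
    """Toy inverted index: term -> set of doc ids. The text itself is not kept."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, term):
    """Return only the ids of matching documents, never their content."""
    return index.get(term.lower(), set())

def probe(index, wordlist):
    """Query every candidate word and record which documents matched it."""
    leaked = defaultdict(set)
    for word in wordlist:
        for doc_id in search(index, word):
            leaked[doc_id].add(word)
    return dict(leaked)

# Hypothetical documents and attacker wordlist, for illustration only.
docs = {
    1: "patient diagnosed with diabetes",
    2: "invoice overdue since march",
}
index = build_index(docs)
wordlist = ["diabetes", "invoice", "overdue", "cancer", "patient"]
leaked = probe(index, wordlist)
print(leaked)
```

Even though no document text is ever returned, the attacker ends up with a per-document set of terms, and how much leaks is limited mainly by the size of the wordlist they are willing to try.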
This is indeed written from a security point of view. We are building a product where - at this point - everything is encrypted at rest in a traditional SQL database.
However our customers are requesting (full text) search as a feature. We want to offer this to the customer as an option, but we also want to be clear on the trade-offs when they enable said feature.
If I understand your comment correctly, an attacker who has access to the index would be able to tell which “terms” were used in which documents. However, they would not be able to reconstruct a full text blob?
I’m assuming that in the case of short text blobs - like an email - an attacker with access to the API would not be able to easily extract all known emails, but an attacker with access to the index files themselves could take a shot at that. Is that a correct assumption?
Would it be correct to say that “in general” the longer the text blob is, the harder it will be to fully reconstruct it?
"Can fully reconstruct the source" is an extremely weak security target: you can leak an awful lot of sensitive information just by knowing which terms are in which docs. In any case, full text search typically requires you to track more than just which terms are in which docs. For instance, phrase searches will be able to identify sequences of terms, and highlighting will tell you where in the docs the matching terms are.
I'll say it again: I don't think it is wise to build a security policy in this way.
Hey David, thanks for the continued replies on this thread, I really do appreciate them.
If I understood correctly, highlighting is actually disabled when _source is disabled (per the documentation here), or is there another type of highlighting besides "on the fly highlighting"?
Your reference to "phrase searches" opened up a new world to me. The world of "token graphs", to be exact. As far as I understand, there is no (default) way to disable this (and hence to disable phrase searches), and thus it stands to reason that a committed attacker could rebuild entire phrases, paragraphs or documents based on this information (minus stop words and whatever else the token filters remove). Do I understand that correctly? This is a very important point for us.
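As a rough illustration of why term positions matter, here is a small Python sketch. It assumes a toy positional index (term -> document -> list of positions), which is conceptually what phrase queries rely on but is not Lucene's actual on-disk format. The names `build_positional_index` and `reconstruct` are made up for this example, and the "analyzer" is just lowercasing plus whitespace splitting.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Toy positional index: term -> {doc_id: [positions]}. No source is stored."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def reconstruct(index, doc_id):
    """Rebuild one document's token sequence from the postings alone."""
    slots = {}
    for term, postings in index.items():
        # .get avoids creating empty entries for documents lacking this term
        for pos in postings.get(doc_id, []):
            slots[pos] = term
    return " ".join(slots[p] for p in sorted(slots))

# Hypothetical documents, for illustration only.
docs = {
    1: "Please wire the funds to account 12345",
    2: "Lunch at noon",
}
index = build_positional_index(docs)
rebuilt = reconstruct(index, 1)
print(rebuilt)
```

In this toy setup someone who can read the postings directly gets back the analyzed token stream in order. Anything the analyzer dropped or normalized (case here; possibly stop words or stemming in a real mapping) is lost, but the remainder can still be very readable.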
It's clear to me that this increases the overall attack surface of our system. However, customers should be properly informed about the trade-offs of the features they desire. It is our responsibility to explain this as clearly as possible so that they can make an informed decision, and so that we can tailor the overall application features (e.g. disabling search on sensitive documents) and operational procedures (who can access the index, how it is monitored, limiting the maximum length of a query) accordingly.