The following is my use case: storing data in Elasticsearch for workspace search across different data sources.
Text from each file is chunked, and each chunk is stored in a separate document that is used for both vector and keyword search.
However, each file has a set of allowed users and allowed groups who can access it. Users can belong to groups, and search results must respect that access. I want to support both keyword and semantic search.
I want to avoid duplicating permissions on every text chunk.
What's the best way to index this data so that filtering is also easy? I want to filter the data as part of the query rather than applying a separate pre-filter or post-filter.
For access control, each document can have a list of allowed users, a list of allowed groups, or be accessible to all users.
Given millions of files and their permissions to index (for example, an organisation's Google Drive accessed through service accounts), what would be the ideal data storage strategy for optimal search?
Should a single index strategy be used or multi index with terms lookup?
I'd suggest a single index for your document data.
I'd also suggest not storing each chunk in a separate document, but instead store each chunk as a different value in the same array field on the same document. This way, you won't need to duplicate the permissions metadata on each chunk.
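As a rough illustration of that document shape, here is a minimal Python sketch. The field names (`file_id`, `chunks`, `allowed_users`, `allowed_groups`, `is_public`) and the helper `build_file_document` are hypothetical and not taken from this thread; the point is only that the permissions appear once per file rather than once per chunk.

```python
def build_file_document(file_id, chunks, allowed_users, allowed_groups, is_public=False):
    """Return one document per source file, with permissions stored once.

    All field names here are hypothetical; adapt them to your own mapping.
    """
    return {
        "file_id": file_id,
        "chunks": list(chunks),  # one array value per chunk
        "allowed_users": sorted(set(allowed_users)),
        "allowed_groups": sorted(set(allowed_groups)),
        "is_public": is_public,
    }


doc = build_file_document(
    "file-1",
    ["first passage", "second passage", "third passage"],
    allowed_users=["alice"],
    allowed_groups=["engineering"],
)
```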
Then, you can set up an Elasticsearch Role Template for your index that filters results based on the permissions metadata.
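A sketch of what such a role body might look like, built as a Python dict. It assumes hypothetical fields `allowed_users` and `is_public` on each document and an index called `workspace-files`; the `{{_user.username}}` placeholder is the Mustache variable that role query templates substitute with the authenticated user's name. Group membership is left out here because it would need the groups to be available in the user's metadata, which depends on your realm setup.

```python
import json


def build_dls_role(index_name):
    """Build a role body whose templated query limits reads to visible documents."""
    visible_to_user = {
        "bool": {
            "should": [
                # Hypothetical field: usernames allowed to read the document.
                {"term": {"allowed_users": "{{_user.username}}"}},
                # Hypothetical field: documents readable by everyone.
                {"term": {"is_public": True}},
            ],
            "minimum_should_match": 1,
        }
    }
    return {
        "indices": [
            {
                "names": [index_name],
                "privileges": ["read"],
                "query": {"template": {"source": visible_to_user}},
            }
        ]
    }


role_body = build_dls_role("workspace-files")
print(json.dumps(role_body, indent=2))
```

The resulting body is what you would send to the create-role security API; check the exact template syntax against the documentation for your Elasticsearch version.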
In that case, perhaps you'd be better served indexing each chunk as a separate Elasticsearch document. Elasticsearch isn't intended to return the same _id across multiple hits in a single response.
I'm curious what your use case is. I could see the desire for this if you want to highlight which passages matched a given query, and these might span multiple chunks. However, highlighting isn't easy to do with vector search, short of highlighting the whole matched passage.
I'm doing some post-processing based on the chunks of a document, for example how far apart the relevant chunks from the same document are.
What about using a terms lookup against a separate permission index and then searching the content index? How effective will that be with millions of chunks and thousands of documents? Or is permission and metadata redundancy currently unavoidable?
Because for me, permission updates and content changes are two different events.
Since I don't control the number of chunks when content changes, I have to reindex all chunks.
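For reference, the terms-lookup idea raised above could look roughly like the following query body. This is a sketch under assumptions: a hypothetical `user-permissions` index holding one document per user whose `file_ids` field lists the files that user may read, and chunk documents carrying hypothetical `file_id` and `text` fields. The `index`, `id`, and `path` keys are the parameters of the terms query's lookup form.

```python
def build_terms_lookup_query(user_id, text, permissions_index="user-permissions"):
    """Search chunk text, keeping only chunks whose file_id the user may read."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"text": text}}],
                "filter": [
                    {
                        "terms": {
                            # Fetch the allowed values from another index
                            # instead of listing them inline.
                            "file_id": {
                                "index": permissions_index,
                                "id": user_id,
                                "path": "file_ids",
                            }
                        }
                    }
                ],
            }
        }
    }


body = build_terms_lookup_query("alice", "quarterly report")
```

One thing to check before relying on this at scale: the number of terms a terms query accepts is capped by the `index.max_terms_count` setting, so a user with access to a very large number of files may not fit in a single lookup document.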
Note that the difficult part with this is ensuring at search time that you can accurately associate a given search request's origination with the right set of permissions. This will be something you'll have to implement in your backend code.
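One way the backend piece could be sketched, assuming your code has already resolved the caller's username and group list (how it does so is up to your auth layer). The field names `allowed_users`, `allowed_groups`, `is_public`, and `text` are hypothetical.

```python
def build_access_filter(username, groups):
    """Return a bool clause matching documents the given user may see."""
    should = [
        {"term": {"is_public": True}},
        {"term": {"allowed_users": username}},
    ]
    if groups:
        should.append({"terms": {"allowed_groups": sorted(groups)}})
    return {"bool": {"should": should, "minimum_should_match": 1}}


def build_search_body(text, username, groups):
    """Combine the keyword query with the access filter in one request body."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"text": text}}],
                # Filter context: restricts hits without affecting scoring.
                "filter": [build_access_filter(username, groups)],
            }
        }
    }


body = build_search_body("quarterly report", "alice", ["engineering", "admins"])
```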
Actually, I think I misunderstood you. My mistake.
Store both in the same index at chunk level, which leads to duplicated permissions.
Each Elasticsearch document should have the metadata fields on it necessary to filter that document for DLS. I think that's what you mean by "permissions". So I'd actually recommend this approach.
I understand that this will cause you to store identical permissions values on multiple Elasticsearch documents, because the permissions are associated with source documents and your use case requires storing each chunk of a source document as a separate Elasticsearch document.
I don't think there's really a way around this, without changing your requirements. As I'd originally stated, it's typically better to put all of the chunks for a source document in one Elasticsearch document. You may want to re-evaluate if you really need/want to display a different hit for every single passage that matches. This is definitely an uncommon UX.
For example, if your dataset was "Books" and a query of "Alice" was issued, do you really want thousands of hits from "Alice In Wonderland"? Or do you want just one hit that previews a few of the relevant passages?
Yes. My use case requires that if, say, the top passages are passage 1 and passage 3, then I also need passage 2.
So the final suggestion is to have one index with duplicates, right?
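That gap-filling step can be sketched in Python as below, assuming each chunk document stores its position within the source file (a hypothetical `chunk_index` field) and that "also need passage 2" means fetching every passage between the first and last match.

```python
def chunks_to_fetch(matched_positions):
    """Return every chunk position from the first match to the last, inclusive."""
    if not matched_positions:
        return []
    return list(range(min(matched_positions), max(matched_positions) + 1))


def missing_chunks(matched_positions):
    """Return the positions inside that range that the search did not return."""
    matched = set(matched_positions)
    return [p for p in chunks_to_fetch(matched_positions) if p not in matched]


print(missing_chunks([1, 3]))  # [2]
```

The missing positions could then be fetched in a second request, for example by filtering on the file identifier and those positions.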
Let's forget about the permission use case for now and simplify the use case.
Each document in ES is a passage chunked from a piece of text, and metadata is associated with that text.
Let's consider this:
My only concern is concurrent updates to both the content and the content metadata, which can be updated independently.
Whenever the content changes, since it's chunked and the number of passages can change, I always have to re-index all passages belonging to the same document, along with all the metadata for that content.
For a content metadata change, I only want to update the content metadata.
In concurrent situations, when using a message bus, this can lead to data inconsistency.
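To make the inconsistency concrete, here is a small in-memory simulation in Python. It makes no Elasticsearch calls; the dict `index`, the `file_id#position` ID scheme, and both helper functions are hypothetical stand-ins for the two event handlers described above.

```python
def reindex_content(index, file_id, new_chunks, metadata_snapshot):
    """Replace all chunk documents of a file, copying the given metadata onto each."""
    for doc_id in [d for d in index if d.startswith(file_id + "#")]:
        del index[doc_id]
    for position, text in enumerate(new_chunks):
        index[f"{file_id}#{position}"] = {
            "text": text,
            "metadata": dict(metadata_snapshot),
        }


def update_metadata(index, file_id, new_metadata):
    """Overwrite only the metadata on the chunk documents that currently exist."""
    for doc_id, doc in index.items():
        if doc_id.startswith(file_id + "#"):
            doc["metadata"] = dict(new_metadata)


index = {}
reindex_content(index, "file-1", ["old text"], {"owner": "alice"})

# The content handler picks up its event and snapshots the metadata it sees.
snapshot = {"owner": "alice"}
# The metadata handler runs in between and updates the existing chunks.
update_metadata(index, "file-1", {"owner": "bob"})
# The content handler then finishes and writes new chunks with its stale snapshot.
reindex_content(index, "file-1", ["new text, part 1", "new text, part 2"], snapshot)

stale = [doc_id for doc_id, doc in index.items() if doc["metadata"]["owner"] != "bob"]
print(stale)  # ['file-1#0', 'file-1#1']
```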