Here is my use case: I want to store data in Elasticsearch for a workspace search that connects several data sources.
- Text from each file is chunked, and each chunk is stored in a separate document that is used for both vector and keyword search.
- Each file has a set of allowed users and allowed groups who can access it. Users can belong to groups, and search results must respect this access. I want to support both keyword and semantic search.
- I want to avoid duplicating the permissions on every text chunk.
- For access control, each document can have a list of allowed users, a list of allowed groups, or be accessible to all users.
What is the best way to index such data so that filtering is also easy? I want the filter to be applied as part of the query itself, rather than as a separate pre-filter or post-filter step.
Assume I have millions of files and their permissions to index, for example an organisation's Google Drive accessed through service accounts. What would be the ideal data storage strategy for optimal search?
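To make the access rules concrete, here is a minimal sketch of file-level permission fields and the filter a query would carry. The field names (`file_id`, `allowed_users`, `allowed_groups`, `is_public`) and the helper `build_access_filter` are hypothetical, chosen only for illustration. The bodies are plain Python dicts in Elasticsearch query DSL form, so nothing here depends on a particular client.

```python
from typing import Any

# Hypothetical file-level fields: permissions are stored once per file,
# not repeated on every text chunk.
PERMISSION_PROPERTIES: dict[str, Any] = {
    "file_id": {"type": "keyword"},
    "allowed_users": {"type": "keyword"},   # list of user ids
    "allowed_groups": {"type": "keyword"},  # list of group ids
    "is_public": {"type": "boolean"},       # True when every user may read the file
}


def build_access_filter(user_id: str, group_ids: list[str]) -> dict[str, Any]:
    """Return a bool filter matching files the given user is allowed to see.

    A file matches if it is public, lists the user directly,
    or lists at least one of the user's groups.
    """
    should: list[dict[str, Any]] = [
        {"term": {"is_public": True}},
        {"term": {"allowed_users": user_id}},
    ]
    if group_ids:
        # Skip the clause entirely when the user belongs to no group.
        should.append({"terms": {"allowed_groups": group_ids}})
    return {"bool": {"should": should, "minimum_should_match": 1}}
```

The same filter could be attached to both the keyword query and the vector query, so one definition of "who can see this file" serves both search modes.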
If I decide to store a single document per file with nested vector search, I also need to find the top-k passages regardless of which top-level document they belong to.
For example, if the top 2 passages come from the same document, both of those passages should be returned.
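A rough sketch of the single-document-per-file variant, under several assumptions: chunks live in a nested field named `passages` with a `dense_vector` subfield `vector`, and the Elasticsearch version in use supports nested kNN search with `inner_hits` inside the `knn` clause. The helper names and the `num_candidates` heuristic are hypothetical. The first function builds the search body; the second flattens the per-file inner hits into one passage ranking, so two passages from the same file can both appear.

```python
from typing import Any


def build_passage_knn_body(
    query_vector: list[float],
    access_filter: dict[str, Any],
    k: int = 10,
) -> dict[str, Any]:
    """Build a kNN search body over nested passage vectors (assumed field names)."""
    return {
        "knn": {
            "field": "passages.vector",
            "query_vector": query_vector,
            "k": k,
            # Assumed heuristic: oversample candidates, capped at 10000.
            "num_candidates": min(max(k * 10, 100), 10000),
            # Filter on file-level permission fields of the parent document.
            "filter": access_filter,
            # Ask for up to k matching passages per file, not only the best one.
            "inner_hits": {"size": k, "_source": False, "fields": ["passages.text"]},
        },
        "_source": ["file_id"],
        "size": k,
    }


def top_passages(
    response: dict[str, Any], k: int, inner_name: str = "passages"
) -> list[dict[str, Any]]:
    """Flatten inner hits across files and keep the k best-scoring passages."""
    flat: list[dict[str, Any]] = []
    for file_hit in response["hits"]["hits"]:
        inner = file_hit.get("inner_hits", {}).get(inner_name, {})
        for passage in inner.get("hits", {}).get("hits", []):
            flat.append(
                {
                    "file_id": file_hit["_id"],
                    "offset": passage["_nested"]["offset"],  # position in the nested array
                    "score": passage["_score"],
                }
            )
    flat.sort(key=lambda p: p["score"], reverse=True)
    return flat[:k]
```

With a response where file `A` has passages scoring 0.95 and 0.90 and file `B` has one passage scoring 0.70, `top_passages(response, 2)` returns both passages from `A`. Since the vector search is approximate, the flattened list should be treated as a best-effort ranking rather than an exact one.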