Suppose we have a set of long documents, each of which can be broken down into an arbitrary number of chunks (for example, paragraphs). We want to index the documents so that we satisfy the following requirements:
- We want to be able to query and get back specific document chunks that match a query.
- At a minimum, we want to count (and possibly retrieve) the full documents that match a query.
For example, given the query "dog AND cat", I want to be able to retrieve the set of paragraphs that match "dog AND cat". I also want to be able to count (and possibly retrieve) the full documents that match "dog AND cat". This differs from querying just the set of paragraphs because a full document may have "dog" in paragraph 1 and "cat" in paragraph 2, and we don't want to miss such documents (which we would if we queried only the paragraphs).
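To make the difference concrete, here is a minimal Python sketch. The sample documents and helper names are made up for illustration, and simple set matching stands in for the real search engine:

```python
import re

def tokens(text):
    """Lowercase a string and split it into a set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def matches_all(text, terms):
    """True if every term occurs in the text (a naive stand-in for AND)."""
    return set(terms) <= tokens(text)

# Hypothetical corpus: document id -> list of paragraphs.
documents = {
    "doc1": ["The dog barked.", "The cat slept."],   # terms split across paragraphs
    "doc2": ["A dog chased a cat."],                 # both terms in one paragraph
    "doc3": ["Only a dog here."],                    # only one term
}
terms = ["dog", "cat"]

# Chunk-level matching: a paragraph must contain every term by itself.
paragraph_hits = [
    (doc_id, i)
    for doc_id, paragraphs in documents.items()
    for i, paragraph in enumerate(paragraphs)
    if matches_all(paragraph, terms)
]

# Document-level matching: the terms may be spread over different paragraphs.
document_hits = [
    doc_id
    for doc_id, paragraphs in documents.items()
    if matches_all(" ".join(paragraphs), terms)
]

print(paragraph_hits)
print(document_hits)
```

Here doc1 matches at the document level but contributes no paragraph hit, which is exactly the case that a paragraph-only index would miss.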
1. Is there an efficient way of indexing the documents that doesn't require indexing the document chunks separately from the full document? If at all possible, we want to avoid indexing the same content twice, but I can't think of a way to do it that satisfies both of the above requirements.
2. If the answer to 1. is no, then what is the most efficient way to index the chunks and full documents in Elasticsearch (ES)? I was thinking of using a parent-child relationship, where we store the full document and document-level metadata as the parent, and the paragraphs and paragraph-level metadata as the children, so that we can easily return document-level information with a paragraph if necessary.
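
For reference, the parent-child layout I have in mind might look roughly like the sketch below. It assumes Elasticsearch's `join` field type and `has_child` query; the field names (`doc_relation`, `title`, `text`) and the IDs are hypothetical, and the dicts are only request bodies written as Python data, not calls to any client library. The document-level query is one possible approach, not something I have verified to be efficient.

```python
# Index mapping: a single join field relates a parent "document"
# to its child "paragraph" entries (field names are hypothetical).
mapping = {
    "mappings": {
        "properties": {
            "doc_relation": {
                "type": "join",
                "relations": {"document": "paragraph"},
            },
            "title": {"type": "keyword"},  # document-level metadata
            "text": {"type": "text"},      # paragraph content
        }
    }
}

# Parent body: document-level metadata only.
parent_doc = {"title": "Pets", "doc_relation": {"name": "document"}}

def paragraph_doc(text, parent_id):
    """Child body; it must be indexed with routing set to the parent's id."""
    return {
        "text": text,
        "doc_relation": {"name": "paragraph", "parent": parent_id},
    }

def has_child_term(term):
    """Match parents that have at least one paragraph containing the term."""
    return {
        "has_child": {
            "type": "paragraph",
            "query": {"match": {"text": term}},
        }
    }

# Paragraph-level "dog AND cat": both terms in the same paragraph.
paragraph_query = {
    "query": {
        "bool": {
            "must": [{"match": {"text": "dog"}}, {"match": {"text": "cat"}}]
        }
    }
}

# Document-level "dog AND cat": each term in some paragraph of the document.
document_query = {
    "query": {
        "bool": {"must": [has_child_term("dog"), has_child_term("cat")]}
    }
}
```

With this layout the paragraph text would be indexed once, on the children, and the parent would carry only metadata.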