Suppose we have a set of long documents which can be broken down into an arbitrary number of chunks (for example, paragraphs). We want to index the documents in a way such that we can satisfy the following requirements:
We want to be able to query and get back specific document chunks that match a query.
We want, at a minimum, to count (and possibly query and return) the full documents that match a query.
So for example, if I have the query "dog AND cat", I want to be able to query the set of paragraphs that match "dog AND cat". I also want to be able to count (and possibly retrieve) the full documents that match "dog AND cat". This differs from querying just the set of paragraphs, because a full document may have "dog" in paragraph 1 and "cat" in paragraph 2, and we don't want to miss this (which we would if we queried only the paragraphs).
Two questions:
Is there an efficient way of indexing the documents that doesn't require indexing the document chunks separately from the full document? If at all possible, we want to avoid indexing the same content twice, but I can't think of a way to do it that satisfies both of the above requirements.
If the answer to 1. is no, then what is the most efficient way to index the chunks and full documents in ES? I was thinking of a parent-child relationship, where we store the full document and document-level metadata as the parent and the paragraphs and paragraph-level metadata as the children, so that we can easily return document-level information with a paragraph if necessary.
It strongly depends on what you mean by "long documents". What is the volume: 100k? 1,000k? 1,000m? And what is the document count?
Note that the parent/child relationship requires routing, and routing efficiency depends on your shard structure. With few nodes and few shards there is not much difference, but if you can scale to a few dozen or hundreds of nodes, or distribute the documents over several indices, you can handle a large shard count and parent/child can be distributed more comfortably. Shard size is crucial: it should not grow beyond a few GB.
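For what it's worth, in recent Elasticsearch versions a parent/child relationship is modeled with a `join` field in a single index. A minimal mapping sketch, shown here as a Python dict (the field names `doc_paragraph`, `content`, and `doc_id` are hypothetical, not from the thread):

```python
# Sketch of an index mapping using a "join" field so that "paragraph"
# documents are children of "document" parents in the same index.
# Children must be indexed with routing set to the parent's ID, which is
# why routing (and therefore shard structure) matters here.
mapping = {
    "mappings": {
        "properties": {
            "doc_paragraph": {
                "type": "join",  # parent/child link
                "relations": {"document": "paragraph"},
            },
            "content": {"type": "text"},
            "doc_id": {"type": "keyword"},
        }
    }
}
```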
Yes, indexing your paragraphs as Elasticsearch documents with coordinates such as "document ID" and "paragraph ID" makes sense. Your queries will then return exact paragraph coordinates as results.
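A minimal sketch of this flattening step (the field names `doc_id`, `paragraph_id`, and `content` are illustrative, not prescribed by the thread):

```python
def chunk_to_es_docs(doc_id, paragraphs):
    """Flatten one long document into one Elasticsearch document per
    paragraph, each carrying its coordinates (document ID + paragraph ID)."""
    return [
        {
            "_id": f"{doc_id}-{i}",  # stable ID so re-indexing overwrites
            "doc_id": doc_id,        # coordinate 1: which document
            "paragraph_id": i,       # coordinate 2: which paragraph
            "content": text,
        }
        for i, text in enumerate(paragraphs)
    ]

docs = chunk_to_es_docs("report-42", ["the dog barked", "the cat slept"])
# Each hit now tells you exactly which paragraph of which document matched.
```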
"most efficient" always depends on your query use case and how much time/space you want to trade. For example, you can use aggregate query to return the estimated document count for the matched paragraphs. But sometimes you want the exact document count, and maybe a second query would make more sense.
You can index document metadata in a separate "metadata document" or denormalize it onto all paragraphs. The "metadata document" requires an extra "get" request, while the denormalized metadata takes more space and is hard to change once written. So there is always a price to pay, and no exact answer to your question.
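The denormalized variant can be sketched in a few lines (the metadata fields here are entirely hypothetical):

```python
# Hypothetical document-level metadata to be copied onto each paragraph.
doc_meta = {"title": "Annual report", "year": 2020}

paragraph = {"doc_id": "report-42", "paragraph_id": 0, "content": "the dog barked"}

# Denormalized layout: every paragraph carries its own copy of the
# metadata, so no second "get" request is needed at query time --
# at the cost of duplicated storage and painful updates.
paragraph_denorm = {**paragraph, **doc_meta}
```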
Hmm, I'm not sure how indexing with paragraph coordinates gets around the issue in 1. of querying within paragraphs AND within the full document?
To my "dog AND cat" query example, indexing at the paragraph level allows me to query for paragraphs with both dog and cat, but it doesn't seem like it would allow me to find documents where dog appears in one paragraph and cat appears in another in the same document?