Indexing long documents in chunks

Patrick_Lam · May 24, 2016, 4:16am

Suppose we have a set of long documents which can be broken down into an arbitrary number of chunks (for example, paragraphs). We want to index the documents in a way such that we can satisfy the following requirements:

We want to be able to query and get back specific document chunks that match a query.
At a minimum, we want to count (and possibly query and return) the full documents that match a query.

So for example, if I have the query "dog AND cat", I want to be able to query the set of paragraphs that match "dog AND cat". I also want to be able to count (and possibly retrieve) the full documents that match "dog AND cat". This differs from querying just the set of paragraphs because a full document may have "dog" in paragraph 1 and "cat" in paragraph 2, and we don't want to miss this (which we will when just querying the paragraphs).
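
To make the gap concrete, here is a minimal pure-Python sketch (no Elasticsearch involved). The toy corpus, the `terms` tokenizer, and the `matches_all` helper are all made up for illustration; the point is only that matching per paragraph and matching per full document give different document sets.

```python
import re

# Toy corpus (hypothetical data): each document is a list of paragraphs.
docs = {
    "doc1": ["The dog barked.", "The cat slept."],
    "doc2": ["A dog chased a cat."],
    "doc3": ["Only a dog here."],
}

def terms(text):
    """Lowercase a text and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))

def matches_all(text, required):
    """True if every required term occurs in the text (an AND query)."""
    return set(required) <= terms(text)

query = ["dog", "cat"]

# Requirement 1: paragraphs that match "dog AND cat" on their own.
paragraph_hits = [
    (doc_id, i)
    for doc_id, paragraphs in docs.items()
    for i, paragraph in enumerate(paragraphs)
    if matches_all(paragraph, query)
]

# Requirement 2: documents that match "dog AND cat" anywhere in their text.
document_hits = [
    doc_id
    for doc_id, paragraphs in docs.items()
    if matches_all(" ".join(paragraphs), query)
]

# Documents you would find if you only looked at paragraph hits.
docs_via_paragraphs = sorted({doc_id for doc_id, _ in paragraph_hits})

print(paragraph_hits)       # only doc2's single paragraph matches
print(document_hits)        # doc1 matches too: "dog" and "cat" are in different paragraphs
print(docs_via_paragraphs)  # doc1 is missed here
```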

Two questions:

1. Is there an efficient way of indexing the documents that doesn't require indexing the chunks separately from the full document? If at all possible, we want to avoid indexing the same content twice, but I can't think of a way to do that while satisfying both of the above requirements.
2. If the answer to 1. is no, then what is the most efficient way to index the chunks and full documents in ES? I was thinking of a parent-child relationship where we store the full document and document-level metadata as the parent, and the paragraphs and paragraph-level metadata as the children, so that we can easily return document-level information with the paragraph if necessary.

jprante · May 24, 2016, 10:10am

It strongly depends on what you mean by "long documents". Is the volume 100k? 1000k? 1000m? Or do you mean the document count?

Note that a parent/child relationship requires routing, and routing efficiency depends on your shard structure. If you have few nodes and few shards, there is not much difference, but if you can scale your nodes to a few dozen or hundreds, or if you can distribute the documents over several indices, you can handle a large shard count and parent/child can be distributed more comfortably. Shard size is crucial; it should not grow beyond a few GB.
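
As a rough illustration of why parent/child implies routing: a child has to land on the same shard as its parent, so it is routed by the parent's ID rather than its own. The sketch below is a toy model only. `zlib.crc32` stands in for Elasticsearch's real hash function, and `NUM_SHARDS`, `shard_for`, and the IDs are assumptions made up for the example.

```python
import zlib

NUM_SHARDS = 5  # assumed number of primary shards

def shard_for(doc_id, parent_id=None, num_shards=NUM_SHARDS):
    """Pick a shard as hash(routing) % num_shards.

    The routing value is the parent's ID when one is given, otherwise the
    document's own ID. crc32 is a stand-in hash, not the one ES uses.
    """
    routing = parent_id if parent_id is not None else doc_id
    return zlib.crc32(routing.encode("utf-8")) % num_shards

parent_id = "doc-42"
child_ids = ["doc-42-p1", "doc-42-p2", "doc-42-p3"]

parent_shard = shard_for(parent_id)

# Routed by parent ID: every paragraph ends up next to its parent.
child_shards = [shard_for(cid, parent_id=parent_id) for cid in child_ids]

# Routed by their own IDs: paragraphs may scatter across shards.
unrouted_shards = [shard_for(cid) for cid in child_ids]

print(parent_shard, child_shards, unrouted_shards)
```

With many documents, all children of one parent pile onto one shard, which is why shard count and shard size matter for this layout.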

1. Yes, indexing your paragraphs with coordinates, like 'document ID' and 'paragraph ID', as separate Elasticsearch documents will make sense. Your queries will then return exact paragraph coordinates as results.
2. "Most efficient" always depends on your query use case and how much time/space you want to trade. For example, you can use an aggregation query to return the estimated document count for the matched paragraphs. But sometimes you want the exact document count, and maybe a second query would make more sense.

You can index document metadata in a "metadata document" or augment all paragraphs with the metadata. The "metadata document" would require an extra "get" request, while metadata denormalized over all paragraphs takes more space and is hard to change once written. So there is always a price to pay, and no exact answer to your question.
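
A minimal sketch of what this could look like, built as plain Python dicts so it runs without a cluster. The field names (`doc_id`, `paragraph_id`, `text`), the helper names, and the sample metadata are assumptions for illustration. The request body uses a `match` query with `"operator": "and"` plus a `cardinality` aggregation, which gives an approximate count of distinct document IDs among the matching paragraphs.

```python
import json

def paragraph_docs(doc_id, paragraphs, metadata=None):
    """Build one index document per paragraph, tagged with its coordinates.

    If metadata is given, it is copied into every paragraph (the
    denormalized option). Otherwise it is assumed to live in a separate
    metadata document fetched with an extra request.
    """
    extra = dict(metadata or {})
    return [
        {"doc_id": doc_id, "paragraph_id": i, "text": text, **extra}
        for i, text in enumerate(paragraphs)
    ]

def count_query(terms, text_field="text", doc_id_field="doc_id"):
    """Request body: match paragraphs containing all terms and estimate
    how many distinct documents those paragraphs belong to."""
    return {
        "size": 0,  # only the aggregation is wanted, not the paragraph hits
        "query": {
            "match": {text_field: {"query": " ".join(terms), "operator": "and"}}
        },
        "aggs": {
            "matching_documents": {"cardinality": {"field": doc_id_field}}
        },
    }

paragraphs = ["The dog barked.", "The cat slept."]
lean = paragraph_docs("doc1", paragraphs)                    # metadata kept elsewhere
fat = paragraph_docs("doc1", paragraphs, {"title": "Pets"})  # metadata denormalized
body = count_query(["dog", "cat"])

print(json.dumps(body, indent=2))
```

This assumes `doc_id` is indexed as an exact, non-analyzed value so the aggregation groups on whole IDs. Note that the count covers documents with at least one paragraph containing all the terms; it does not count documents where the terms are split across paragraphs.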

Patrick_Lam · May 24, 2016, 4:23pm

Hmm, I'm not sure how indexing with paragraph coordinates gets around the issue in 1. of querying within paragraphs AND querying within the full document?

To return to my "dog AND cat" example: indexing at the paragraph level lets me query for paragraphs containing both dog and cat, but it doesn't seem like it would let me find documents where dog appears in one paragraph and cat appears in another paragraph of the same document?
