Indexing long documents in chunks

That depends strongly on what you mean by "long documents". Is the volume per document 100k? 1,000k? 1,000m? Or do you mean the document count?

Note that a parent/child relationship requires routing, and routing efficiency depends on your shard structure. With few nodes and few shards there is not much difference, but if you can scale out to a few dozen or a few hundred nodes, or distribute the documents over several indices, you can handle a large shard count and parent/child can be distributed more comfortably. Shard size is crucial: it should not grow beyond a few GB.
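To make the routing requirement concrete, here is a minimal sketch of a parent/child setup using Elasticsearch's `join` field type. The index and field names (`doc_paragraph`, `doc-1`) are made up for illustration; the key point is that every child document must be indexed with an explicit routing value so it lands on its parent's shard.

```python
# Mapping with a join field linking "doc" (parent) to "paragraph" (child).
mapping = {
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "doc_paragraph": {
                "type": "join",
                "relations": {"doc": "paragraph"},
            },
        }
    }
}

# A child document names its parent in the join field, and the index
# request must carry the same value as routing (e.g. ?routing=doc-1),
# so parent and children end up on the same shard.
child_doc = {
    "text": "First paragraph of the report ...",
    "doc_paragraph": {"name": "paragraph", "parent": "doc-1"},
}
routing = "doc-1"
```

Because all children of a parent share one shard, a single huge parent family cannot be split across shards, which is why shard sizing matters here.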

  1. Yes, indexing your paragraphs as Elasticsearch documents with coordinates, such as a document ID and a paragraph ID, makes sense. Your queries will then return exact paragraph coordinates as results.

  2. "Most efficient" always depends on your query use case and on how much time/space you are willing to trade. For example, you can use an aggregation query to return an estimated document count for the matched paragraphs. But sometimes you want the exact document count, and then a second query may make more sense.
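Point 1 can be sketched as a small splitter that turns one long document into paragraph-level Elasticsearch documents. The field names (`doc_id`, `paragraph_id`) and the ID scheme are assumptions, not a fixed convention:

```python
def paragraph_docs(doc_id: str, text: str):
    """Yield one indexable document per paragraph, with its coordinates."""
    for n, para in enumerate(text.split("\n\n")):
        yield {
            "_id": f"{doc_id}-{n}",   # stable ID: document + paragraph number
            "doc_id": doc_id,          # coordinate 1: which document
            "paragraph_id": n,         # coordinate 2: which paragraph
            "text": para.strip(),
        }

docs = list(paragraph_docs("report-42", "First paragraph.\n\nSecond paragraph."))
```

A stable `_id` like `report-42-0` also makes re-indexing a changed document idempotent, since each paragraph overwrites its previous version.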
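For point 2, the estimated document count can come from a `cardinality` aggregation over the document-ID field. This is a sketch of the request body only; the field names follow the assumptions above, and note that `cardinality` is approximate by design, which is exactly the time/space trade-off mentioned:

```python
# Search body: match paragraphs, skip the hits themselves (size 0),
# and estimate how many distinct documents the matches belong to.
query = {
    "query": {"match": {"text": "elasticsearch"}},
    "size": 0,
    "aggs": {
        "distinct_docs": {
            "cardinality": {"field": "doc_id"}  # approximate distinct count
        }
    },
}
```

If you need the exact count instead, a second query (or a `terms` aggregation with a sufficiently large size, at higher memory cost) is the usual alternative.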

You can index document metadata in a separate "metadata document" or denormalize the metadata onto all paragraphs. The "metadata document" requires an extra "get" request, while denormalized metadata across all paragraphs takes more space and is hard to change once written. So there is always a price to pay, and no exact answer to your question.
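The denormalized variant above can be sketched as a one-liner that copies the document-level metadata onto every paragraph at index time. Field names (`author`, `published`) are made-up examples:

```python
def denormalize(paragraphs, metadata):
    """Return paragraph docs augmented with the shared document metadata."""
    return [{**p, **metadata} for p in paragraphs]

paras = [{"doc_id": "r-1", "paragraph_id": 0, "text": "..."}]
meta = {"author": "jdoe", "published": "2020-01-01"}
docs = denormalize(paras, meta)
```

The cost is visible here: changing `author` later means re-indexing every paragraph of the document, whereas the metadata-document variant updates one document but costs an extra "get" per search result.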