We're planning to ingest large XML payloads into our ES cluster and will need to run full-text queries on them.
The payload can be split into multiple documents if needed, but before implementing the ingestion Logstash pipeline we'd like to know whether a full-text query performs better on many small documents or on a few large ones.
Hi Alessio,
Generally speaking, in the Lucene world (or with any inverted-index lookup), the performance of boolean matching (the full-text matching phase, before scoring) does not depend on the number of documents in your index.
It depends instead on the size of your vocabulary, i.e. the total number of distinct terms in your entire indexed corpus, also called the dictionary size |T|. More formally, a term lookup is O(log |T|).
Scoring and sorting then cost, at best, time proportional to the number of matching results. So splitting the same amount of text into smaller Elasticsearch documents won't have any positive impact on search performance.
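To make the O(log |T|) point concrete, here is a toy sketch (not Lucene's actual implementation, just an illustration) of an inverted index: a sorted term dictionary searched by binary search, plus postings lists. The documents and helper names are hypothetical; the point is that the lookup cost scales with the vocabulary size, not the document count.

```python
import bisect

# Hypothetical tiny corpus: doc ID -> text.
docs = {
    1: "quick brown fox",
    2: "lazy brown dog",
    3: "quick red fox",
}

# Build the inverted index: term -> set of doc IDs containing it.
postings = {}
for doc_id, text in docs.items():
    for term in text.split():
        postings.setdefault(term, set()).add(doc_id)

# The sorted term dictionary; its length is |T|, the vocabulary size.
dictionary = sorted(postings)

def lookup(term):
    # Binary search over the dictionary: O(log |T|) comparisons,
    # regardless of how many documents the corpus holds.
    i = bisect.bisect_left(dictionary, term)
    if i < len(dictionary) and dictionary[i] == term:
        return postings[term]
    return set()

print(sorted(lookup("quick")))                    # -> [1, 3]
print(sorted(lookup("brown") & lookup("fox")))    # boolean AND -> [1]
```

Adding more documents grows the postings lists (which affects the scoring phase), but as long as the vocabulary stays the same size, the dictionary lookup itself costs the same.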
To sum up: for search performance, keep your documents as big as possible, provided they still satisfy your search requirements in terms of searchable units.