We're planning to ingest large XML payloads into our ES cluster and will need to run full-text queries on them.
The payload can be split into multiple documents if needed, but before implementing the ingestion Logstash pipeline we'd like to know whether a full-text query performs better on many small documents or on a few large ones.
Hi Alessio,
Generally speaking, in the Lucene world (or with any inverted-index lookup), the performance of boolean matching (the full-text matching phase, before scoring) does not depend on the number of documents in your index.
It depends instead on the size of your vocabulary, i.e. the total number of distinct terms in your entire indexed corpus, also called the dictionary size |T|. More formally, a term lookup is O(log |T|).
Scoring and sorting then cost, at best, time proportional to the number of matching results. So splitting the same amount of text into smaller Elasticsearch documents won't have any positive impact on search performance.
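To make the O(log |T|) point concrete, here is a toy sketch (not Lucene's actual implementation, just an illustration) of an inverted index: a sorted term dictionary searched by binary search, plus postings lists. The documents and helper names are hypothetical; the point is that the lookup cost scales with the vocabulary size, not the document count.

```python
import bisect

# Hypothetical tiny corpus: doc ID -> text.
docs = {
    1: "quick brown fox",
    2: "lazy brown dog",
    3: "quick red fox",
}

# Build the inverted index: term -> set of doc IDs containing it.
postings = {}
for doc_id, text in docs.items():
    for term in text.split():
        postings.setdefault(term, set()).add(doc_id)

# The sorted term dictionary; its length is |T|, the vocabulary size.
dictionary = sorted(postings)

def lookup(term):
    # Binary search over the dictionary: O(log |T|) comparisons,
    # regardless of how many documents the corpus holds.
    i = bisect.bisect_left(dictionary, term)
    if i < len(dictionary) and dictionary[i] == term:
        return postings[term]
    return set()

print(sorted(lookup("quick")))                    # -> [1, 3]
print(sorted(lookup("brown") & lookup("fox")))    # boolean AND -> [1]
```

Adding more documents grows the postings lists (which affects the scoring phase), but as long as the vocabulary stays the same size, the dictionary lookup itself costs the same.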
To sum up: for search performance, keep your documents as big as possible, provided they still satisfy your search requirements in terms of searchable units.