At one of my customer projects we work with documents that contain a very large text field (the content of eBooks).
We saw that queries slow down by more than 100x when such documents are queried, even if we use a source filter and exclude this field from the query! The only solution we found is to exclude the text field from the _source at index time:
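At index time this looks roughly like the following (a minimal sketch; the index name and the field name book_text are placeholders, the actual mapping is of course more complete):

PUT index_without_large_text
{
  "mappings": {
    "_source": {
      "excludes": ["book_text"]
    },
    "properties": {
      "book_text": { "type": "text" }
    }
  }
}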
I found the following blog post that explains the issue in detail:
What I don't understand:
Why is the search so slow even when using source filtering in the query? There should be no need to fetch, retrieve and merge the excluded fields. I was expecting that when using source filtering in the request, something like the example below, the large text field wouldn't impact the performance of this specific query at all and would simply be ignored.
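(Sketch of what I mean; book_text stands in for the actual large field and the query is only an example.)

GET index_with_large_text/_search
{
  "_source": {
    "excludes": ["book_text"]
  },
  "query": {
    "query_string": { "query": "any_field:something" }
  }
}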
Currently we are also using Elasticsearch as a datastore. If such large documents slow things down, Elasticsearch doesn't seem to be an ideal datastore (in contrast to MongoDB)?
CPU overhead: the JSON parser still needs to skip over the large text field in order to exclude it from the _source, which is linear in the size of your JSON doc.
Disk overhead: those large fields make the index larger, so the filesystem cache can only hold a smaller fraction of the total index size.
Thanks @jpountz!
It doesn't seem to be the JSON parsing.
If we use the search without any search terms, it's fast as hell (2 ms):
GET index_with_large_text/_search
If we use a simple search term, there is the performance problem (152 ms):
GET index_with_large_text/_search?q=any_field:something
So JSON parsing doesn't seem to be the problem. It must be something with the "disk overhead" during the search or merge phase, right? Do you think increasing the RAM/heap size can fix the problem?
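One idea to narrow it down (just a sketch, we haven't tried it yet): run the same query with _source disabled entirely and see if the slowdown disappears. If it does, the time goes into loading and parsing the large _source documents during the fetch phase, not into the query itself.

GET index_with_large_text/_search
{
  "_source": false,
  "query": {
    "query_string": { "query": "any_field:something" }
  }
}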
Can you confirm you are not using really large size values? Also, when you say 100x slower, what order of magnitude of response times are we talking about? Is it 100% reproducible?
When indexed without the large text field in _source, the "took" time is around 1-2 ms.
With this field in _source: 90-120 ms (around 100x slower).
Yes, it's always reproducible.
For our tests we are using the default size of 10 hits to be returned (see the example below).
If we use size=1 it's much faster; a size of 10000 slows things down much more.
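For reference, this is how size is set in the test requests (same placeholder query as above). Since every returned hit has its _source loaded and parsed, a larger size multiplies the per-document overhead:

GET index_with_large_text/_search
{
  "size": 10,
  "query": {
    "query_string": { "query": "any_field:something" }
  }
}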
Currently we are thinking about not storing those large texts in Elasticsearch and using MongoDB for this instead. But we would lose highlighting and some nice-to-have features like reindexing and updates within Elasticsearch.