ES: 7.4.1
I have a tiny little index (~100MB). The mapping is simple and naive - with no nested objects whatsoever. There's one field (an array of objects) which is kind of big, and sometimes I'd like to exclude that from the resulting response.
The thing is, if either "excludes" or "includes" is applied, the overall execution time (took) roughly doubles.
Here's an easier way to reproduce the issue:
'_source' => ['includes' => ['id']] => 30ms
Without any includes/excludes => 15ms
What's the reason for it? Take into account that the whole index is preloaded into RAM ('index.store.preload' => ['nvd', 'dvd', 'tim', 'doc', 'dim']) for the sake of better performance.
What's the mechanism behind that? I understand that in case of no includes/excludes ES can just map the entire index memory to the response without any additional processing. But why does it take that long to exclude/include unnecessary fields since all the data is in RAM already?
It seems to be a very quick O(n) operation. Thanks for your work!
If you don't ask for any source manipulation then Elasticsearch treats the document as an opaque sequence of bytes, which it can handle very efficiently. If it has to manipulate the source at all then it must parse it, convert it into a tree of freshly-created objects, exclude the bits you want excluding, and then convert it back into a sequence of bytes for further processing. This extra work can be quite significant.
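To make the difference concrete, here is a rough Python sketch of the two code paths. This is not Elasticsearch's actual implementation; the function names and the sample document are made up for illustration, and it only handles top-level keys.

```python
import json

def passthrough(source_bytes):
    # No includes/excludes: the stored bytes are handed on untouched.
    return source_bytes

def filter_source(source_bytes, includes):
    # includes given: parse the bytes into freshly created objects,
    # keep only the requested top-level keys, then serialize again.
    doc = json.loads(source_bytes)
    kept = {key: value for key, value in doc.items() if key in includes}
    return json.dumps(kept).encode("utf-8")

# Hypothetical document with one large array-of-objects field.
doc = {"id": 1, "big": [{"n": i, "label": "item-%d" % i} for i in range(1000)]}
raw = json.dumps(doc).encode("utf-8")

untouched = passthrough(raw)
filtered = filter_source(raw, {"id"})
```

Note that filter_source has to parse the whole document, including the big field it is about to throw away, so the cost depends on the size of the source rather than on the size of what you keep.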
Note that the source is a stored field, but you are not preloading the stored fields file. Also note that preloading is only a best-effort process and does not guarantee that this data remains in RAM.
An alternative would be to store the field(s) that you do want for these queries rather than parsing them from the source each time.
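A minimal sketch of what that could look like, written as Python dicts that mirror the JSON request bodies. The field name "id" comes from the example above; the "keyword" type and the match_all query are assumptions for illustration only.

```python
import json

# Index creation body: "id" is additionally kept as its own stored field.
stored_mapping = {
    "mappings": {
        "properties": {
            "id": {"type": "keyword", "store": True},
        }
    }
}

# Search body: skip _source entirely and read only the stored field.
stored_search_body = {
    "query": {"match_all": {}},
    "_source": False,
    "stored_fields": ["id"],
}

print(json.dumps(stored_search_body, indent=2))
```

With this approach the values come back under "fields" in each hit (as arrays) rather than under "_source".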
Another alternative is to exclude this field at index time. This has downsides, of course (no reindexing, no updates, etc.) but maybe that's ok for you.
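For reference, an index-time exclusion is configured in the mapping, roughly like the sketch below (again a Python dict mirroring the JSON body). The field name "big_field" and its "object" type are hypothetical stand-ins for the large array-of-objects field.

```python
import json

# Index creation body: "big_field" is still indexed, but it is removed
# from the _source that gets stored, so it is never returned in hits.
excluded_mapping = {
    "mappings": {
        "_source": {"excludes": ["big_field"]},
        "properties": {
            "id": {"type": "keyword"},
            "big_field": {"type": "object"},
        },
    }
}

print(json.dumps(excluded_mapping, indent=2))
```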
Thanks for the answer! I'll consider using 'index.store.preload' => ['*'] then.
An alternative would be to [store] the field(s)
Another alternative is to [exclude this field at index time]
I believe that neither of those solutions would let me include the field in the resulting response (sometimes I need that). I am not querying this field; I am only interested in seeing it in the response (sometimes).
It seems like the only way to achieve that is to have an extra index (with those fields I need sometimes)?