on Elasticsearch 1.5.0, I'm dealing with ~50 million documents. My index contains an integer field and a nested object also containing an integer field.
If I run a simple match_all query sorting it by (the external) id, the result is returned in around 200 ms. If on the other hand I try running the same, sorting it using "sort": "nested.id", the query takes more than 2 minutes to return. I haven't specified any special treatment for the field data, but the same timing applies both trying the query multiple times and specifying eager loading for the nested id field data.
Why is this happening? Would anybody be able to point me to any resources explaining how sorting is carried on when applied to nested fields? Is there any way I can make the two speeds at least comparable?
Sorting on nested fields is a bit slower since elasticsearch needs to go to all your nested docs to figure out the minimum value to use as a sort value. However I'm surprised this can go from 200ms to several minutes. Can you try to capture some node hot threads while the query is running? https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-nodes-hot-threads.html
I've actually proceeded quite a little in this, but I still have a very hard time understanding the overall reason for my problem.
If the nested field exists in all documents (or, actually, just most of them), the sort is almost as fast as if sorting on a normal field;
If the nested object containing the field only exists in a very small number of documents, the search is extremely slow (like a couple of order of magnitudes slower);
If a nested_filter is used when sorting on the nested field, the behavior is parallel to what previously explained:
if the filter returns little or no documents, the search is extremely slow;
if the filter returns enough documents (~40k for a search with total number of hits ~) the search is once again extremely slow.
I will certainly try to look into your suggestion.
My use case required me to return a set of documents sorted using nested objects, followed by everything else scored normally. I wanted to solve this by adding two sort values: the nested sorting first and _score second. Unfortunately this turned out to be extremely slow for the reasons mentioned above (I had searches with millions of results, with only a handful of documents containing nested fields).
My solution was to perform two distinct searches at the application level (still in a way that was completely transparent to the caller):
first filter for documents containing the nested object you want to search on, and sort them: this is going to be fast since you are ensuring that the nested documents are actually there;
than (if necessary) perform a search for the rest of the documents and return them if needed.
IMPORTANT
After we deployed this system, memory consumption skyrocketed: the only solution was to move to Elasticsearch 2.1 (where the memory issue is completely solved). I have the impression that the memory issues were caused by nested sorting, but I did not have the resources to investigate this thoroughly. Just a warning.
Updating to Elasticsearch 2.x has speed up the sorting on nested fields by orders of magnitude. Same query on 1.5.2 took 60 seconds, but was less than 200 milliseconds on 2.1.2.
Other approach we took (which doesn't require upgrading to 2.x) was to make multiple calls to Elasticsearch, dividing the whole result set into subset having values of the nested fields (and sorting those values) and a subset missing values of the nested fields (which didn't require sorting, or had a fallback sort applied). The final result was build on the application side and pagination is possible because each call returns the total count of the subset.
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.