Elasticsearch fast query but slow response time when retrieving _source even if nested fields are in _source_exclude


#1

I have the following mapping

{
  "yellows" : {
    "aliases" : { },
    "mappings" : {
      "yellow" : {
        "properties" : {
          "ranges" : {
            "type" : "nested",
            "properties" : {
              "geometry" : {
                "type" : "geo_shape"
              },
              "id" : {
                "type" : "long"
              },
              "other1" : {
                "type" : "keyword"
              },
              "other2" : {
                "type" : "long"
              },
              "other3" : {
                "type" : "long"
              }
            }
          }
          ...
        } 
      }
    }
  }
}

Queries get slower and slower as size increases. For example:

curl 'https://path/to/elastic/yellows/_search?_source_exclude=ranges&from=0&size=50' --data-binary '{"query":{"bool":{"must":[],"filter":{"bool":{"filter":[{"terms":{"...":["1"]}},{"terms":{"...":["..."]}}],"should":[]}}}},"sort":[{"...":{"order":"asc"}}]}'
# size 50 -> "took":71

curl 'https://path/to/elastic/yellows/_search?_source_exclude=ranges&from=0&size=100' --data-binary '{"query":{"bool":{"must":[],"filter":{"bool":{"filter":[{"terms":{"...":["1"]}},{"terms":{"...":["..."]}}],"should":[]}}}},"sort":[{"...":{"order":"asc"}}]}'
# size 100 -> "took":1421

At the same time, queries with size=0 or with _source=false are fast. For example:

curl 'https://path/to/elastic/yellows/_search?_source_exclude=ranges&from=0&size=0' --data-binary '{"query":{"bool":{"must":[],"filter":{"bool":{"filter":[{"terms":{"...":["1"]}},{"terms":{"...":["..."]}}],"should":[]}}}},"sort":[{"...":{"order":"asc"}}]}'
# size 0 -> "took":32

curl 'https://path/to/elastic/yellows/_search?_source=false&from=0&size=100' --data-binary '{"query":{"bool":{"must":[],"filter":{"bool":{"filter":[{"terms":{"...":["1"]}},{"terms":{"...":["..."]}}],"should":[]}}}},"sort":[{"...":{"order":"asc"}}]}'
# _source=false -> "took":167

That means that queries retrieving the _source (i.e. without _source=false or size=0) are slower. Also, it seems that the more ranges there are in the retrieved documents, the slower the response. In the following I'm using wc -c as a proxy for how many ranges are in the retrieved documents. It's not the best measure, but it should suffice:

curl 'https://path/to/elastic/yellows/_search?from=0&size=50' --data-binary '{"query":{"bool":{"must":[],"filter":{"bool":{"filter":[{"terms":{"...":["1"]}},{"terms":{"...":["..."]}}],"should":[]}}}},"sort":[{"...":{"order":"asc"}}]}' | wc -c
# 2.332.822

curl 'https://path/to/elastic/yellows/_search?from=50&size=50' --data-binary '{"query":{"bool":{"must":[],"filter":{"bool":{"filter":[{"terms":{"...":["1"]}},{"terms":{"...":["..."]}}],"should":[]}}}},"sort":[{"...":{"order":"asc"}}]}' | wc -c
# 38.591.502

As you can see, the first 50 hits have far fewer ranges than the second 50 of the first 100. Also, notice that in the first snippet the query for the first 50 ("took":71) is much faster than the query for the first 100 ("took":1421), so most of the time is spent on the second 50, even though both requests have _source_exclude=ranges.
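A more direct measure than wc -c would be to count the ranges in each hit. The following is only a minimal sketch: it assumes the response was fetched without _source_exclude, so that ranges is present in each _source, and the sample response is hand-written for illustration, not real output from this index.

```python
import json


def ranges_per_hit(response_text):
    """Return (hit _id, number of nested ranges) for every hit in a search response."""
    response = json.loads(response_text)
    counts = []
    for hit in response["hits"]["hits"]:
        # "ranges" is the nested field from the mapping; a hit may lack it.
        ranges = hit.get("_source", {}).get("ranges", [])
        # A nested field holding a single object comes back as a dict, not a list.
        if isinstance(ranges, dict):
            ranges = [ranges]
        counts.append((hit["_id"], len(ranges)))
    return counts


# Hypothetical response body, used only to show the shape of the result.
sample = json.dumps({
    "took": 71,
    "hits": {"hits": [
        {"_id": "a", "_source": {"ranges": [{"id": 1}, {"id": 2}]}},
        {"_id": "b", "_source": {"ranges": {"id": 3}}},
        {"_id": "c", "_source": {}},
    ]},
})

print(ranges_per_hit(sample))  # [('a', 2), ('b', 1), ('c', 0)]
```

Piping the curl output into a script like this gives a per-document count that can be compared against "took" directly.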

It seems to me that the query is not the bottleneck: with size=0 or with _source=false the response time is small. So I suspect the cause is that ranges is a nested field and Elasticsearch still takes it into consideration even when the request excludes it (i.e. _source_exclude=ranges).

Is there any other way to make the queries faster without changing the mapping or should I change the mapping so that ranges are not nested?


(Zachary Tong) #2

Yep, your experiments are spot on. You're correct that including the _source slows down the query, and that larger documents tend to slow it down more than smaller documents.

I believe -- although this could be wrong, checking to see if it's true -- that nested docs require additional file seeks, which would explain the slowdown with documents that have more nested docs than others.

Regardless, _source_exclude is not going to help with speed because the main slowdown is the file seek and loading from disk. This happens regardless of source filtering, which can only filter the source after it has already been loaded. It may even slow things down further, as the fetch phase has to A) load the entire source, then B) parse it and exclude a portion of it before returning.
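To make the A) then B) ordering concrete, here is a toy model in Python. It is not Elasticsearch code, and the function and field names are made up for illustration. It only shows that when filtering runs after loading, the amount read is the same with or without an exclude; only the response gets smaller.

```python
import json


def fetch_source(stored_bytes, exclude=()):
    """Toy model of fetching one hit: read everything, then filter.

    Returns (filtered_source, bytes_read). Illustration only.
    """
    # Step A: the whole stored _source is read, whatever the filter says.
    bytes_read = len(stored_bytes)
    source = json.loads(stored_bytes)
    # Step B: only now can the excluded top-level fields be dropped.
    filtered = {k: v for k, v in source.items() if k not in exclude}
    return filtered, bytes_read


# A made-up document with many nested ranges.
doc = {"name": "y1", "ranges": [{"id": i, "other1": "x" * 50} for i in range(1000)]}
stored = json.dumps(doc).encode("utf-8")

full, read_full = fetch_source(stored)
slim, read_slim = fetch_source(stored, exclude=("ranges",))

print(read_full == read_slim)                # True: same amount read either way
print(len(json.dumps(slim)) < len(stored))   # True: only the response shrinks
```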

I don't think there is anything that can be optimized here other than returning fewer hits at a time, or moving your nested docs over to a parent/child scheme. That should give you the same relational data, but puts the nested values into independent child documents which won't affect the retrieval of the parent.
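As a rough sketch of what the parent/child layout could look like, assuming Elasticsearch 6.x or later where it is expressed with a join field (5.x uses a _parent mapping instead). The field name relation and the relation names yellow and range are hypothetical, and where the properties block goes in the index mapping depends on the version.

```python
import json

# Mapping fragment: parents and children live in the same index, so the former
# nested properties become ordinary top-level fields next to the join field.
properties = {
    "relation": {"type": "join", "relations": {"yellow": "range"}},
    "geometry": {"type": "geo_shape"},
    "id": {"type": "long"},
    "other1": {"type": "keyword"},
    "other2": {"type": "long"},
    "other3": {"type": "long"},
}


def parent_doc(fields):
    """Body of a parent document: it no longer carries any ranges."""
    return dict(fields, relation="yellow")


def child_doc(parent_id, range_fields):
    """Body and routing value for one range stored as a child document.

    A child must be indexed with routing set to its parent's id so that
    both end up on the same shard.
    """
    body = dict(range_fields, relation={"name": "range", "parent": parent_id})
    return body, parent_id


def parents_with_matching_range(range_query):
    """Search body returning parents that have at least one matching child."""
    return {"query": {"has_child": {"type": "range", "query": range_query}}}


body, routing = child_doc("42", {"id": 7, "other1": "a"})
print(json.dumps(body), routing)
print(json.dumps(parents_with_matching_range({"term": {"other1": "a"}})))
```

With this layout, fetching a parent hit only loads the parent's own _source. The trade-off is that has_child queries are generally more expensive than nested queries.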

