Elasticsearch query performance when _source is disabled

Hi everyone, we have an Elasticsearch index with 100 million documents (about 400 million including replicas). The index contains nested documents as well.

We have a use case where we need to boost the score of documents using some fields present in the document. For this boosting we are using a function_score query.
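For illustration, a function_score query of this shape could look like the sketch below; field_value_factor on a popularity field is just an example here, not our real boosting logic:

POST /elastic_index_name/_search
{
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "functions": [
        {
          "field_value_factor": {
            "field": "popularity",
            "modifier": "log1p",
            "missing": 1
          }
        }
      ]
    }
  }
}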

Our response time when we disable fetching the source is less than 30ms. We disable it with the _source=false parameter on the endpoint:

https://<elastic_endpoint>/elastic_index_name/_search?_source=false
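The same can also be set in the request body instead of as a URL parameter (match_all here is just a placeholder query):

POST /elastic_index_name/_search
{
  "_source": false,
  "query": {
    "match_all": {}
  }
}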

However, when we enable the source fetch, the response time for the same query goes above 2 seconds.

We tried to debug this with the Profile API, but based on the docs it doesn't look like it reports the time spent in the fetch phase. As a result, the timings in the profile output are only a few milliseconds, the same as when we run the query with _source disabled.
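For reference, profiling is enabled by adding "profile": true to the request body, e.g.:

POST /elastic_index_name/_search
{
  "profile": true,
  "query": {
    "match_all": {}
  }
}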

We also tried other forms of scoring, such as rank_feature fields and a script_score query, but we haven't had any luck.
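The script_score variant was along these lines (a sketch; popularity stands in for the real numeric field we score on):

POST /elastic_index_name/_search
{
  "query": {
    "script_score": {
      "query": {
        "bool": {
          "filter": [
            // Range and term filter queries
          ]
        }
      },
      "script": {
        "source": "_score * (1 + doc['popularity'].value)"
      }
    }
  }
}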

Can someone please share if they have any insights into this issue? Please let me know if any more details are needed from my end.

What sort of storage do your nodes use?

If you are using nested documents I assume your average document size might be reasonably large. Retrieving the source requires Elasticsearch to perform a lot of small reads across the file system, and this will, as Mark pointed out, be affected by how fast your storage is.

Our storage is Amazon EBS.

What type of EBS?

How much storage do you have per node?

How much data do you have per node?

Our index size is about 17GB. Very few documents have a very large number of nested docs in them. However, for this use case we aren't searching or retrieving source information from the nested docs. Our query looks something like this:

{
	"from": 0,
	"size": 10,
	"query": {
		"function_score": {
			"query": {
				"bool": {
					"must": [{
						"match_all": {
							"boost": 1
						}
					}],
					"filter": [
						//Range and term filter queries
					]
				}
			},
			"functions": [
				{
					// Functions....
				}
			]
		}
	}
}

I checked the node query cache stats for this query as well. Strangely, the range and term filter queries are not being served from the cache here.
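For reference, the query cache numbers can be pulled from the stats APIs, e.g.:

GET /_nodes/stats/indices/query_cache
GET /elastic_index_name/_stats/query_cache

Both report hit/miss counts and the current cache size.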

We have another use case where the search is directed at specific fields for some search keywords. There, the match_all part in the bool must clause is replaced with a multi_match query on those fields, and we do some nested searches as well. However, that is very fast (under 50ms).
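That query has the same overall shape and looks roughly like this (title and description are placeholders for our real search fields):

{
  "from": 0,
  "size": 10,
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [{
            "multi_match": {
              "query": "search keywords",
              "fields": ["title", "description"]
            }
          }],
          "filter": [
            // Range and term filter queries
          ]
        }
      },
      "functions": [
        {
          // Functions....
        }
      ]
    }
  }
}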

What type of EBS are you using? What is the size of the volume?

We are using GP2 EBS volumes. Each node has about 100GB of storage space available and about 28GB of data.

Our cluster size is 7 nodes (3 Master, 4 Data Nodes).

Small gp2 EBS volumes do not necessarily support a lot of IOPS: gp2 gives a baseline of 3 IOPS per GiB, so a 100GB volume only has around 300 IOPS before it has to rely on burst credits. Check IO while you are querying, e.g. with iostat, to see if it is the storage that is slowing down the response.

I was able to grab some IO statistics using the _nodes/stats API. I am not sure how to read these stats to find the bottleneck:
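The exact call was:

GET /_nodes/stats/fs

From what I can tell, the io_stats counters are cumulative since node startup, so a single snapshot like the ones below may be hard to interpret.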
Master Node:

{
  "fs" : {
        "timestamp" : 1666816065822,
        "total" : {
          "total_in_bytes" : 7397171200,
          "free_in_bytes" : 4712013824,
          "available_in_bytes" : 4695236608
        },
        "data" : [ {
          "type" : "ext4",
          "total_in_bytes" : 7397171200,
          "free_in_bytes" : 4712013824,
          "available_in_bytes" : 4695236608
        } ],
        "io_stats" : {
          "devices" : [ {
            "operations" : 105227556,
            "read_operations" : 26983,
            "write_operations" : 105200573,
            "read_kilobytes" : 480896,
            "write_kilobytes" : 1131588120
          } ],
          "total" : {
            "operations" : 105227556,
            "read_operations" : 26983,
            "write_operations" : 105200573,
            "read_kilobytes" : 480896,
            "write_kilobytes" : 1131588120
          }
        }
      }
}

Data Node:

{
    "fs" : {
      "timestamp" : 1666816065822,
      "total" : {
        "total_in_bytes" : 105554829312,
        "free_in_bytes" : 76648398848,
        "available_in_bytes" : 71262912512
      },
      "data" : [ {
        "type" : "ext4",
        "total_in_bytes" : 105554829312,
        "free_in_bytes" : 76648398848,
        "available_in_bytes" : 71262912512
      } ],
      "io_stats" : {
        "devices" : [ {
          "operations" : 774302927,
          "read_operations" : 78,
          "write_operations" : 774302849,
          "read_kilobytes" : 312,
          "write_kilobytes" : 12612589788
        } ],
        "total" : {
          "operations" : 774302927,
          "read_operations" : 78,
          "write_operations" : 774302849,
          "read_kilobytes" : 312,
          "write_kilobytes" : 12612589788
        }
      }
    }
}

Ours is a read-intensive cluster, so I am not sure why write_operations here is so much higher than read_operations. Or is it the other way around, i.e. does write_operations somehow cover the fetch phase?

Run the command iostat -x from the command line on the host when queries are slow. This should show you if storage might be an issue.
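For example, with a 5 second reporting interval:

iostat -x 5

Keep an eye on the r/s, w/s, await (or r_await/w_await, depending on the sysstat version) and %util columns; sustained %util near 100% or high await values while the slow queries run would point at the storage.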

I think disk IOPS was the bottleneck for us. We had some very heavy documents (about 35-40MB in size), and retrieving their _source was really slow.

What we ended up doing was excluding the heavy nested fields from the _source field. We didn't really need them returned in the response, but we did need them in the index for some of the scoring.
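Concretely, the change was to exclude those fields from _source in the index mapping, roughly like this (heavy_nested_field is a placeholder for our real field name):

PUT /elastic_index_name
{
  "mappings": {
    "_source": {
      "excludes": ["heavy_nested_field"]
    },
    "properties": {
      "heavy_nested_field": {
        "type": "nested"
      }
    }
  }
}

Since _source is written at index time, this only applies to newly indexed documents, so existing data has to be reindexed for it to take effect. The excluded fields remain searchable and usable for scoring because they are still indexed; they just aren't stored in _source anymore.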

Thank you for helping us with this issue. Much appreciated!
