Running Rally against AWS Elasticsearch

I created a simple challenge and an operation with a custom Parameter Source. The idea behind this implementation is to hit an AWS-managed Elasticsearch cluster with the defined challenge, which runs the operation for 200 seconds with 15 clients at a target throughput of 100. The operation is just a simple Parameter Source which currently returns a static match_all query. The later plan for this operation (just FYI) is to randomly change a field in the query and pass that query along.
So when I run the track with this simple "perf-test" challenge, I notice that the search operations/min shown in the AWS console does not match the ops/sec reported in the final Rally report.
For instance, when I get 100 as the max throughput in the Rally report, the max search operations per minute in the CloudWatch console is around 34.5k. As far as I understood from the Rally manual, throughput in the Rally world is the number of operations per second, where an operation here is the simple "match_all" custom Parameter Source I created, so one operation corresponds to one match-all search query. By that logic I should expect only 6,000 search operations/min in the AWS console for a throughput of 100 in Rally, but I cannot find any reason for this discrepancy.

Am I missing anything here?

Reference:

AWS ES cluster information

  1. Version: 6.5

Custom Parameter Source

import random  # needed if the commented-out cache-busting lines below are re-enabled

class SearchRandomUtteranceSource(QueryParamSource):
    def params(self):
        query_terms = list(self.terms)  # copy
        # query_terms.append(str(random.randint(1, 1000)))  # avoid caching
        default_index = "document"
        result = {
            "body": {
                "query": {
                    "match_all": {}
                }
            },
            "index": self._params.get("index", default_index),
            "type": None,
            # "use_request_cache" is the old name (before Rally 0.10.0). Remove after some grace period.
            "use_request_cache": self._params.get("cache", False),
            "cache": self._params.get("cache", False)
        }
        # print(random.choice(query_terms))
        return result
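For completeness, a param source like this has to be registered with Rally via a `register()` hook in the track's `track.py` (`register_param_source_for_name()` is Rally's plugin API; the stub classes below are stand-ins so this sketch runs outside Rally):

```python
# Sketch of how Rally picks up a custom param source. The register() hook
# and register_param_source_for_name() are Rally's plugin mechanism; the
# stubs below only exist so this sketch runs standalone.

class StubParamSource:
    # Stand-in for the SearchRandomUtteranceSource shown above.
    def __init__(self, track=None, params=None, **kwargs):
        self._params = params or {}

    def params(self):
        return {"index": self._params.get("index", "document")}

class StubRegistry:
    # Stand-in for the registry object Rally passes into register().
    def __init__(self):
        self.param_sources = {}

    def register_param_source_for_name(self, name, param_source_class):
        self.param_sources[name] = param_source_class

def register(registry):
    # The name must match "param-source" in the operation definition.
    registry.register_param_source_for_name(
        "search_random_utterance-source", StubParamSource)

registry = StubRegistry()
register(registry)
source_class = registry.param_sources["search_random_utterance-source"]
print(source_class(params={}).params()["index"])  # document
```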

Challenge:

	{
		"name": "perf-test",
		"description": "Runs a perf test using a param source",
		"schedule": [
			{
				"operation": "search_random_utterance",
				"clients": 15,
				"time-period": 200,
				"target-throughput": 100
			}
		]
	}

Operation:

	{
		"name": "search_random_utterance",
		"operation-type": "search",
		"param-source": "search_random_utterance-source"
	}

Hi,

In terms of Rally your understanding is correct. Rally will attempt to issue the query 100 times per second in total or 6000 times per minute. It might be less than that if your cluster cannot achieve that throughput but it will never be more.
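A quick sanity check of those numbers, using the values from the "perf-test" challenge (nothing here is Rally-specific, just arithmetic):

```python
# Values from the "perf-test" challenge above.
target_throughput = 100   # ops/sec across all clients
time_period = 200         # seconds
clients = 15

total_queries = target_throughput * time_period   # queries over the whole run
ops_per_minute = target_throughput * 60           # what Rally aims for per minute
ops_per_client = target_throughput / clients      # roughly what each client issues

print(total_queries)    # 20000
print(ops_per_minute)   # 6000
```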

Unfortunately, I don't know anything about CloudWatch and how it works, but I guess it also issues calls against the Elasticsearch stats APIs (e.g. node stats) to get its monitoring data. However, I'd expect those calls to be fairly infrequent, maybe on the order of one call per second. Rally itself also issues a few API calls at the beginning of the benchmark, e.g. to determine the Elasticsearch version, but this does not get you anywhere near that number either.

The node usage API might help you understand what type of requests are issued which might lead you towards discovering what is causing that traffic.

Daniel

How many shards do you have for the "document" index?


If you have 5 shards, that would fit with the ~30K search rate, and the rest of the discrepancy could simply be due to statistical/sampling "artefacts": which statistic, for which time range, for which period, over which dimension... plus the peculiarities that can be specific to each metric, like the effective sampling rate, the source of truth, intermediate aggregations, the frequency of pull/push from the source of truth, potential sampling distortions, etc.

It's hard to weigh the possibilities for sure since you haven't provided some details: how many ES nodes you have, how many shards were targeted by your query, where the shards were when the queries ran, and which dimension over which period over which time range you checked (i.e. how exactly you obtained this 34.5K figure from CloudWatch).

In short, once the shard-count multiplier is understood, if it applies in your case, the rest is more easily explained, because it's no longer a multiplication by 5 of what you expected to see.
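The shard-count arithmetic can be sketched as follows (the shard count of 5 is an assumption here: it is the default number of primary shards in Elasticsearch 6.x, but what matters is the actual shard count of the "document" index):

```python
rally_ops_per_min = 100 * 60   # 6000 queries/min at Rally's target throughput
shards = 5                     # assumption: ES 6.x default primary shard count

# If the monitoring metric counts shard-level search operations, each query
# fans out to every shard it targets and is counted once per shard.
shard_level_rate = rally_ops_per_min * shards

observed = 34_500
print(shard_level_rate)              # 30000
print(observed / rally_ops_per_min)  # 5.75
```

The observed ratio of ~5.75 is close to 5, with the remainder plausibly explained by the sampling/aggregation effects mentioned above.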

If that's not it, I would recommend asking AWS; something can explain such a discrepancy for sure, and if it's not the shard-count angle, it's something else.

Last but not least, if you're surprised by the shard-count multiplier effect in this metric, it's not THAT weird... lots of metrics in ES behave like this. For example, when you check the ES query slow log, queries are logged per shard, so the same "multiplier" applies in that context too.