IIRC, esrally runs against an index rather than against individual docs, hence the need to rely on update_by_query. This index/update_by_query approach is acceptable for me and would provide comparable results. In my experiment, I load the original docs first and then run esrally with a corpus of incremental updates that are driven through stored scripts (there is a matching index field between the original and incremental docs).
However, esrally seems to support only the painless script "search" operation out of the box. I couldn't find anything for update or bulk update.
If I understood correctly, you'd like to use update_by_query, which is currently not supported as a native operation in Rally.
You can easily create a custom runner to build this new operation. You'd directly use the update_by_query method from the Elasticsearch Python client. I haven't tested it, but I believe you'd just specify your query and script in the body, as shown e.g. in the Elasticsearch example.
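As a rough sketch (untested, and the runner name and parameter names like `"index"` and `"script-id"` are my own assumptions rather than Rally built-ins; only `es.update_by_query` itself is the real elasticsearch-py API), such a custom runner could look like:

```python
# track.py -- hypothetical custom runner for _update_by_query (a sketch,
# not a definitive implementation; parameter names are assumptions).

async def update_by_query_runner(es, params):
    """Run _update_by_query with a stored script against the target index."""
    response = await es.update_by_query(
        index=params["index"],
        body={
            "query": params.get("query", {"match_all": {}}),
            "script": {
                "id": params["script-id"],
                "params": params.get("script-params", {}),
            },
        },
    )
    # Report the number of updated docs so Rally can compute per-doc throughput.
    return {"weight": response["updated"], "unit": "docs"}


def register(registry):
    # Hook that Rally invokes when loading the track's custom runners.
    registry.register_runner("update-by-query", update_by_query_runner,
                             async_runner=True)
```

You'd then reference the registered name as the operation-type in your track.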
Finally, if you don't want to use a custom runner (and/or if a dedicated API call is missing from the elasticsearch-py client), you can always use the raw-request operation to invoke any ES REST API.
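For instance, an update_by_query call via raw-request might look like the following track operation (the index name, query, and script id are placeholders; check the Rally docs for the full list of raw-request properties):

```json
{
  "name": "update-by-query-via-raw-request",
  "operation-type": "raw-request",
  "method": "POST",
  "path": "/my-index/_update_by_query",
  "body": {
    "query": { "match_all": {} },
    "script": { "id": "incr-update-script" }
  }
}
```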
Thanks @dliappis for the great pointers! One follow-up question: currently our application performs bulk painless updates. I was curious whether both of the approaches you suggested would let me measure the performance of bulk updates via esrally. I guess if I figure out a way to POST a bulk HTTP REST call via the raw-request or custom runner approach, this may work, but I'm not very sure.
We are not planning to implement a custom runner unless there are no other alternatives for us. raw-request may help us as long as it lets us measure bulk partial updates too.
raw-request will record the usual metrics, and you'll have the individual samples for service_time, latency, error_rate and throughput (in rally-results, as well as individual metric records in the rally-metrics index).
With a custom runner, in addition to those you have the chance to enrich the results with other metrics of your choice; see for instance this example.
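For example (again untested, and exactly which extra keys Rally persists from the returned dict is an assumption based on the custom-runner docs), a bulk-update runner could surface the bulk response's `took` and `errors` fields alongside the standard metrics:

```python
# track.py -- hypothetical custom runner for client-side bulk partial updates.
# A sketch only: "bulk-update" and the params layout are assumptions.

async def bulk_update_runner(es, params):
    """Send a pre-built bulk body of partial updates and report extras."""
    response = await es.bulk(body=params["body"])
    # Besides the mandatory weight/unit, the extra keys returned here are
    # intended to be stored with the metric records for this operation.
    return {
        "weight": len(response["items"]),
        "unit": "docs",
        "took": response["took"],
        "errors": response["errors"],
    }


def register(registry):
    registry.register_runner("bulk-update", bulk_update_runner,
                             async_runner=True)
```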
I am curious whether the custom runner allows me to inject this API. I will explore that path, since it becomes tedious to supply the whole body with raw-request for bulk.
A couple of years ago, I remember esrally used to store the standard benchmark metrics index (rally-metrics-*) in the ES cluster itself. Metrics — Rally 2.0.3 documentation
However, I am no longer seeing that default behavior. How can I store the metrics in a dedicated ES index in the cluster?
Yes, of course; this is the preferred way of using Rally, as having the metrics in an Elasticsearch cluster gives you the possibility to explore the data with Kibana visualizations. See the docs here.
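For reference, pointing Rally at a dedicated metrics cluster is configured in `~/.rally/rally.ini`; a minimal sketch (the host is a placeholder, and you should check the Metrics page of the docs for your Rally version for the exact keys):

```ini
[reporting]
datastore.type = elasticsearch
datastore.host = metrics-cluster.example.org
datastore.port = 9200
datastore.secure = false
datastore.user =
datastore.password =
```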
Also since you mentioned:
store the benchmark standard metrics in one of the indices in ES cluster itself
I wanted to highlight that storing benchmark metrics in the same cluster you are benchmarking is an anti-pattern. The ES cluster you are benchmarking should be doing just that, i.e. receiving only benchmark-related load, not experiencing load from other activities like storing metrics. Instead, your metrics store should be a separate Elasticsearch cluster (it doesn't need to be highly available, or very powerful/large).
Given that there is already a bulk operation in Rally, and that you can specify action-and-metadata lines in the corpora section of your track using the include-action-and-metadata property, you could simply use your example above as your corpus.
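To illustrate, a corpus file of bulk partial updates with action-and-metadata lines might look like the following (the script id, params, and doc ids are placeholders, and whether the property is spelled include- or includes-action-and-metadata should be checked against the track reference for your Rally version):

```json
{ "update": { "_id": "1" } }
{ "script": { "id": "incr-update-script", "params": { "delta": 5 } } }
{ "update": { "_id": "2" } }
{ "script": { "id": "incr-update-script", "params": { "delta": -2 } } }
```

With action-and-metadata enabled on the corpus, Rally sends these lines as the `_bulk` body, so each pair of lines becomes one partial update.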
I came up with the following example:
I have an existing Elasticsearch cluster containing docs like:
Agreed. I will check the cost incurred by this. I am not running anything other than updates, so I'm hoping that the metrics shouldn't skew my results too much. If they do, I will move them to another host.
Correct; I later realized that I didn't mean the "bulk" notion of update covered by update_by_query. Rather, my bulk operation meant bundling multiple update requests at the client and sending them at once.
Awesome!! This is exactly what I was looking for. Thank you so much for this pointer.