Using esrally to measure painless script partial update performance

dineshabbi · January 26, 2021, 5:37pm

Hi team,

I am trying to measure bulk partial update performance. I am wondering if it's possible to do this via esrally at all.

For example, our application uses Painless scripts for partial updates, and each update targets an individual doc since the application knows the _id:

POST index-001/doc/{_id=BlahBlah}
{
// body of partial update message
}

To update a document via esrally, I may have to use the update_by_query API:

POST index-001/_update_by_query
{
  "query": {
    "term": {
      "primaryId": "BlahBlah"
    }
  },
  "script": {
    "source": "abs(ln(abs(doc['population']) + 1) + doc['location'].lon + doc['location'].lat) * _score",
    "lang": "painless"
  }
}

IIRC, esrally runs against an index rather than against individual docs, hence the need to rely on update_by_query. This index-level update_by_query is acceptable to me and would provide comparable results. In my experiment, I load the original docs first and then run esrally with a corpus of incremental updates, which are driven through stored scripts (the original and incremental docs share a matching index field).

However, esrally seems to support only the Painless script "search" operation out of the box. I couldn't find anything for update or bulk update.

      "operation-type" : "search",
      "body": {
        "query": {
          "function_score": {
            "query": {
              "term": {
                "primaryId": "BlahBlah"
              }
            },
            "functions": [
              {
                "script_score": {
                  "script": {
                    "source": "abs(ln(abs(doc['population']) + 1) + doc['location'].lon + doc['location'].lat) * _score",
                    "lang": "painless"
                  }
                }
              }
            ]
          }
        }
      }

Please let me know if there are any other ideas. I am not sure whether custom runners would solve our problem of the nonexistent operation-type; any pointers to a custom runner example would help.

Thanks,

dliappis · January 27, 2021, 8:15am

Hi @dineshabbi,

Thanks for your interest in Rally.

If I understood correctly, you'd like to use update_by_query which is currently not supported as a native operation in Rally.

You can easily create a custom runner to build this new operation. You'd call the update_by_query method from the Elasticsearch Python client directly. I haven't tested it, but I believe you'd just specify your query and script in the body, as shown in the Elasticsearch example.

Finally, if you don't want to use a custom runner (and/or if a dedicated API call is missing from the elasticsearch-py client), you can always use the raw-request operation to invoke any Elasticsearch REST API.
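
A custom runner for this could look roughly like the sketch below. It is untested against a live cluster and rests on a few assumptions: the file is the track's track.py, the operation-type name "update-by-query" and the parameter names "index" and "body" are made up for this example, and the response is assumed to carry the usual "updated" and "version_conflicts" counters.

```python
# track.py, placed next to track.json so that Rally loads it with the track.


async def update_by_query(es, params):
    """Run one _update_by_query request and report how many docs it changed.

    `es` is the async Elasticsearch client that Rally passes to the runner.
    `params` holds the operation's parameters from the track; the names
    "index" and "body" are this sketch's own convention (an assumption).
    """
    response = await es.update_by_query(index=params["index"], body=params["body"])
    # A dict with "weight" and "unit" lets Rally express throughput in
    # updated documents rather than in requests. Extra keys are kept as
    # request meta-data.
    return {
        "weight": response.get("updated", 0),
        "unit": "docs",
        "version_conflicts": response.get("version_conflicts", 0),
    }


def register(registry):
    # Makes "update-by-query" (a hypothetical name) usable as an
    # operation-type in the track's schedule.
    registry.register_runner("update-by-query", update_by_query, async_runner=True)
```

An operation in the track would then set "operation-type" to "update-by-query" and supply "index" and "body" (the query plus the script) as its parameters.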

dineshabbi · January 27, 2021, 11:25am

Thanks @dliappis for the great pointers! One follow-up question: currently our application performs bulk Painless updates. I was curious whether both approaches you suggested would let me measure the performance of bulk updates via esrally. I guess this may work if I figure out a way to POST a bulk HTTP REST call via raw-request or a custom runner, but I am not sure.

We are not planning to implement a custom runner unless there is no other alternative. raw-request may help us as long as it lets us measure bulk partial updates too.

dliappis · January 28, 2021, 12:38pm

raw-request will report the usual metrics, and you'll have the individual samples for service_time, latency, error_rate and throughput (in rally-results as well as individual metric records in the index rally-metrics).

With a custom runner, in addition to those, you have the chance to enrich the results with other metrics of your choice; see for instance this example.

It's worth reading https://esrally.readthedocs.io/en/stable/recipes.html?highlight=result#checking-queries-and-responses as well.

dineshabbi · February 3, 2021, 7:27am

Thanks again @dliappis.

With bulk partial update, what I was looking for was something like this:

POST _bulk
{ "update" : { "_id" : "ID1", "_index" : "t1_item", "retry_on_conflict" : 3} }
{ "script" : { "id": "item-update-script", "params" : {"partialUpdate":{"supplierOfTradeItem":[{"primaryId": "_primaryIdBulk2_","additionalPartyId": [{"value": "_value_","typeCode": "FOR_BULK_USE_1"}],"name": "_name_","isPrimarySupplier": false,"avpList": [{"value": "value","name": "name_bulk2","actionCode":"ADD"},{"value": "value","name": "name_bulk3","actionCode":"ADD"}]}]}}}}

{ "update" : { "_id" : "ID2", "_index" : "t1_item", "retry_on_conflict" : 3} }
{ "script" : { "id": "item-update-script", "params" : {"partialUpdate":{"supplierOfTradeItem":[{"primaryId": "_primaryIdBulk2_","additionalPartyId": [{"value": "_value_","typeCode": "FOR_BULK_USE_1"}],"name": "_name_","isPrimarySupplier": false,"avpList": [{"value": "value","name": "name_bulk2","actionCode":"ADD"},{"value": "value","name": "name_bulk3","actionCode":"ADD"}]}]}}}}

I am curious if the custom runner allows me to inject this API. I will explore that path since it becomes tedious to supply the whole body with raw-request for bulk.

A couple of years ago, I remember esrally used to store the standard benchmark metrics in the index /rally-metrics-*/ in the ES cluster itself.
Metrics — Rally 2.0.3 documentation
However, I am no longer seeing that default behavior. How can I store the metrics in a dedicated ES index in the cluster?

dliappis · February 3, 2021, 9:07am

Yes of course, this is the preferred way of using Rally, as having the metrics in an Elasticsearch cluster gives the possibility to explore data with Kibana visualizations. See the docs here.

Also since you mentioned:

store the benchmark standard metrics in one of the indices in ES cluster itself

I wanted to highlight that storing benchmark metrics in the same cluster you are benchmarking is an anti-pattern. The ES cluster you are benchmarking should be doing just that, i.e. serving only the benchmark workload and not experiencing load from other activities like storing metrics. Instead, your metrics store should be a different Elasticsearch cluster (it doesn't need to be highly available, or very powerful/large).
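
As a rough illustration, the metrics store is selected in the [reporting] section of Rally's configuration file (typically ~/.rally/rally.ini). The sketch below sets those keys with Python's configparser. The function name and the host value in the usage note are placeholders, and rewriting the file this way drops any comments it contained, so editing it by hand is just as valid.

```python
import configparser


def point_rally_at_metrics_store(ini_path, host, port=9200, secure=False,
                                 user="", password=""):
    """Set the [reporting] section of a rally.ini to a separate ES cluster."""
    # interpolation=None keeps values such as passwords containing "%"
    # or "${...}" untouched when reading and writing.
    config = configparser.ConfigParser(interpolation=None)
    config.read(ini_path)  # a missing file is simply skipped
    if not config.has_section("reporting"):
        config.add_section("reporting")
    reporting = config["reporting"]
    reporting["datastore.type"] = "elasticsearch"
    reporting["datastore.host"] = host
    reporting["datastore.port"] = str(port)
    reporting["datastore.secure"] = str(secure)
    reporting["datastore.user"] = user
    reporting["datastore.password"] = password
    with open(ini_path, "w", encoding="utf-8") as f:
        config.write(f)
```

For example, point_rally_at_metrics_store("/home/me/.rally/rally.ini", "metrics.example.org") would point Rally at a hypothetical dedicated metrics cluster while leaving the other sections of the file in place.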

Bulk updates like these are different from what we were discussing earlier, which was the _update_by_query API.

Given that there is already a bulk operation in Rally and that you can supply action-and-meta-data lines in the corpora section of your track using the includes-action-and-meta-data property, you could simply use your example above as your corpus.

I came up with the following example:

I have an existing Elasticsearch cluster containing docs like:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "logs-181998",
        "_type" : "_doc",
        "_id" : "oR7zZncBQLUmKToTl7Tr",
        "_score" : 1.0,
        "_source" : {
          "@timestamp" : 893999489,
          "clientip" : "253.27.0.0",
          "request" : "GET /images/hm_anime_e.gif HTTP/1.0",
          "status" : 200,
          "size" : 15609
        }
      },
      {
        "_index" : "logs-181998",
        "_type" : "_doc",
        "_id" : "Wh7zZncBQLUmKToTl5Xe",
        "_score" : 1.0,
        "_source" : {
          "@timestamp" : 894196437,
          "clientip" : "39.164.0.0",
          "request" : "GET /english/images/news_btn_kits_off.gif HTTP/1.1",
          "status" : 200,
          "size" : 933
        }
      }
    ]
  }
}

Then I created the following simple Rally track updatescripttrack to update these two docs to show status 404 instead of 200:

The corpora:

me@server ~/updatescripttrack $ cat documents.json 
{"update": {"_id": "oR7zZncBQLUmKToTl7Tr", "_index" : "logs-181998" } }
{"script": "ctx._source.status = 404" }
{"update": {"_id": "Wh7zZncBQLUmKToTl5Xe", "_index" : "logs-181998" } }
{"script": "ctx._source.status = 404" }
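
For more than a handful of docs, a corpus file in this format could be generated with a short script. The following is only a sketch: the function name is made up, and the script argument is written out as-is, so it can be an inline script string as above or a dict such as {"id": "item-update-script", "params": {...}} for a stored script.

```python
import json


def write_update_corpus(path, index, doc_ids, script):
    """Write one action-and-meta-data line plus one script line per doc id.

    Returns the number of update pairs written.
    """
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for doc_id in doc_ids:
            action = {"update": {"_id": doc_id, "_index": index}}
            f.write(json.dumps(action) + "\n")
            f.write(json.dumps({"script": script}) + "\n")
            count += 1
    return count
```

The returned count is the number of updates rather than the number of lines, which matches how document-count is set for the two-update file in this example.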

The track itself:

me@server ~/updatescripttrack $ cat track.json 
{
  "version": 2,
  "description": "Test Rally with update script workload",
  "indices": [
    {
      "name": "logs-181998"
    }
  ],
  "corpora": [
    {
      "name": "testupdate",
      "documents": [
        {
          "source-file": "documents.json",
          "includes-action-and-meta-data": true,
          "document-count": 2
        }
      ]
    }
  ],
  "schedule": [
    {
      "operation": {
        "operation-type": "bulk",
        "bulk-size": 1
      }
    }
  ]
}

Which I then ran using:

esrally --pipeline=benchmark-only --track-path=~/updatescripttrack --on-error=abort

which successfully did the change:

GET logs-181998/_search
{
  "query": {
    "terms": {
      "_id": ["oR7zZncBQLUmKToTl7Tr", "Wh7zZncBQLUmKToTl5Xe"]
    }
  }
}

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "logs-181998",
        "_type" : "_doc",
        "_id" : "oR7zZncBQLUmKToTl7Tr",
        "_score" : 1.0,
        "_source" : {
          "@timestamp" : 893999489,
          "clientip" : "253.27.0.0",
          "request" : "GET /images/hm_anime_e.gif HTTP/1.0",
          "status" : 404,
          "size" : 15609
        }
      },
      {
        "_index" : "logs-181998",
        "_type" : "_doc",
        "_id" : "Wh7zZncBQLUmKToTl5Xe",
        "_score" : 1.0,
        "_source" : {
          "@timestamp" : 894196437,
          "clientip" : "39.164.0.0",
          "request" : "GET /english/images/news_btn_kits_off.gif HTTP/1.1",
          "status" : 404,
          "size" : 933
        }
      }
    ]
  }
}

Rally will report the standard bulk operation metrics as with any other bulk operation.
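
When validating such a corpus outside Rally (for instance by sending the same lines with the Python client), keep in mind that the _bulk API answers with HTTP 200 even if individual updates fail, e.g. on a version conflict. A small helper like the one below, whose name is invented for this sketch, can list the failed items from a parsed bulk response.

```python
def failed_bulk_items(bulk_response):
    """Return (doc id, HTTP status, error type) for each failed bulk item."""
    failures = []
    # The top-level "errors" flag is true if at least one item failed.
    if not bulk_response.get("errors"):
        return failures
    for item in bulk_response.get("items", []):
        # Each item has a single key naming its action:
        # index, create, update or delete.
        for result in item.values():
            if "error" in result:
                failures.append(
                    (result.get("_id"), result.get("status"), result["error"].get("type"))
                )
    return failures
```

An empty list means every update in the request was applied.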

dineshabbi · February 3, 2021, 9:42am

Agreed. I will check the cost this incurs. I am not running anything other than updates, so I am hoping the metrics won't skew my results too much. If they do, I will move them to another host.

Correct. I later realized that I didn't mean to ask about "bulk" in the update_by_query sense; rather, by bulk operation I meant bundling multiple update requests at the client and sending them at once.

Awesome!! This is exactly what I was looking for. Thank you so much for this pointer.

system · March 3, 2021, 9:42am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
