Questions about custom tracks

Hi,

The last time I tried to upgrade my production ES instance from 5.1.2 to 6.1.0, I ran into performance issues afterwards.

Before the next rollout, I would like to run some load tests to make sure that the changes from my migration / integration don't cause any harm.

I got the hint to use Rally, and here I am.
I would like to test as close to reality as possible while spending as little time as possible on it :wink:

We are indexing log files, splitting them into fields, and we have multiple extensive dashboards that are visualized via Kibana.

Is the assumption correct that the kind of data (number of indices, number of docs per index, number of shards, number of fields per index, number of fields per type, and of course the user queries via Kibana) significantly affects the speed of my node / cluster?

If so, I assume I need to build my own track.

  • Is it possible to export production data and use it as document input for Rally? How?
  • How do I deal with the user queries? Is there an easy way to take the queries, e.g. from Kibana logs, and provide them as a challenge? How?

In the meantime I will continue reading the docs (I am not completely through yet) and get familiar with the default tracks.

Thanks, Andreas

Sounds like the eventdata track could be a good starting point for you. Note that this track generates data on the fly though (normally we bulk-index an already existing data set).

Yes.

The closer your model is to your production workload, the more useful the numbers you get out of it will be. So I'd agree here.

Rally needs a line-delimited file that is in a bulk-friendly format. There is no tooling in Rally for this, but you can use a small script to get the data out of Elasticsearch and write it into a line-delimited JSON file. You could base your script on something like the CLI for elasticsearch-py helpers on GitHub.
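
Just to give you an idea of the target format (the field names below are made up on my side), such a file simply contains one JSON document per line:

{"@timestamp": "2018-01-15T10:00:00Z", "host": "web-01", "message": "GET /search HTTP/1.1 200"}
{"@timestamp": "2018-01-15T10:00:01Z", "host": "web-02", "message": "GET /products HTTP/1.1 404"}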

Not at the moment, but we plan to add some support for that soon. The idea is that you use Elasticsearch's slow log to generate the query workload; for details see #262 on GitHub.

I'd choose the queries that are most important to you, determine your current workload (e.g. by turning the slowlog threshold down to zero for a little while) and then model the number of clients and the target throughput accordingly. Modelling client arrivals gets quite important if you want to simulate several clients. For independent clients, Poisson-based arrivals are a good choice (see the docs for details).
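
To sketch what I mean (totally untested, the index and query are just placeholders, and please double-check the property names against the track reference for your Rally version), a query task could look roughly like this:

{
  "operation": {
    "operation-type": "search",
    "index": "logs-*",
    "body": {
      "query": {
        "match": {
          "message": "error"
        }
      }
    }
  },
  "clients": 4,
  "warmup-time-period": 120,
  "time-period": 600,
  "target-throughput": 10,
  "schedule": "poisson"
}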

You might also want to watch:

(both are free to watch but require prior registration).

At this year's ElasticON, I'll also give a talk about benchmarking pitfalls, called The Seven Deadly Sins of Elasticsearch Benchmarking. I don't know for sure, but I'd expect that the contents will be available a little while after the conference (as was the case in previous years).

Thanks for the answer. The first video you mentioned was also very interesting. It looks like the part I need, just with our queries and data.

Another idea came to mind about how to create the input JSON: I could use Logstash and output the data as JSON lines to a file. That should be compatible, right?

When I use a file as input, is it also possible to throttle the ingestion speed, so that I could set the throughput to match production?

Is it possible to import from multiple files simultaneously into different indices / types?

Then hopefully I could replay the production load.

I will read through the eventdata track and hope to understand what Christian is doing there. (I am not familiar with Python yet :frowning: )

I would think so. As long as you have one line per document you should be good. You can decide to include the action and meta-data line in that file directly or have Rally generate it for you on the fly (see the track reference for includes-action-and-meta-data).
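
To illustrate the first option (index, type and field names are made up), the file would then alternate between action and meta-data lines and document lines, and you'd set includes-action-and-meta-data to true in the track's documents definition:

{"index": {"_index": "logs", "_type": "doc"}}
{"@timestamp": "2018-01-15T10:00:00Z", "host": "web-01", "message": "GET /search HTTP/1.1 200"}
{"index": {"_index": "logs", "_type": "doc"}}
{"@timestamp": "2018-01-15T10:00:01Z", "host": "web-02", "message": "GET /products HTTP/1.1 404"}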

Yes, you can set target-throughput to throttle it. One caveat: The target throughput is operation-agnostic (i.e. it does not know about bulks) and is measured in requests per second (not documents per second).
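
Just to illustrate the arithmetic: with a bulk-size of 5000 and a target-throughput of 2, Rally aims for 2 bulk requests per second, i.e. roughly 10,000 documents per second.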

Yes. By default, Rally will just ingest all data files that you provide (one after the other). If you want more control over that process, you can leverage the parallel element, but then you need to explicitly specify an index list for your bulk operations. Otherwise Rally would ingest all your indices twice (remember: ingesting everything is the default behavior). I'm thinking of something along the lines of:

{
  "schedule": [
    {
      "parallel": {
        "tasks": [
          {
            "operation": {
              "operation-type": "bulk",
              "bulk-size": 5000,
              "indices": ["logs"]
            },
            "warmup-time-period": 120,
            "clients": 8,
            "target-throughput": 1
          },
          {
            "operation": {
              "operation-type": "bulk",
              "bulk-size": 2000,
              "indices": ["users"]
            },
            "warmup-time-period": 120,
            "clients": 4,
            "target-throughput": 2
          }
        ]
      }
    }
  ]
}

(beware: totally untested)

This track is quite advanced and Christian really pushed Rally to the edge :slight_smile:

If you have static data, then you can probably just model it as a regular track and don't need any Python at all. One more tip: I'd stick to the examples in the docs and not base your track on the standard Rally tracks, because they implement a few workarounds so that users with older versions of Rally can still run them. That makes them a bit harder to understand than the examples in the docs, which only target a specific Rally version.
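
As a very rough starting point (completely untested; file names, the document count and the mapping file are placeholders, and the exact track format depends on your Rally version, so please double-check against the track reference), a minimal track for static log data could look like this:

{
  "description": "Log data exported from production",
  "indices": [
    {
      "name": "logs",
      "body": "index.json"
    }
  ],
  "corpora": [
    {
      "name": "logs",
      "documents": [
        {
          "source-file": "documents.json",
          "document-count": 1000000
        }
      ]
    }
  ],
  "schedule": [
    {
      "operation": {
        "operation-type": "delete-index"
      }
    },
    {
      "operation": {
        "operation-type": "create-index"
      }
    },
    {
      "operation": {
        "operation-type": "bulk",
        "bulk-size": 5000
      },
      "warmup-time-period": 120,
      "clients": 8
    }
  ]
}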
