Is this rally result valid?

Hello,

We have deployed Rally (0.4.5) to measure our existing cluster (external car), consisting of 3 dedicated data nodes and 2 ingest/client nodes. We are indexing 5 million Apache log documents, already in JSON format, as the test data, and we're using a geoip ingest pipeline to derive location data from the IP address field. The ES cluster (5.0.1) was set up just for this benchmark.

The track.json we used is shown below, and esrally is invoked with the following command:
esrally --track=apache --offline --target-hosts=10.0.0.180:9200,10.0.0.181:9200 --pipeline=benchmark-only

We ran two sets of indexing tests to compare pipeline performance, using the same data and environment:

  1. Without the pipeline in place: around 18k docs/s median throughput.
  2. With the pipeline in place (which contains only a geoip processor): around 9k docs/s median throughput.

So based on these results, indexing throughput was cut in half purely because of the pipeline, which seems very strange, as we assumed geoip is a common, widely used processor.

We'd like some input on whether there is any Rally configuration we might have missed that could cause this big performance setback, before we start suspecting the ingest geoip plugin or something else.

Are we doing it right?

track.json:

{
  "meta": {
    "short-description": "Apache Logging benchmark",
    "description": "This benchmark indexes Apache server log data. Data-url below is a dummy as offline data is used instead",
    "data-url": "http://benchmarks.elasticsearch.org.s3.amazonaws.com/corpora/apache"
  },
  "indices": [
    {
      "name": "apachelog",
      "types": [
        {
          "name": "type",
          "mapping": "mappings.json",
          "documents": "apachelog.bz2",
          "document-count": 5000000,
          "compressed-bytes": 188542509,
          "uncompressed-bytes": 2263006234
        }
      ]
    }
  ],
  "operations": [
    {
      "name": "index-append",
      "operation-type": "index",
      "bulk-size": 8000,
      "pipeline": "pl_clickstream"
    },
    {
      "name": "query-match-all",
      "operation-type": "search",
      "body": {
        "query": {
          "match_all": {}
        }
      }
    }
  ],
  "challenges": [
    {
      "name": "append-no-conflicts",
      "description": "Indexes the whole document corpus using Elasticsearch settings.",
      "index-settings": {
        "index.number_of_shards": 6,
        "index.number_of_replicas": 1
      },
      "schedule": [
        {
          "parallel":
          {
            "clients": 2,
            "tasks": [
              {
                "operation": "index-append",
                "warmup-time-period": 120
              },
              {
                "operation": "query-match-all",
                "warmup-iterations": 1000,
                "iterations": 1000,
                "target-throughput": 100
              }
            ]
          }
        }
      ]
    }
  ]
}
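
For completeness, the `pl_clickstream` pipeline referenced in the track contains only a geoip processor and was registered before the benchmark, roughly like this (the source field name `clientip` is an assumption; adjust it to whatever field the actual documents use):

```json
PUT _ingest/pipeline/pl_clickstream
{
  "description": "Add geoip location data to Apache log documents",
  "processors": [
    {
      "geoip": {
        "field": "clientip",
        "target_field": "geoip"
      }
    }
  ]
}
```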

Hi @obudiman,

at first glance, your setup and the track both look fine.

I'll try to reproduce this behavior locally, but that will take a little time, so bear with me; I'll get back to you with my findings.

Daniel

Hi @obudiman,

after a closer look, I think you do have a mistake in your track. You have defined an ingest pipeline for the bulk operation, but how did you create it? If you created it before you ran the benchmark, be aware that by default Rally wipes all indices that are part of the benchmark, because it needs to ensure a consistent state. I fear the benchmark is producing tons of exceptions on the server due to the missing pipeline, and that is the slowdown you are seeing.

You can do two things:

  • Write a custom runner, schedule it as the first operation, and create the ingest pipeline there (see the docs on how to write a custom runner). I fear this will not be easily doable, because Rally currently uses version 2.3 of the Elasticsearch Python client, and ingest pipelines are a new feature in 5.0.
  • Set the index to auto-managed=false (see the Github ticket) and create the index and the ingest pipeline yourself. This is a new feature in Rally 0.4.6, which I have just released, so you can proceed in that direction.
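
For the second option, the index definition in the track would carry the new flag, along these lines (a sketch only; check the Github ticket for the exact property name and semantics in 0.4.6):

```json
"indices": [
  {
    "name": "apachelog",
    "auto-managed": false,
    "types": [ ... ]
  }
]
```

With this flag set, Rally neither creates nor wipes the index, so the index and the pipeline you created manually survive the benchmark run.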

By the way, with 0.4.6 you can also remove the dummy data-url parameter, as I've loosened some constraints.

It would be great if you could report your results.

Daniel

Hi Daniel,

Thanks for checking on this. I created and registered the pipeline before running the benchmark (and confirmed that it works fine when tested manually with a small set of the same documents).
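
For reference, the manual check was a simulate call against the pipeline, something like this (the sample document below is illustrative, and `clientip` stands in for our actual IP field):

```json
POST _ingest/pipeline/pl_clickstream/_simulate
{
  "docs": [
    { "_source": { "clientip": "8.8.8.8" } }
  ]
}
```

The response showed the `geoip` field populated as expected.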

I can confirm that Rally did use the pipeline and it worked just fine: all the final documents were indexed properly, with all the processing we defined in the pipeline applied (the geoip data is there). So I don't think Rally wiped the pipeline. Note again that we ran Rally against an existing cluster; it is not provisioned by Rally, so perhaps the behaviour differs in that case.

[INFO] Racing on track [apache], challenge [append-no-conflicts] and car [external]

In this case, since we can already confirm the pipeline is working, I don't suppose the two suggestions apply, do they?

I'd appreciate any further suggestions on this,

Oswin

Hi Oswin,

if you've defined the ingest pipeline independently of the index, then all should be fine, that's correct. Rally will not touch the pipeline in that case.

Since that settles the Rally-specific part, let's continue the discussion in the other topic you've opened.

Daniel

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.