Hello,
We have deployed Rally (0.4.5) to benchmark our existing cluster (external car), which consists of 3 dedicated data nodes and 2 ingest/client nodes. As test data we are indexing 5 million Apache log documents, already in JSON format, and we use a geoip ingest pipeline to look up location data from the IP address field. The ES cluster (5.0.1) was set up just for this benchmark.
The track.json we used is shown below, and esrally is invoked with the following command:
esrally --track=apache --offline --target-hosts=10.0.0.180:9200,10.0.0.181:9200 --pipeline=benchmark-only
We ran two sets of indexing tests to compare pipeline performance, using the same data and environment:
- Without pipeline in place: we got around 18k docs/s median throughput
- With the pipeline in place (which only has geoip processor): we had 9k docs/s median throughput.
So based on these results, the pipeline alone appears to halve indexing throughput, which seems very strange, as we assumed geoip is a common, widely used processor.
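To put the two medians in perspective, here is a rough back-of-the-envelope sketch in Python. It only uses the numbers quoted above (5 million documents, 18k vs. 9k docs/s); the function name `summarize` is ours, and the "extra time per document" is an aggregate figure across all indexing clients and nodes, not a measured per-request latency.

```python
def summarize(docs, tput_base, tput_pipeline):
    """Compare two median indexing throughputs (docs/s) for a corpus of `docs` documents."""
    return {
        # Fraction of the baseline throughput that remains with the pipeline enabled
        "ratio": tput_pipeline / tput_base,
        # Approximate wall-clock time to index the whole corpus at each rate, in seconds
        "seconds_base": docs / tput_base,
        "seconds_pipeline": docs / tput_pipeline,
        # Additional aggregate time per document with the pipeline, in microseconds
        "extra_us_per_doc": (1 / tput_pipeline - 1 / tput_base) * 1e6,
    }


result = summarize(docs=5_000_000, tput_base=18_000, tput_pipeline=9_000)
for key, value in result.items():
    print(f"{key}: {value:.2f}")
```

So at these rates the full corpus would take roughly 278 s without the pipeline and roughly 556 s with it, i.e. about 56 microseconds of extra aggregate time per document.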
Before we start suspecting the ingest geoip plugin or something else, we'd like to know whether we have missed any Rally configuration that could cause such a big performance drop.
Are we doing it right?
track.json:
{
  "meta": {
    "short-description": "Apache Logging benchmark",
    "description": "This benchmark indexes Apache server log data. Data-url below is a dummy as offline data is used instead",
    "data-url": "http://benchmarks.elasticsearch.org.s3.amazonaws.com/corpora/apache"
  },
  "indices": [
    {
      "name": "apachelog",
      "types": [
        {
          "name": "type",
          "mapping": "mappings.json",
          "documents": "apachelog.bz2",
          "document-count": 5000000,
          "compressed-bytes": 188542509,
          "uncompressed-bytes": 2263006234
        }
      ]
    }
  ],
  "operations": [
    {
      "name": "index-append",
      "operation-type": "index",
      "bulk-size": 8000,
      "pipeline": "pl_clickstream"
    },
    {
      "name": "query-match-all",
      "operation-type": "search",
      "body": {
        "query": {
          "match_all": {}
        }
      }
    }
  ],
  "challenges": [
    {
      "name": "append-no-conflicts",
      "description": "Indexes the whole document corpus using Elasticsearch settings.",
      "index-settings": {
        "index.number_of_shards": 6,
        "index.number_of_replicas": 1
      },
      "schedule": [
        {
          "parallel": {
            "clients": 2,
            "tasks": [
              {
                "operation": "index-append",
                "warmup-time-period": 120
              },
              {
                "operation": "query-match-all",
                "warmup-iterations": 1000,
                "iterations": 1000,
                "target-throughput": 100
              }
            ]
          }
        }
      ]
    }
  ]
}
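
For clarity on how the two runs differ: the only intended difference is the "pipeline" key on the index-append operation. A minimal Python sketch of deriving the no-pipeline variant from the track is below. The helper name `without_pipeline` is purely illustrative and not part of Rally, and the `track` dict is a trimmed-down stand-in that contains only the operations block from the track.json above.

```python
import copy
import json


def without_pipeline(track):
    """Return a copy of a track dict with the "pipeline" key removed from every operation."""
    variant = copy.deepcopy(track)  # leave the original track untouched
    for operation in variant.get("operations", []):
        operation.pop("pipeline", None)  # no-op for operations without a pipeline
    return variant


# Stand-in for the full track.json: only the operations block is reproduced here.
track = {
    "operations": [
        {
            "name": "index-append",
            "operation-type": "index",
            "bulk-size": 8000,
            "pipeline": "pl_clickstream",
        },
        {
            "name": "query-match-all",
            "operation-type": "search",
            "body": {"query": {"match_all": {}}},
        },
    ]
}

baseline_track = without_pipeline(track)
print(json.dumps(baseline_track["operations"][0], indent=2))
```

Everything else (bulk size, shard and replica counts, client count, schedule) stays identical between the two runs.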