Ingest-geoip plugin performance


#1

Hello,

We're currently benchmarking our existing cluster (using Rally) by bulk indexing through an ingest pipeline that contains only a geoip processor. We noticed considerably lower indexing performance (around 50% slower) compared to indexing without any pipeline processing. We're not yet sure whether this is due to the benchmark setup (I've opened a separate topic for that), but we're wondering if a performance hit of this size is normal when using the geoip ingest plugin.

Just a brief summary of the test: we're indexing 5 million JSON documents into a single index on 3 data nodes and 2 dedicated ingest nodes. The index has 6 shards and the bulk size is set to 5000. We have also confirmed that the data is indexed properly, with the geoip fields populated as expected after the test, so there is no issue with the pipeline definition.

We did notice that CPU and memory utilization on the ingest nodes is very low during the indexing test: it averages about 10% CPU usage (8 cores) and 35% JVM heap usage (8GB heap size). So while there might be other ways to tune the indexing speed, we're curious to know what the typical indexing performance hit is when using the geoip ingest plugin, considering it should be one of the more commonly used plugins; a 50% indexing performance hit seems a bit too much.

Appreciate any input,

Oswin


(David Pilato) #2

I have no idea what we can expect, but I think a fair comparison would be against Logstash, to see whether ingest throughput is better with node ingest or not.

I'm thinking of running such benchmarks myself soon, as I get similar questions very often at conferences.

If you do such a test, please share it! :)


(Daniel Mitterdorfer) #3

Hi @obudiman,

I ran a subset of our official logging track on a single node with 4GB heap (Elasticsearch 5.0.1 and the ingest-geoip plugin).

I used the following pipeline definition:

{
  "description": "_description",
  "processors": [
    {
      "geoip": {
        "field": "clientip",
        "properties": ["city_name", "country_iso_code", "country_name"]
      }
    }
  ]
}

Without the pipeline I get a median indexing throughput of 74053 docs/s. With the pipeline I get 45528 docs/s. So in my scenario the slowdown is significant but not 50% (but this depends on a lot of factors and I benchmarked a slightly different scenario than yours).
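
As a quick check on the two numbers above, the relative drop can be computed directly. This is only a small arithmetic sketch; the function name `relative_slowdown` is made up for illustration.

```python
# Median indexing throughput quoted above, in docs/s.
baseline = 74053       # without the ingest pipeline
with_pipeline = 45528  # with the geoip pipeline


def relative_slowdown(baseline: float, measured: float) -> float:
    """Return the fraction of baseline throughput that was lost."""
    return (baseline - measured) / baseline


slowdown = relative_slowdown(baseline, with_pipeline)
print(f"throughput drop: {slowdown:.1%}")  # prints "throughput drop: 38.5%"
```

So in this run the pipeline costs a bit under 40% of the baseline throughput.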

The profiler shows that the geoip ingest processor takes up around 20% of the runtime profile.

Note that the geoip ingest plugin is rather new so it's likely that performance will improve in later versions.

Daniel


(Daniel Mitterdorfer) #4

Hi,

I've created issue #22074 in the Elasticsearch repo and will work on improving the situation.

Daniel


#5

@dadoonet Well, we did a quick comparison with Logstash by ingesting the same 5 million documents through Filebeat. We used the same bulk size of 8000, and only the geoip filter is used in Logstash.

Logstash Test: Filebeat > Logstash > ES
Ingest Test: Filebeat > Ingest > ES

And we got around 10% faster indexing time with Logstash. Then again, we're not sure whether having multiple Filebeat instances send docs directly to the ingest node would improve throughput, but it doesn't look like that would change the fact that geoip processing has a significant impact on throughput.

@danielmitterdorfer Thanks, appreciate the test and the issue created.


#6

Just for reference, here is the geoip ingest pipeline definition we've been using:

PUT /_ingest/pipeline/pl_clickstream
{
    "description" : "Parse apache clickstream log",
    "processors" : [
      {
        "geoip" : {
          "field" : "clientip",
          "properties" : [
            "continent_name",
            "country_name",
            "region_name",
            "city_name",
            "location"
          ]
        }
      }
  ]
}
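
For anyone who wants to check what this pipeline produces for a single document without indexing anything, the ingest simulate API can be called with a sample `_source`. The snippet below is a minimal sketch that only builds the request path and body; the helper name `build_simulate_request` is hypothetical, and no HTTP call is made.

```python
import json

PIPELINE_ID = "pl_clickstream"  # pipeline name from the definition above


def build_simulate_request(pipeline_id, sources):
    """Build the path and JSON body for a simulate call against one pipeline.

    `sources` is a list of dicts, each used as the `_source` of a test document.
    """
    path = f"/_ingest/pipeline/{pipeline_id}/_simulate"
    body = {"docs": [{"_source": source} for source in sources]}
    return path, body


# One sample document; the IP is taken from the input lines shown in this thread.
path, body = build_simulate_request(PIPELINE_ID, [{"clientip": "46.68.2.250"}])
print("POST", path)
print(json.dumps(body, indent=2))
```

The printed path and body can be pasted into a console or sent with any HTTP client.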

#7

@danielmitterdorfer I just happened to see the issue thread, and ingest-user-agent was mentioned there. We actually ran another Rally benchmark with that ingest plugin as well and found around a 35% indexing performance hit with it. The weirdest part is that the "agent" source field value is exactly the same across all 5 million docs; we thought that would make it much faster, but apparently not.

Here are a few lines of the input docs for reference:

{"@timestamp":"2016-12-08T23:54:57.272+0000","clientip":"46.68.2.250","response":"200","request":"/or.CONDITIONS?os_destination\u003dlimitations/or.See\u0026xxx\u003dto\u0026yyy\u003dLicense\u0026id\u003d154","bytes":"15651","referer":"/pages/editpage.action","agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/601.7.7 (KHTML, like Gecko) Version/9.1.2 Safari/601.7.7","uid":"d19daaa2-96d7-4f46-a5b1-26f97e0157af","pagetime":"689"}
{"@timestamp":"2016-12-08T23:54:57.272+0000","clientip":"67.59.33.139","response":"200","request":"/Unless.OF?os_destination\u003dWARRANTIES/and.permissions\u0026xxx\u003dexpress\u0026yyy\u003dunder\u0026id\u003d345","bytes":"15651","referer":"/pages/editpage.action","agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/601.7.7 (KHTML, like Gecko) Version/9.1.2 Safari/601.7.7","uid":"7786045c-fe6d-48e6-b183-d5f7b12c215e","pagetime":"615"}
{"@timestamp":"2016-12-08T23:54:57.273+0000","clientip":"33.17.149.6","response":"200","request":"/BASIS.writing?os_destination\u003dan/BASIS.or\u0026xxx\u003dan\u0026yyy\u003dfor\u0026id\u003d193","bytes":"15651","referer":"/pages/editpage.action","agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/601.7.7 (KHTML, like Gecko) Version/9.1.2 Safari/601.7.7","uid":"76cbbc43-746c-43ac-95ea-c7563e00a299","pagetime":"304"}
{"@timestamp":"2016-12-08T23:54:57.273+0000","clientip":"44.24.40.67","response":"200","request":"/BASIS.OF?os_destination\u003dWITHOUT/distributed.License\u0026xxx\u003ddistributed\u0026yyy\u003din\u0026id\u003d207","bytes":"15651","referer":"/pages/editpage.action","agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/601.7.7 (KHTML, like Gecko) Version/9.1.2 Safari/601.7.7","uid":"3cfdf2ed-f309-40ad-9af4-7f965b7866c2","pagetime":"792"}
{"@timestamp":"2016-12-08T23:54:57.273+0000","clientip":"178.85.183.115","response":"200","request":"/language.AS?os_destination\u003dor/an.language\u0026xxx\u003dANY\u0026yyy\u003drequired\u0026id\u003d9","bytes":"15651","referer":"/pages/editpage.action","agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/601.7.7 (KHTML, like Gecko) Version/9.1.2 Safari/601.7.7","uid":"20a214e5-bf8f-4e97-8808-f48c8100b660","pagetime":"200"}
{"@timestamp":"2016-12-08T23:54:57.273+0000","clientip":"21.203.59.189","response":"200","request":"/implied.law?os_destination\u003dKIND/required.the\u0026xxx\u003dsoftware\u0026yyy\u003dUnless\u0026id\u003d185","bytes":"15651","referer":"/pages/editpage.action","agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/601.7.7 (KHTML, like Gecko) Version/9.1.2 Safari/601.7.7","uid":"c1d27c6c-efd7-40f4-9777-572847b34597","pagetime":"19"}
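
To judge how repetitive each field is (and so how much a lookup cache could help), one option is to count distinct values per field over the input file. The following is a rough sketch under stated assumptions: the sample is reduced to the `clientip` and `agent` fields of three of the lines above, and `distinct_counts` is a hypothetical helper, not part of Rally or Elasticsearch.

```python
import json

AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/601.7.7 "
    "(KHTML, like Gecko) Version/9.1.2 Safari/601.7.7"
)

# Reduced stand-in for the newline-delimited JSON input (assumption: only two fields kept).
sample_lines = [
    json.dumps({"clientip": ip, "agent": AGENT})
    for ip in ["46.68.2.250", "67.59.33.139", "33.17.149.6"]
]


def distinct_counts(lines, fields):
    """Count distinct values for each requested field across JSON lines."""
    seen = {field: set() for field in fields}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        doc = json.loads(line)
        for field in fields:
            if field in doc:
                seen[field].add(doc[field])
    return {field: len(values) for field, values in seen.items()}


counts = distinct_counts(sample_lines, ["clientip", "agent"])
print(counts)  # prints {'clientip': 3, 'agent': 1}
```

A field with very few distinct values, like `agent` here, is the case where caching processor results should pay off most.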

(Daniel Mitterdorfer) #8

Hi @obudiman,

just as a heads-up: I just merged two performance improvements that will be released with Elasticsearch 5.2:

  • The geoip processor now caches the 1000 most recent lookup results (see #22231).
  • Ingest pipelines are now a little bit faster by default (thanks to an internal simplification, see #22234).

This should improve your situation a bit after the upgrade, but I would not expect wonders: an ingest pipeline will always add some overhead and thus reduce your indexing throughput. I have a few ideas for further performance improvements, but they will likely involve more work than the fixes I just made.

I hope these fixes already help you a bit, and thanks for bringing this to our attention!
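
To illustrate the idea behind the first item, here is a minimal sketch of a bounded cache that keeps the most recently used lookup results. It is not the Elasticsearch implementation (which is written in Java); the least-recently-used eviction policy, the `LruCache` class, and the `fake_geoip_lookup` function are all assumptions made for illustration.

```python
from collections import OrderedDict


class LruCache:
    """Bounded cache that evicts the least recently used entry when full."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.hits = 0
        self.misses = 0
        self._entries = OrderedDict()

    def get_or_compute(self, key, compute):
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        value = compute(key)
        self._entries[key] = value
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # drop the least recently used entry
        return value


def fake_geoip_lookup(ip):
    """Hypothetical stand-in for an expensive database lookup."""
    return {"ip": ip, "country_name": "unknown"}


# Tiny capacity so that eviction is visible in a short run.
cache = LruCache(capacity=2)
for ip in ["46.68.2.250", "67.59.33.139", "46.68.2.250", "33.17.149.6", "67.59.33.139"]:
    cache.get_or_compute(ip, fake_geoip_lookup)

print(cache.hits, cache.misses)  # prints "1 4"
```

A cache like this only helps when the same IPs recur within the window it covers; with mostly unique client IPs, nearly every lookup is still a miss.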

Daniel

