What's limiting my Elasticsearch?

Hello!

I'm trying to make my Elasticsearch indexing faster.
I have a Logstash instance that processes 17k events/s when using a null output.
If I set Elasticsearch as the output, I only get 6k/s.

Elasticsearch 1.7.1 currently runs as a single node on a separate server.
It has 16 cores, 64 GB of RAM and an SSD.

I can see that the CPU utilisation is quite low - around 40%.
The IO is also nothing for an SSD - 20MB/s of writes.
The heap is set to 30GB, so that shouldn't be a problem either.

Does anyone have an idea how I can check what's limiting my Elasticsearch?
I've tried setting more workers on the Logstash elasticsearch output - no significant change observed.

Have you followed the usual indexing optimization guidelines, especially around segments and merging? What does your Elasticsearch configuration look like?

Oops... looks like I missed the
index.store.throttle.type: none
switch... That would make perfect sense, as I could see that the IO never went above 20MB/s... and that's the default throttle limit.
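
If I'm reading the docs right, the same throttle can also be switched off at runtime via the cluster settings API - roughly something like this (untested on my side):

curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient" : {
    "indices.store.throttle.type" : "none"
  }
}'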

I'll test it and let you know.
Thanks!

OK... looks like setting index.store.throttle.type to none helped a little.

But Elasticsearch still isn't keeping up with the load generated by Logstash.
Logstash generates around 17k events/s.
I'm sending them over HTTP to a single Elasticsearch node using 4 workers.

The Elasticsearch configuration is as follows:

cluster.name: MD-test

index.number_of_shards: 1
index.number_of_replicas: 0

path.logs: /opt/MD/logs/monitoring

discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["127.0.0.1", "10.141.51.19:9300"]
#10.141.51.19:9300 is unavailable for now 

index.store.throttle.type: none

I get an overall throughput of 8.5k - 9k events/s.

Mem: 65972480k total, 38879532k used, 27092948k free, 182728k buffers
The CPU usage is around 25%
The disk IO is:

  • writes: 20-40MB/s
  • reads: 0
  • cancelled writes: 10-20 MB/s

What's limiting me now? How can I check it?
I didn't install Marvel, as this machine doesn't have internet access, but I can work around that if needed.

Hi, try increasing the number of shards to 4 (the number of workers).
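
For example in the index template settings (just a sketch based on your current template):

"settings" : {
  "index.number_of_shards" : 4,
  "index.number_of_replicas" : 0
}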

A single ES node can easily handle 20-25k events/sec, but it depends on many factors. Maybe you have a lot of fields, or some analyzers or doc values in your mapping?
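
It might also be worth a look at the hot threads API to see where the node spends its time, e.g.:

curl -s 'localhost:9200/_nodes/hot_threads'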

Hello Rusty,

Increasing the number of shards didn't help much.
Before I could get 10,000 events per second; now it's 11,500.

The CPU usage is around 30-40%.

I was thinking about increasing the number of indexing threads.
The documentation says it should equal the number of CPU cores by default, but in the node stats I can see:

  "thread_pool" : {
    "index" : {
      "threads" : 0,
      "queue" : 0,
      "active" : 0,
      "rejected" : 0,
      "largest" : 0,
      "completed" : 0
    },

Looks strange...

Would you share your index mapping?

As for threads, look at thread_pool.bulk; you can monitor thread pool activity with this command:

watch -d -n 5 "curl -s localhost:9200/_cat/thread_pool | sort"

You can also play with the processors setting.
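
For example in elasticsearch.yml, if you want to override the detected core count (just a sketch, 16 matches your hardware):

processors: 16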

Sure, that's my mapping:

{
  "template" : "logstash-*",
  "settings" : {
    "index.refresh_interval" : "5s"
  },
  "mappings" : {
    "_default_" : {
       "_all" : {"enabled" : true, "omit_norms" : true},
       "dynamic_templates" : [ {
         "message_field" : {
           "match" : "message",
           "match_mapping_type" : "string",
           "mapping" : {
             "type" : "string", "index" : "analyzed", "omit_norms" : true
           }
         }
       }, {
         "string_fields" : {
           "match" : "*",
           "match_mapping_type" : "string",
           "mapping" : {
             "type" : "string", "index" : "not_analyzed", "omit_norms" : true, "store" : true
           }
         }
       } ],
       "properties" : {
         "@version": { "type": "string", "index": "not_analyzed" },
         "geoip"  : {
           "type" : "object",
             "dynamic": true,
             "properties" : {
               "location" : { "type" : "geo_point" }
             }
         }
       }
    }
  }
}

And here's the _cat/thread_pool output:

host            ip           bulk.active bulk.queue bulk.rejected index.active index.queue index.rejected search.active search.queue search.rejected 
XXXXX2BPR110V02 10.141.51.15           0          0             0            0           0              0             0            0               0 

bulk.active varies from 0 to 16.
bulk.queue is usually 0, but sometimes has a greater value (1, 1, 12).
The rest is 0.

First, disable _all if you don't need it. What about _source - are you using it or not? If not, try disabling it.
Try increasing "index.refresh_interval" from "5s" to "30s".

For the settings, add:

"settings": {
  "index.refresh_interval": "30s",
  "index.codec.bloom.load": "false"
},
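
refresh_interval is also dynamic, so you can change it on an existing index without recreating it - something like this (assuming your indices match logstash-*):

curl -XPUT 'localhost:9200/logstash-*/_settings' -d '{
  "index" : { "refresh_interval" : "30s" }
}'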

For not_analyzed strings, try the following (do you really need store: true, and all the index_options as they are in your mapping template?):

"some_string_field": {
  "type": "string",
  "index": "not_analyzed",
  "norms": { "enabled": false },
  "omit_norms": true,
  "index_options": "docs",
  "store": false
},
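
Folded into the string_fields dynamic template from your current mapping, that would look roughly like this (sketch):

"string_fields" : {
  "match" : "*",
  "match_mapping_type" : "string",
  "mapping" : {
    "type" : "string",
    "index" : "not_analyzed",
    "omit_norms" : true,
    "index_options" : "docs",
    "store" : false
  }
}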

I've changed the refresh_interval and removed the "store" option and the _all field.
By the way, it turned out that somehow the mapping hadn't been applied, so I was running with the default configuration - all strings analyzed, etc.
Even after applying all these changes I couldn't see any difference in performance.
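
A quick way to double-check that the template really took effect is to compare the live mapping against the template, e.g.:

curl -s 'localhost:9200/logstash-*/_mapping?pretty'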

I've also performed some experiments with Logstash instances...
I was able to parse 7,000 entries per second using one Logstash & one Elasticsearch.
I was able to parse 11,000 entries per second using two Logstashes on a single server & one Elasticsearch.
I was able to parse 14,000 entries per second using three Logstashes on a single server & one Elasticsearch.

I couldn't see much difference between one or two Logstashes on a server when using the null output.
But now it looks like it does make a difference for Elasticsearch.
So I assume it's either a Logstash issue (the Elasticsearch output limiting Logstash's performance) or Elasticsearch working better when more clients call it (but I've tried tuning the "workers" switch on the Logstash elasticsearch output - no improvement observed).

I'll try to add more Logstashes tomorrow to see if Elasticsearch will handle more load.

The template is only applied to new indices, so you need to recreate the old index (delete it and reload your test data).

Yes, I know... I removed everything and the mapping changes got applied :slightly_smiling:
I'm waiting for additional servers to check if putting more pressure on Elasticsearch will make it sweat a little :slightly_smiling:

Hmm... I've tried with 2 more Logstashes on different servers...
The CPU usage on Elasticsearch is still 25-30% and it's handling around 14,000 entries per second.

I have no idea what's limiting the throughput...

Have you tried starting two independent Logstash instances at once? Maybe you are limited by the Logstash output plugin, not by Elasticsearch itself?

A data rate of 17k/sec is about 1,468,800,000 documents a day (17,000 × 86,400 seconds). That's a lot, but it can be handled by one ES instance. However, if you are planning to keep this data for a long period, one instance is not sufficient.
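
Two independent instances can simply be two processes with separate config files (the paths here are just placeholders):

bin/logstash -f /path/to/pipeline-a.conf &
bin/logstash -f /path/to/pipeline-b.conf &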

Which version of Logstash are you using? What does your configuration look like? Have you tried using more than one Logstash instance on different hosts to ensure Logstash is not the bottleneck? Is CPU saturated on the Logstash host during indexing?

Yes, that's what I've been waiting for - more servers to run Logstashes on... I've reached 14k per second using 3 Logstashes and adding more didn't make any difference.

I know one Elasticsearch instance is not enough - I'm just running one to ease maintenance and testing.

Can you share some sample data for your index and the actual index mapping?

I use Logstash 1.5.6.

The configuration is:

input {
  beats {
    port=>5000
  }
  beats {
    port=>5002
  }
}
filter {
  ...
}

output {
  if "metric" in [tags] {
    file {
      path => "/opt/logs/monitoring/metric-even-15.log"
    }
  } else {
    elasticsearch { 
      host => "10.141.51.15"
      port => "9200"
      protocol => "http"
      template => "/opt/monitoring/config/elasticsearch-template.json"
      template_overwrite => true  
      workers => 2
    }
  }
}
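
For completeness, the output could also be tuned towards bigger bulk requests via the flush_size option of the elasticsearch output - roughly like this (just a sketch, not verified here):

    elasticsearch {
      host => "10.141.51.15"
      port => "9200"
      protocol => "http"
      workers => 2
      flush_size => 5000
    }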

The CPU has high usage, but it surely has some spare cycles, so it's not a Logstash limitation.

Sample data:

10.141.96.3 [07/Mar/2016:06:35:16 +0000] "POST /isAlive HTTP/1.0" 204 - 1 

20160307 16:52:48.437 744454d7-d659-480f-b3e0-76b8f57cdfe7 10.135.9.27 10.141.96.42 POST /session/invalidate null {"sso-token":"xxxxx","Content-Type":"application/json",,"lb-cookie":"341",,,"x-forwarded-for":"10.135.9.27",} 

That's what the majority of log entries look like.