Expected performance


#1

Hi,

We have a number of application servers and we would like to ship all their logs into ES.

Therefore I set up an ES 2.0.0 cluster of 8 nodes on 5 different virtual machines (we are using Xen):

3 machines, each with 1 data node and 1 master node
1 machine with logstash, rabbitmq and a client node
1 machine with kibana and a client node

Settings:

vm with data and master node

  • total of 16gb memory
  • data node 12gb ES_HEAP
  • master node 512M ES_HEAP
  • total of 4 cores
  • direct-attached RAID array of spinning disks

vm with logstash

  • queue: logstash with a multiline input codec ---> RabbitMQ ---> up to 4 logstash instances for filtering, with output to (ES client on the logstash machine, ES data nodes 1, 2, 3)

Our application is distributed across 18 machines with nginx/Pound reverse proxies for load balancing.

Shipping and processing of the nginx and Pound logs works smoothly. Index size is about 12GB per day with roughly 10 million docs, and the indexing rate is about 300-500 docs/sec.

When we activate shipping of the app-server logs we run into performance issues. The required indexing rate is about 1500-2000 docs/sec, and the documents are somewhat larger (they contain Java stack traces, for example). Index size grows to about 70GB per day.

Our monitoring tool frequently sends requests to ES asking for some interface states of the application. For that, it analyses the last 5 minutes of data.

The problem is that the last 5 minutes are not available when the RabbitMQ queue grows due to poor indexing performance.

Finally :slight_smile:
My question is: am I expecting too much of this setup, or is this typical performance for a setup like this?

Thanks in advance!

Cheers,
Niko

Below you'll find our configuration of the data nodes:

======================== Elasticsearch Configuration DATA NODE =========================

cluster.name: tatooine

node.name: data-node-1

node.data: true
node.master: false

index.number_of_shards: 9
index.number_of_replicas: 1

path.logs: /var/log/elasticsearch-data
path.data: /var/lib/elasticsearch-data

bootstrap.mlockall: true

network.host: IP of the vm

http.port: 9202
transport.tcp.port: 9302

discovery.zen.ping.unicast.hosts: ["List of Our IP:Port combinations"]

discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.multicast.enabled: false
discovery.zen.fd.ping_timeout: 30s

index.merge.scheduler.max_thread_count: 1
index.refresh_interval: 30s
index.translog.flush_threshold_size: 1gb
index.translog.flush_threshold_period: 30m
indices.memory.index_buffer_size: 50%
indices.memory.min_shard_index_buffer_size: 128mb

threadpool.search.type: fixed
threadpool.search.size: 20
threadpool.search.queue_size: 100

threadpool.bulk.type: fixed
threadpool.bulk.size: 60
threadpool.bulk.queue_size: 300

threadpool.index.type: fixed
threadpool.index.size: 20
threadpool.index.queue_size: 100

indices.fielddata.cache.size: 15%
indices.fielddata.cache.expire: 6h
indices.cache.filter.size: 15%
indices.cache.filter.expire: 6h


(Christian Dahlqvist) #2

I have a few comments on your setup:

It is recommended to set aside 50% of available memory for the heap, so I would recommend lowering the data node heap to 7-8GB, especially considering that you also have a dedicated master node on each host.
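On ES 2.x the heap is typically set via the ES_HEAP_SIZE environment variable; a sketch for a 16GB VM like this one (the file path is an assumption and depends on how the packages were installed):

```
# /etc/default/elasticsearch (Debian-style init script; path is an assumption)
# ~50% of the 16GB RAM, leaving room for the 512M master node,
# Lucene's use of the OS page cache, and the OS itself
ES_HEAP_SIZE=7g
```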

Given the data volumes you are mentioning, this seems like an unnecessarily large number of shards. You should be fine reducing it to 3 (or possibly 6).
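Since Logstash creates a new index per day, the shard count can also be set through an index template instead of elasticsearch.yml; a sketch (template name and index pattern are assumptions based on the Logstash defaults):

```
PUT _template/logstash
{
  "template": "logstash-*",
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}
```

This only affects indices created after the template is installed, so existing daily indices keep their current shard count.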

Have you verified that it is Elasticsearch and not your single Logstash indexer that is limiting indexing throughput? What does CPU usage look like on this node during indexing?


#3

Hi,

I ran a performance test with a single data node. I simply took one of the data nodes and removed it from the cluster.

1 VM with 16GB
1 data node with 7GB heap
1 master node with 512M heap

shards set to 6
replica set to 0

the rest of the config was unchanged (apart from the cluster settings)

This node was able to index ~1000 docs/sec.

On the logstash node I started an additional indexer that reads messages from another queue (containing the application logs).

So the logstash queues looked like this:

reverseproxy logs ---> logstash-input instance ---> rabbitmq.queue1 ---> logstash-indexer1 ----> production ES (with the 2 remaining nodes)
app logs ---> logstash-input instance ---> rabbitmq.queue2 ---> logstash-indexer2 ----> test ES (only 1 data node)

I processed 650k docs. I've attached screenshots of CPU utilization on the logstash and ES nodes, as well as a screenshot from Marvel.



As you can see, the CPU is mostly idle while indexing and there is some IO wait.

Do you have any recommendations on how to improve performance?


(Christian Dahlqvist) #4

Depending on your data and configuration, Logstash throughput often ends up being CPU bound. What does CPU usage on the Logstash VM look like during indexing? What does your Logstash config look like?
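If the filter stage does turn out to be the bottleneck, Logstash 2.x can run several filter workers in parallel via the -w flag; a sketch (the config file path is an assumption for illustration):

```
# Run the indexer with 4 filter workers, e.g. one per core on the VM
/opt/logstash/bin/logstash agent -f /etc/logstash/indexer.conf -w 4
```

Note that filters like multiline are not safe to parallelize, which is one reason to keep the multiline codec on the input side, as you already do.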


#5

Hi,

You can see the CPU load in the last screenshot of my previous post (the one headed "logstash").

The Logstash config is as follows:

logstash-input instance (collects everything from the servers and writes to RabbitMQ):

input {
  lumberjack {
    type => "nginx_access"
    port => 5001
    ssl_certificate => "/etc/pki/tls/certs/logstash-forwarder.crt"
    ssl_key => "/etc/pki/tls/private/logstash-forwarder.key"
  }
  lumberjack {
    type => "cdp-app"
    port => 5002
    ssl_certificate => "/etc/pki/tls/certs/logstash-forwarder.crt"
    ssl_key => "/etc/pki/tls/private/logstash-forwarder.key"
    codec => multiline {
      pattern => "^\s.+"
      what => "previous"
    }
  }
  lumberjack {
    type => "bosh-app"
    port => 5003
    ssl_certificate => "/etc/pki/tls/certs/logstash-forwarder.crt"
    ssl_key => "/etc/pki/tls/private/logstash-forwarder.key"
    codec => multiline {
      pattern => "^\s.+"
      what => "previous"
    }
  }
}

output {
  if "cdp-app" in [type] {
    rabbitmq {
      exchange => "cdp-rabbitmq"
      exchange_type => "direct"
      key => "cdp-key"
      host => ""
      workers => 4
      durable => true
      persistent => true
    }
  }
  else {
    rabbitmq {
      exchange => "logstash-rabbitmq"
      exchange_type => "direct"
      key => "logstash-key"
      host => ""
      workers => 4
      durable => true
      persistent => true
    }
  }
}

Logstash indexer (reads from RabbitMQ, applies the filters and then sends to ES):

input {
  rabbitmq {
    host => ""
    queue => "logstash-queue"
    durable => true
    key => "logstash-key"
    exchange => "logstash-rabbitmq"
    threads => 3
    prefetch_count => 50
    port => 5672
  }
}

filter {
  if [type] == "bosh-app" {
    grok {
      match => { "message" => "%{BOSHAPP}" }
      patterns_dir => "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-patterns-core-2.0.2/patterns/"
    }
  }
  if [type] == "cdp-app" {
    grok {
      match => { "message" => [ "%{CDPAPPSHORT}", "%{CDPAPPLONG}" ] }
      patterns_dir => "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-patterns-core-2.0.2/patterns/"
    }
  }
  # must match the "nginx_access" type set by the lumberjack input
  if [type] == "nginx_access" {
    grok {
      match => { "message" => "%{NGINXACCESS}" }
    }
    geoip {
      source => "ip"
      target => "geoip"
      add_field => [ "[geoip][coordinates]", "%{[geoip][longitude]}" ]
      add_field => [ "[geoip][coordinates]", "%{[geoip][latitude]}" ]
    }
    mutate {
      convert => [ "[geoip][coordinates]", "float" ]
      convert => [ "duration", "float" ]
      convert => [ "body_bytes_sent", "float" ]
    }
  }
  if [type] == "pound" {
    grok {
      match => { "message" => "%{POUNDACCESS}" }
    }
    geoip {
      source => "ip"
      target => "geoip"
      add_field => [ "[geoip][coordinates]", "%{[geoip][longitude]}" ]
      add_field => [ "[geoip][coordinates]", "%{[geoip][latitude]}" ]
    }
    mutate {
      convert => [ "[geoip][coordinates]", "float" ]
      convert => [ "duration", "float" ]
      convert => [ "body_bytes_sent", "float" ]
    }
  }
  if "_grokparsefailure" in [tags] {
    drop {}
  }
}

output {
  elasticsearch {
    hosts => ["ip:9200"]
  }
}
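One knob worth looking at on the output side: the elasticsearch output in Logstash 2.x batches documents into bulk requests and can write from several workers in parallel. A sketch of a tuned output (the workers and flush_size values are assumptions to experiment with, not measured optima):

```
output {
  elasticsearch {
    hosts      => ["ip:9200"]
    workers    => 2       # parallel output workers per Logstash instance
    flush_size => 1000    # docs per bulk request (default is 500)
  }
}
```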


#6

No other ideas?

So let me rephrase the question:

What do you think is the minimal setup needed to process the following data?

Indexing

  • 10 million nginx log entries per day (with GeoIP)
  • 60 million JBoss log entries per day (with Java stack traces: multiline!)

Searching:

  • around 50 searches per 5 minutes
  • each search uses aggregations
  • each search needs no data older than 5 minutes

We want to be able to go back in time at least 14 days: ~1,000,000,000 docs in total.

A search latency of 5-10 seconds is okay; the focus is on indexing.

Of course our applications have peak times, so the 70 million docs per day are not evenly distributed.
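A quick back-of-envelope on those numbers (a sketch in Python; all inputs are the figures quoted in this thread):

```python
# Sizing estimate from the figures above: 10M nginx + 60M JBoss docs/day,
# 70GB of index per day, 14 days of retention.
docs_per_day = 10_000_000 + 60_000_000
seconds_per_day = 24 * 60 * 60

avg_rate = docs_per_day / seconds_per_day      # average docs/sec over a day

retention_days = 14
total_docs = docs_per_day * retention_days     # docs kept online

daily_index_gb = 70
total_storage_gb = daily_index_gb * retention_days  # primaries only, no replicas

print(round(avg_rate))      # ≈ 810 docs/sec on average
print(total_docs)           # 980,000,000 ≈ the ~1 billion mentioned above
print(total_storage_gb)     # 980 GB of primary shards over 14 days
```

So the average rate is well below the 1500-2000 docs/sec peaks, which is consistent with the queue draining off-peak but growing during peak hours.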

Thanks in advance!

Niko


(system) #7