Performance Issues with Elasticsearch

Hi,

We're currently facing performance issues with our Elasticsearch cluster and are trying to pin down the cause and a possible fix. Our pipeline is Nxlog -> Logstash Broker -> Redis -> 12 Logstash clients -> 9 Elasticsearch data nodes + 1 Elasticsearch master. At some point data processing slows down: the Logstash machines stop consuming events as fast as the broker pushes them into Redis, which causes two things:
1 - a delay in getting data into the shards - we see processing lag of up to a couple of hours
2 - the Redis queue becomes overloaded, reaching up to 30 million documents, and Redis eventually gets killed by the OS
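For reference, the backlog on the broker key can be watched directly from Redis. A minimal sketch, assuming Redis listens on euwest-redis on its default port, using the list key from the config below:

# Poll the length of the Logstash list every 10 seconds
while true; do
  redis-cli -h euwest-redis llen product:syslog_product
  sleep 10
done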

We don't see any metrics in the Marvel, HQ, or KOPF plugins suggesting that the ES nodes are overloaded; everything looks absolutely normal.
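For completeness, bulk thread pool rejections can also be checked outside the plugins. A quick sketch, assuming the HTTP API is reachable on port 9200 of any node:

# Active/queued/rejected bulk tasks per node; rejections would point at indexing pressure
curl -s 'euwest-elastic:9200/_cat/thread_pool?v&h=host,bulk.active,bulk.queue,bulk.rejected'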

So, I'd appreciate any help or advice, since we don't see anything that helps us identify the problem.

Below is our configuration:

Logstash Broker:

input {
  file {
    type => "syslog_product"
    path => ["/data/product/*"]
    sincedb_path => "/data/sincedb"
  }
}

output {
  stdout {}
  redis {
    host => ["euwest-redis"]
    data_type => "list"
    key => "product:syslog_product"
    type => "syslog_product"
    batch => true
    workers => 8
  }
}

Logstash Machines:

input {
  redis {
    host => ["euwest-redis"]
    data_type => "list"
    key => "product:syslog_product"
    type => "syslog_product"
    tags => "product_pri"
    threads => 8
    batch_count => 200
  }
}

filter {
  grok {
    match => ["message", "%{DATA:hostname} %{DATA:cluster} %{GREEDYDATA:empty} - - - [%{MONTHDAY:day}/%{MONTH:month}/%{YEAR:year}:%{HOUR:hour}:%{MINUTE:minute}:%{SECOND:second}+%{GREEDYDATA:empty}] {{ %{DATA:http_request} /%{DATA:snippet}/%{DATA:referer} }} %{DATA:http_code} {{ %{DATA:empty} }} {{ %{DATA:url} }} {{ %{DATA:browser} }} {{ %{DATA:empty} }} {{ %{DATA:client_ip} }} {{ %{DATA:empty} {{ %{DATA:empty} }} {{ %{DATA:empty} }} {{ %{DATA:empty} }} {{ %{DATA:session_time} }} {{ %{DATA:empty} }} {{ %{DATA:session_id} }} {{ %{DATA:snippet_id} }} {{ %{DATA:product_version} }} {{ %{DATA:papyrus_revision} }}"]
  }

  mutate {
    replace => [ "@source_host", "%{hostname}" ]
    remove => [ "empty", "@source_path", "@source" ]
    convert => [ "snippet", "integer", "session_time", "float" ]
  }

  date {
    match => [ "MMM d HH:mm:ss", "MMM dd HH:mm:ss", "ISO8601" ]
  }

  if "_grokparsefailure" in [tags] { drop {} }
}

output {
  elasticsearch {
    cluster => "G177"
    host => "euwest-elastic"
    port => "9300"
    index => "logstash-%{+YYYY.MM.dd}"
    manage_template => false
  }
}

Elasticsearch Node:

cluster.name: G177
node.name: elasticsearch-euwest-qqqq
node.master: false
node.data: true
bootstrap.mlockall: true
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["MASTER IP"]
network.host: _eth0:ipv4_
path.conf: /etc/elasticsearch
path.data: /ebs/elasticsearch
path.logs: /data/logs/elasticsearch
path.plugins: /usr/share/elasticsearch/plugins
indices.memory.index_buffer_size: 50%
index.translog.flush_threshold_ops: 50000
index.store.type: mmapfs
index.refresh_interval: 10s
indices.fielddata.cache.size: 25%
indices.cluster.send_refresh_mapping: false
index.number_of_replicas: 1
index.search.slowlog.threshold.query.warn: 10s
index.search.slowlog.threshold.query.info: 5s
index.search.slowlog.threshold.query.debug: 2s
index.search.slowlog.threshold.query.trace: 500ms

index.search.slowlog.threshold.fetch.warn: 1s
index.search.slowlog.threshold.fetch.info: 800ms
index.search.slowlog.threshold.fetch.debug: 500ms
index.search.slowlog.threshold.fetch.trace: 200ms

index.indexing.slowlog.threshold.index.warn: 10s
index.indexing.slowlog.threshold.index.info: 5s
index.indexing.slowlog.threshold.index.debug: 2s
index.indexing.slowlog.threshold.index.trace: 500ms

indices.store.throttle.type: none
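If nothing shows up in the plugins, hot threads can at least show what the data nodes are busy with during a slowdown. A sketch, again assuming the HTTP API on port 9200:

# Busiest threads per node; heavy merging or GC here would point at indexing pressure
curl -s 'euwest-elastic:9200/_nodes/hot_threads?threads=5'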

What versions are you on?

Logstash: 1.5.3
Elasticsearch: Version: 1.4.4, Build: c88f77f/2015-02-19T13:05:36Z, JVM: 1.7.0_79

How much data in the cluster? How many nodes and what are their specs?

I'd suggest upgrading ES (irrespective of those answers).

We have 1 Master Node, 2 Search Nodes and 9 Data Nodes.

We store 40 days of data. Each day is about 500 GB.

How much RAM and heap for the data nodes?

Each data node has 30 GB of memory.

The heap size is set to 25 GB:

usr/bin/java -Xms25g -Xmx25g -Xss256k -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -Delasticsearch -Des.pidfile=/var/run/elasticsearch/elasticsearch.pid -Des.path.home=/usr/share/elasticsearch -cp :/usr/share/elasticsearch/lib/elasticsearch-0.90.9.jar:/usr/share/elasticsearch/lib/:/usr/share/elasticsearch/lib/sigar/ -Des.default.path.home=/usr/share/elasticsearch -Des.default.path.logs=/var/log/elasticsearch -Des.default.path.data=/var/lib/elasticsearch -Des.default.path.work=/tmp/elasticsearch -Des.default.path.conf=/etc/elasticsearch org.elasticsearch.bootstrap.Elasticsearch

That may be part of it then.

We recommend setting the heap to 50% of total system memory so the OS can cache the underlying Lucene files, which helps performance.
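On a 30 GB machine that would be roughly the following, assuming the standard packaging, which reads ES_HEAP_SIZE from /etc/default/elasticsearch (or /etc/sysconfig/elasticsearch):

# ~50% of RAM for the JVM heap, the rest left for the OS page cache of the Lucene files
ES_HEAP_SIZE=15g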

Thanks! But if I reduce the heap size to 15 GB, wouldn't that create problems with Java memory? We had issues in the past when the heap reached 100%, causing Java to crash.

Then your cluster is overloaded and you need more resources or nodes, or less data.
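To put rough numbers on it, from the figures above:

40 days * 500 GB/day = ~20 TB of primary data
20 TB * 2 (one replica, per the node config) = ~40 TB total
40 TB / 9 data nodes = ~4.4 TB per data node, against 30 GB of RAM each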

There's only so much you can do with a given set of resources :slight_smile: