Performance Issues with ElasticSearch

Igor_Zeiger · August 12, 2015, 4:13pm

Hi,

We're currently facing performance issues with our ElasticSearch cluster and are trying to find the cause and a way to fix it. Our pipeline is Nxlog -> Logstash Broker -> Redis -> 12 Logstash clients -> 9 ElasticSearch Nodes + 1 ElasticSearch Master. At some point data processing slows down. The symptom is that the Logstash machines do not consume data as fast as the broker puts it into Redis, which causes two things:
1 - a delay in getting data into the shards: we see processing lag of up to a couple of hours
2 - the Redis queue becomes overloaded, reaching up to 30 million documents, and Redis is killed by the OS
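
The two symptoms are linked: the lag seen in ElasticSearch is roughly the queue backlog divided by the rate at which the Logstash clients drain it. A minimal Python sketch of that arithmetic follows. The 30 million figure is from the post; the consume rate is a hypothetical placeholder, not a measured value, and `drain_time_seconds` is a name made up for this illustration.

```python
def drain_time_seconds(backlog_docs, consume_rate, produce_rate=0.0):
    """Seconds needed to empty the queue, or None if it never drains.

    Rates are in documents per second. If the consumers are not faster
    than the producer, the backlog only grows.
    """
    net_rate = consume_rate - produce_rate
    if net_rate <= 0:
        return None
    return backlog_docs / net_rate


BACKLOG_DOCS = 30_000_000        # peak Redis queue size reported in the post
HYPOTHETICAL_CONSUME = 5_000.0   # docs/s, assumed for illustration only

seconds = drain_time_seconds(BACKLOG_DOCS, HYPOTHETICAL_CONSUME)
print(f"Time to drain with no new input: {seconds / 3600:.1f} hours")
```

Plugging in measured rates (for example, from sampling the Redis list length over time) would show whether the clients ever catch up or whether the backlog grows without bound.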

We don't see any metrics in Marvel, HQ, or the KOPF plugin indicating that the ES nodes are overloaded; everything looks absolutely normal.

I'd appreciate any help or advice, since we don't see anything that helps us identify the problem.

Below is our configuration:

Logstash Broker:

input {

file {
type => "syslog_product"
path => ["/data/product/*"]
sincedb_path => "/data/sincedb"
}
}
output {

stdout {}
redis {
host => ["euwest-redis"]
data_type => "list"
key => "product:syslog_product"
type => "syslog_product"
batch => true
workers => 8
}
}

Logstash Machines:

input {

redis {

host => ["euwest-redis"]
data_type => "list"
key => "product:syslog_product"
type => "syslog_product"
tags => "product_pri"
threads => 8
batch_count => 200

}
}

filter {

grok {
match => ["message", "%{DATA:hostname} %{DATA:cluster} %{GREEDYDATA:empty} - - - [%{MONTHDAY:day}/%{MONTH:month}/%{YEAR:year}:%{HOUR:hour}:%{MINUTE:minute}:%{SECOND :second}+%{GREEDYDATA:empty}] {{ %{DATA:http_request} /%{DATA:snippet}/%{DATA:referer} }} %{DATA:http_code} {{ %{DATA:empty} }} {{ %{DATA:url} }} {{ %{DATA:browser} }} {{ %{DATA:empty} }} {{ %{DATA:client_ip} }} {{ %{DATA:empty} {{ %{DATA:empty} }} {{ %{DATA:empty} }} {{ %{DATA:empty} }} {{ %{DATA:session_time} }} {{ %{DATA:empty} }} {{ %{DATA:session_id} }} {{ %{DATA:snippet_id} }} {{ %{DATA:product_version} }} {{ %{DATA:papyrus_revision} }}"]
}

mutate {
replace => [ "@source_host", "%{hostname}" ]
remove => [ "empty", "@source_path", "@source" ]
convert => [ "snippet", "integer", "session_time", "float" ]

}

date {
match => [ "MMM d HH:mm:ss", "MMM dd HH:mm:ss", "ISO8601" ]
}
if "_grokparsefailure" in [tags] { drop {} }
}

output {

elasticsearch {
cluster => "G177"
host => "euwest-elastic"
port => "9300"
index => "logstash-%{+YYYY.MM.dd}"
manage_template => false
}
}

ElasticSearch Node :

cluster.name: G177
node.name: elasticsearch-euwest-qqqq
node.master: false
node.data: true
bootstrap.mlockall: true
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["MASTER IP"]
network.host: eth0:ipv4
path.conf: /etc/elasticsearch
path.data: /ebs/elasticsearch
path.logs: /data/logs/elasticsearch
path.plugins: /usr/share/elasticsearch/plugins
indices.memory.index_buffer_size: 50%
index.translog.flush_threshold_ops: 50000
index.store.type: mmapfs
index.refresh_interval: 10s
indices.fielddata.cache.size: 25%
indices.cluster.send_refresh_mapping: false
index.number_of_replicas: 1
index.search.slowlog.threshold.query.warn: 10s
index.search.slowlog.threshold.query.info: 5s
index.search.slowlog.threshold.query.debug: 2s
index.search.slowlog.threshold.query.trace: 500ms

index.search.slowlog.threshold.fetch.warn: 1s
index.search.slowlog.threshold.fetch.info: 800ms
index.search.slowlog.threshold.fetch.debug: 500ms
index.search.slowlog.threshold.fetch.trace: 200ms

index.indexing.slowlog.threshold.index.warn: 10s
index.indexing.slowlog.threshold.index.info: 5s
index.indexing.slowlog.threshold.index.debug: 2s
index.indexing.slowlog.threshold.index.trace: 500ms

indices.store.throttle.type: none

warkolm · August 12, 2015, 10:55pm

What versions are you on?

Igor_Zeiger · August 13, 2015, 6:51am

Logstash: 1.5.3
ElasticSearch: Version: 1.4.4, Build: c88f77f/2015-02-19T13:05:36Z, JVM: 1.7.0_79

warkolm · August 13, 2015, 7:27am

How much data in the cluster? How many nodes and what are their specs?

I'd suggest upgrading ES (irrespective of those answers).

Igor_Zeiger · August 14, 2015, 9:09pm

We have 1 Master Node, 2 Search Nodes and 9 Data Nodes.

We store 40 days of data. Each day is about 500 GB.
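
For a sense of scale, these figures can be turned into a rough per-node data volume. This is only a sketch: the thread does not say whether the 500 GB per day already includes the replica copy, so the `copies` parameter is an assumption to vary, and `per_node_gb` is a hypothetical helper name.

```python
DAYS_RETAINED = 40   # from the thread
GB_PER_DAY = 500     # from the thread
DATA_NODES = 9       # from the thread


def per_node_gb(days, gb_per_day, nodes, copies=1):
    """Average on-disk data per node, assuming shards are evenly spread."""
    return days * gb_per_day * copies / nodes


total_gb = DAYS_RETAINED * GB_PER_DAY
print(f"Total retained: {total_gb} GB")

# copies=1: the 500 GB/day figure already counts primaries and replicas.
print(f"Per node (copies=1): {per_node_gb(DAYS_RETAINED, GB_PER_DAY, DATA_NODES):.0f} GB")

# copies=2: the figure is primaries only and number_of_replicas is 1.
print(f"Per node (copies=2): {per_node_gb(DAYS_RETAINED, GB_PER_DAY, DATA_NODES, copies=2):.0f} GB")
```

Either way, each data node holds terabytes of index data, far more than the memory available to it.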

warkolm · August 14, 2015, 10:06pm

How much RAM and heap for the data nodes?

Igor_Zeiger · August 17, 2015, 6:55pm

Each data node has 30 GB of memory.

The heap size is set to 25 GB:

usr/bin/java -Xms25g -Xmx25g -Xss256k -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -Delasticsearch -Des.pidfile=/var/run/elasticsearch/elasticsearch.pid -Des.path.home=/usr/share/elasticsearch -cp :/usr/share/elasticsearch/lib/elasticsearch-0.90.9.jar:/usr/share/elasticsearch/lib/:/usr/share/elasticsearch/lib/sigar/ -Des.default.path.home=/usr/share/elasticsearch -Des.default.path.logs=/var/log/elasticsearch -Des.default.path.data=/var/lib/elasticsearch -Des.default.path.work=/tmp/elasticsearch -Des.default.path.conf=/etc/elasticsearch org.elasticsearch.bootstrap.Elasticsearch

warkolm · August 17, 2015, 9:56pm

That may be part of it then.

We recommend setting heap to 50% of total system memory to allow the OS to cache the underlying lucene files to help performance.
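
Applied to the numbers in this thread, the 50% guideline works out as in the sketch below. The 30 GB and 25 GB figures come from the posts; `split_memory` is a made-up helper name, and the calculation ignores memory used by the OS and other processes.

```python
def split_memory(total_ram_gb, heap_fraction=0.5):
    """Return (heap_gb, remaining_gb) for a given heap fraction of RAM."""
    heap_gb = total_ram_gb * heap_fraction
    return heap_gb, total_ram_gb - heap_gb


TOTAL_RAM_GB = 30      # per data node, from the thread
CURRENT_HEAP_GB = 25   # -Xms25g -Xmx25g, from the thread

heap_gb, cache_gb = split_memory(TOTAL_RAM_GB)
print(f"Guideline: {heap_gb:.0f} GB heap, {cache_gb:.0f} GB left for the OS file cache")
print(f"Current:   {CURRENT_HEAP_GB} GB heap, "
      f"{TOTAL_RAM_GB - CURRENT_HEAP_GB} GB left for the OS file cache")
```

With the current setting, only about 5 GB per node remains for caching Lucene files, compared with about 15 GB under the guideline.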

Igor_Zeiger · August 18, 2015, 1:01pm

Thanks! But if I reduce the heap size to 15 GB, wouldn't that create problems with Java memory? We had some issues when the heap reached 100%, causing Java to crash.

warkolm · August 18, 2015, 10:03pm

Then your cluster is overloaded and you need more resources or nodes, or less data.

There's only so much you can do with a given set of resources.
