Data delay in ELK


(sundar) #1

Hi,
Every day we receive close to 2 TB of data, about 70K events per second, from Kafka into ELK.
Below is my ELK and hardware setup:
15 Logstash servers (36 GB RAM, 6 CPUs)
25 data nodes (36 GB RAM, 6 CPUs), each with 2 TB of storage.

We are seeing a data delay of almost 2 days. Could you please tell me whether there are any issues with my configuration or hardware?

Regards
Sundar


(Christian Dahlqvist) #2

Have you verified whether Elasticsearch or Logstash is limiting throughput? If so, how did you go about doing so?


(sundar) #3

Thanks Christian! I have tested a throughput of 6k events per second from Kafka to a Logstash server, but I have not tested the throughput from Logstash to Elasticsearch. Is there any way to test it?


(Christian Dahlqvist) #4

If each Logstash instance is able to process in excess of 6k events per second (90k per second in total) with all the filters present but without outputting to Elasticsearch, it sounds like either Elasticsearch or the Elasticsearch output plugin could be the bottleneck. Which version of Logstash are you using? How have you configured your Elasticsearch output plugin(s)?
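
One way to run the test described above is a throwaway pipeline that keeps the Kafka input and all the filters but swaps the Elasticsearch output for the dots codec, which prints one character per event. This is only a sketch: the group_id below is a hypothetical name, chosen so the test does not move the offsets of the real consumer group, and the connection string, topic, schema path and filters are copied from the configuration posted later in this thread.

```
input {
  kafka {
    zk_connect => "kafka1:2181,kafka2:2181,kafka3:2181,kafka4:2181,kafka5:2181"
    white_list => "applogs"
    # Hypothetical throwaway group, so the real consumers' offsets are untouched
    group_id => "logstash-throughput-test"
    codec => avro {
      schema_uri => "/apps/schema/rocana3.schema"
    }
  }
}
filter {
  # Same filters as the production pipeline, so their cost is included
  de_dot {
    nested => true
  }
  date {
    match => ["ts","UNIX_MS"]
    target => "@timestamp"
    timezone => "America/New_York"
  }
  ruby {
    code => "
      event['ingest_time'] = DateTime.now.strftime('%Q');
      event['ingest_delay'] = (1000 * (Time.now.to_f - event['@timestamp'].to_f)).round(0);
    "
  }
}
output {
  # One dot per event instead of indexing into Elasticsearch
  stdout { codec => dots }
}
```

Piping Logstash's stdout through a rate meter such as pv then gives events per second, since each event is one byte. If the per-instance rate is well above what actually reaches Elasticsearch, the output side is the more likely bottleneck.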


(sundar) #5

I'm using Logstash 2.4 and Elasticsearch 5.0.2.

output {
  elasticsearch {
    codec => avro {
      schema_uri => "/apps/schema/rocana3.schema"
    }
    hosts => "http://es-uat-rtp-master-ltm.xxx.com:9200/"
    index => "logstash-applogs-%{+YYYY.MM.dd}-1"
    workers => 6
  }
}


(Christian Dahlqvist) #6

I would expect each Logstash node to be able to connect to all data nodes in order to spread the load, but it looks like each Logstash node is sending all traffic to a single node (which also based on the name seems to be a master node). Is traffic evenly spread across the cluster?


(sundar) #7

Thanks for the reply. I removed the LTM URL from hosts and listed all the data nodes instead:
hosts => ["host1","host2",............"host25"]
Even so, the Elasticsearch data nodes are not receiving more data.


(Christian Dahlqvist) #8

What does the resource utilisation look like on the Elasticsearch nodes? Do you see high CPU usage and/or high iowait? Is there anything in the Elasticsearch logs indicating e.g. long GC or merge throttling?


(sundar) #9

CPU and I/O look fine, and there are no GC issues.

My ELK configuration is below. I hope it helps you understand my setup; please tell me if anything looks wrong.

Linux infrastructure for Logstash, ES and Kibana:
Hardware: 6 CPU / 32 GB RAM
Operating system: Oracle Enterprise Linux 6 FID16a 2X-Large

Logstash:
input {
  kafka {
    zk_connect => "kafka1:2181,kafka2:2181,kafka3:2181,kafka4:2181,kafka5:2181"
    white_list => "applogs"
    group_id => "logstash-mng-applogs-uat-rtp"
    codec => avro {
      schema_uri => "/apps/schema/rocana3.schema"
    }
  }
}
filter {
  de_dot {
    nested => true
  }
  date {
    match => ["ts","UNIX_MS"]
    target => "@timestamp"
    timezone => "America/New_York"
  }
  ruby {
    code => "
      event['ingest_time'] = DateTime.now.strftime('%Q');
      event['ingest_delay'] = (1000 * (Time.now.to_f - event['@timestamp'].to_f)).round(0);
    "
  }
}
output {
  elasticsearch {
    hosts => [
      "es-uat-rtp-data-1.xxx.com:9200","es-uat-rtp-data-2.xxx.com:9200","es-uat-rtp-data-3.xxx.com:9200","es-uat-rtp-data-4.xxx.com:9200","es-uat-rtp-data-5.xxx.com:9200",
      "es-uat-rtp-data-6.xxx.com:9200","es-uat-rtp-data-7.xxx.com:9200","es-uat-rtp-data-8.xxx.com:9200","es-uat-rtp-data-9.xxx.com:9200","es-uat-rtp-data-10.xxx.com:9200",
      "es-uat-rtp-data-11.xxx.com:9200","es-uat-rtp-data-12.xxx.com:9200","es-uat-rtp-data-13.xxx.com:9200","es-uat-rtp-data-14.xxx.com:9200","es-uat-rtp-data-15.xxx.com:9200",
      "es-uat-rtp-data-16.xxx.com:9200","es-uat-rtp-data-17.xxx.com:9200","es-uat-rtp-data-18.xxx.com:9200","es-uat-rtp-data-19.xxx.com:9200","es-uat-rtp-data-20.xxx.com:9200",
      "es-uat-rtp-data-21.xxx.com:9200","es-uat-rtp-data-22.xxx.com:9200","es-uat-rtp-data-23.xxx.com:9200","es-uat-rtp-data-24.xxx.com:9200","es-uat-rtp-data-25.xxx.com:9200"
    ]
    index => "logstash-applogs-%{+YYYY.MM.dd}-1"
    workers => 6
  }
}

Master Node ES :-

cluster.name: sei-elk-uat-rtp
node.name: sundar-master-01
node.master: true
node.data: false
path.data: /apps/masterES/data
path.logs: /apps/masterES/logs
bootstrap.memory_lock: true
network.host: 01.02.03.04
http.port: 9200
discovery.zen.ping.unicast.hosts: ["master1 ip","master2 ip","master3 ip"]
discovery.zen.minimum_master_nodes: 2
http.cors.enabled: true
http.cors.allow-origin: "*"

Data Node ES :-

cluster.name: sei-elk-uat-rtp
node.name: sundar-data-01
node.master: false
node.data: true
path.data: /apps/dataES1/data
path.logs: /apps/dataES1/logs
discovery.zen.ping.unicast.hosts: ["master1 ip","master2 ip","master3 ip"]
network.host: 05.06.07.08
http.port: 9200
bootstrap.memory_lock: true

Client ES:-

cluster.name: sei-elk-uat-rtp
node.name: sundar-client-01
node.master: false
node.data: false
path.data: /apps/clientES/data
path.logs: /apps/clientES/logs
network.host: 10.138.000.00
http.port: 9200
discovery.zen.ping.unicast.hosts: ["master1 ip","master2 ip","master3 ip"]


(sundar) #10

Today we expanded the cluster to close to 50 data nodes (2 TB storage each), 3 masters and 3 clients.
We are still not getting a higher indexing rate. The screenshot has the monitoring details.
Could you please tell me if any settings are wrong?


(Christian Dahlqvist) #11

If you have doubled the size of the cluster, traffic is distributed across all nodes, the shards being indexed into are spread across the nodes, and you are still not seeing any performance improvement, it is quite possible that Logstash is limiting throughput after all.

The Kafka input has a range of configuration parameters that you can tune for performance, e.g. consumer_threads. It may be worthwhile tuning this, but I have personally not done it so cannot really give any advice on this.
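
For reference, a minimal sketch of what raising consumer_threads could look like on the input from post #9. The value 4 is an arbitrary illustration, not a recommendation. Within a Kafka consumer group each partition is read by at most one consumer thread, so threads beyond the partition count of the topic (summed over all Logstash instances sharing the group_id) would sit idle; the partition count of applogs is not stated in this thread.

```
input {
  kafka {
    zk_connect => "kafka1:2181,kafka2:2181,kafka3:2181,kafka4:2181,kafka5:2181"
    white_list => "applogs"
    group_id => "logstash-mng-applogs-uat-rtp"
    # Illustrative value only; the plugin default is a single consumer thread
    consumer_threads => 4
    codec => avro {
      schema_uri => "/apps/schema/rocana3.schema"
    }
  }
}
```

Changing one parameter at a time and re-measuring the event rate makes it easier to see whether the input really is the limiting stage.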

