Filebeat 2 hours latency

{"@timestamp":"2017-04-07T18:22:59.075Z","beat":{"hostname":"localhost","name":"localhost","version":"5.2.1"},"input_type":"log","message":"2017-04-07T15:19:29.670Z,RACI,EURAB6E15Y=QQ,BRKR,CheckFID=22;Rate=1.083;FID2=25;Rate=1.097;TS=20170407_151929;# of RAT=8;RAT#1=(RABO 1.0642 20170407_151914);RAT#2=(CALP 1.069 20170407_151916);RAT#3=(CALP 1.068 20170407_151917);RAT#4=(BSAB 1.0865 20170407_151919);RAT#5=(BRKR 1.069 20170407_151919);RAT#6=(BRKR 1.069 20170407_151919);RAT#7=(BRKR 1.082 20170407_151919);RAT#8=(BRKR 1.083 20170407_151929),,Print Processed RAT.","offset":25201218,"source":"D:\\works\\VAA\\vae0\\partitioncore4\\test_05v1\\log\\vae0_TolCheck04_20170407_Fri_151158_478.txt","type":"tolcheck"}

Above is the Filebeat output. We can see @timestamp is 2017-04-07T18:22:59.075Z but the event time in the message field is 2017-04-07T15:19:29.670Z, so there is about 3 hours of latency.
I use Logstash 2.3.0 and Filebeat 5.2.1.
Is Filebeat 5.x sensitive to back-pressure? If yes, how can I solve this issue?

This does not necessarily mean that Filebeat is slow. Filebeat can only send events at a rate that Logstash and its outputs can handle. It could therefore be your Logstash configuration or the performance of one of the outputs that is limiting throughput, resulting in delays.
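One way to check this (a sketch, not part of the original reply; the directory below is a placeholder) is to temporarily replace the Logstash output in filebeat.yml with a local file output and compare the publish times against the event times. If the gap disappears, the delay comes from downstream back-pressure rather than from Filebeat reading the files:

#----------------------------- File output (diagnostic sketch) ----------------
# Diagnostic only: write events to local disk instead of Logstash so that no
# network or downstream pipeline is involved.
output.file:
  path: "C:/Program Files/Filebeat/diag"
  filename: filebeat_events
  rotate_every_kb: 10240
  number_of_files: 7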

Thanks @Christian_Dahlqvist! Could the Elasticsearch cluster cause this latency? Or do you have any suggestions on the Logstash configuration?

It could be either. If Elasticsearch is not able to keep up with the traffic, the rest of the pipeline will apply back-pressure and Filebeat will eventually slow down its reading. The same happens if Logstash is not able to process events fast enough even though Elasticsearch could handle more throughput. What does resource usage look like on the Logstash and Elasticsearch nodes?
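A common way to tell the two apart (a sketch, not something suggested in the original reply) is to run the same filter chain with the elasticsearch output temporarily swapped for a stdout output using the dots codec, which prints one dot per processed event. If the dot rate is still low, Logstash or its filters are the limit; if it jumps, Elasticsearch is the one applying the back-pressure:

output {
  # Diagnostic only: one dot per processed event, no Elasticsearch involved.
  stdout { codec => dots }
}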

Sorry @Christian_Dahlqvist, it's my fault: the log above is Filebeat output written to a local file with no Logstash involved, which means there is no network IO.

A bit about the ELK stack deployment in our project:
There are 4 app servers in a cluster; one is live and the others are standby. We only want to collect the live app's log, so we installed 4 Filebeat instances and set up 4 Windows task schedulers, one on each server. The task scheduler monitors the app status: if the app is live, it sets the Filebeat output to Logstash; if the app is standby, it sets the Filebeat output to local files. The scheduler runs every minute.
Since the log above was output to a local file, why is there still about 3 hours of latency?

Could you share your Filebeat config and your Filebeat log? Best in debug mode, because then you can see in detail which event is created when ...

Hi @ruflin, sorry for the late reply.
Filebeat config

#=========================== Filebeat prospectors =============================
filebeat.prospectors: 
- input_type: log 
  # Paths that should be crawled and fetched. Glob based paths. 
  paths: 
    - D:/VAA/vae0/*/*/log/vae0_TolCheck*.txt 
  document_type: tolcheck 
  ### State options 
  # Removes the state for file which cannot be found on disk anymore immediately 
  clean_removed: true   
#========================= Filebeat global options ============================ 
filebeat.registry_file: "C:/Program Files/Filebeat/registry" 
#================================ Outputs =====================================
#----------------------------- Logstash output -------------------------------- 
output.logstash: 
  # The Logstash hosts 
  hosts: ["10.52.144.221:5044"] 
  index: tolreport_test 
#================================ Logging =====================================
# Sets log level. The default log level is info. 
# Available log levels are: critical, error, warning, info, debug 
logging.level: debug 
logging.to_files: true 
logging.files: 
  path: "C:/Program Files/Filebeat/log/" 
  rotateeverybytes: 10485760 # = 10MB 
  keepfiles: 100

Logstash config

input {
  beats {
    port => 5044
	congestion_threshold => 30
  }
}
filter {
  grok {
    match => {
      "message" => "%{TIMESTAMP_ISO8601:EventTime},%{GREEDYDATA:Type},%{GREEDYDATA:Item},%{GREEDYDATA:Contributer},%{GREEDYDATA:Detail},%{GREEDYDATA:ContextInfo},%{GREEDYDATA:Description}"
    }
  }
  #date {
  #  match => ["EventTime","yyyy-MM-dd HH:mm:ss.SSS"]
  #  target => "EventTime"
  #  timezone => "UTC"
  #}
  mutate {
    remove_field => ["[beat][hostname]","[beat][name]","count","fields","input_type","offset","type","beat","@version","message"]
  }
}
#logstash_sit.conf
output {
  elasticsearch {
    hosts => ["http://hdcs-analytics-elasticsearch.int.com:9200"]
    #manage_template => false
    index => "%{[@metadata][beat]}-%{+YYYY.MM.dd}"
    document_type => "%{[@metadata][type]}"
    template => "conf/template_tolreport.json"
    template_name =>"template_tolreport"
    template_overwrite => true
  }
}
Elasticsearch index template (conf/template_tolreport.json)

{
"template": "tolreport-*",
"settings": {
	"index.refresh_interval": "5s",
	"index.number_of_shards": 5,
	"number_of_replicas": 1
},
"mappings": {
	"_default_": {
		"_all": {
			"enabled": false
		},
		"dynamic_templates": [{
			"string-fields": {
				"match_mapping_type": "string",
				"match": "*",
				"mapping": {
					"type": "string",
					"index": "not_analyzed"
				}
			}
		}],
		"properties": {
			"@timestamp": {
				"type": "date",
				"format": "date_time"
			},
			"message": {
				"type": "string",
				"index": "analyzed"
			}
		}
	}
}
}

Hi @ruflin, I designed the two cases below and reproduced the latency issue.

Case 1:
Filebeat(PQ_Server_A) ---> Logstash( Logstash_Server_A, Logstash_Server_B) ---> ES cluster (remote server)

Case 2:
Filebeat(PQ_Server_B) ---> Logstash(Logstash_Server_C, Logstash_Server_D) ---> Local disk

Note:

  1. Filebeat version 5.2.1, Logstash version 2.3.0
  2. The same PQ application runs on PQ_Server A and B, which makes sure the same amount of events is generated.
  3. Logstash_Server_A/B/C/D are in the same LAN.
  4. We ran the two cases simultaneously and found about 7 minutes of latency in case 1.

Below are the test results (Filebeat debug log); I selected one event on each of PQ_Server_A and B to describe the issue.

PQ_Server_A

2017-04-12T09:28:34Z DBG  Publish: {
  "@timestamp": "2017-04-12T09:28:15.113Z",
  "beat": {
    "hostname": "PQ_Server_A",
    "name": "lmhi-ddnmvah26a",
    "version": "5.2.1"
  },
  "input_type": "log",
  "message": "2017-04-12T09:21:05.271Z,SCF,EUR5=,SSAL,FID1=22;Rate=1.06;FID2=25;Rate=1.0631;SSLim=0.00052,,Non-Strict Static SCF. Threshold exceed.",
  "offset": 1031730,
  "source": "D:\\ThomsonReuters\\VAA\\vae0\\partitioncore4\\pq_05v1\\log\\vae0_TolCheck04_20170412_Wed_092056_185.txt",
  "type": "tolcheck"

}

PQ_Server_B

2017-04-12T09:22:32Z DBG  Publish: {
  "@timestamp": "2017-04-12T09:22:32.251Z",
  "beat": {
    "hostname": "PQ_Server_B",
    "name": "lmhi-ddnmvah26b",
    "version": "5.2.1"
  },
  "input_type": "log",
  "message": "2017-04-12T09:22:32.136Z,SCF,EUR5=,SSAL,FID1=22;Rate=1.06;FID2=25;Rate=1.0631;SSLim=0.00052,,Non-Strict Static SCF. Threshold exceed.",
  "offset": 14485405,
  "source": "D:\\ThomsonReuters\\VAA\\vae0\\partitioncore4\\pq_05v1\\log\\vae0_TolCheck04_20170412_Wed_092032_024.txt",
  "type": "tolcheck"
}

Note:
The @timestamp represents the time Filebeat read the PQ raw log, and the time in the message field represents when the raw log entry was generated. We can see there is about 7 minutes of latency in case 1.

So from these cases, can we say the ES cluster in my dev environment causes the latency? If yes, how can I fix it? If no, please help figure out the root cause.

The throughput is 500k/s

I assume the latency is caused by LS or ES applying back-pressure to the Filebeat instance, so FB is throttling the sending of events. Could you share the log output from the Filebeat instance that has the lag?

What do you mean by 500k/s? 500k events per second? 500KB per second?

Thanks @ruflin! 500k/s means 500 KB per second, not 500k events.
I only have the Filebeat debug log, since I set the Filebeat output to Logstash. Do you mean I should set the output to a local file and send that file to you?

Here is the Filebeat debug log, which shows the latency; please compare the @timestamp field with the date/time in the message field:

2017-04-25T00:55:19Z DBG  Publish: {
  "@timestamp": "2017-04-25T00:55:10.972Z",
  "beat": {
    "hostname": "PQ_Server_A",
    "name": "PQ_Server_A",
    "version": "5.2.1"
  },
  "input_type": "log",
  "message": "2017-04-25T00:44:14.276Z,RACI,GBPCHF=,LMAX,CheckFID=22;Rate=1.2728;FID2=25;Rate=1.2729;TS=20170425_004414;# of RAT=6;RAT#1=(R 1.2727 20170425_004412);RAT#2=(R 1.2724 20170425_004413);RAT#3=(R 1.2726 20170425_004413);RAT#4=(DDBK 1.2727 20170425_004413);RAT#5=(R 1.2722 20170425_004413);RAT#6=(LMAX 1.2728 20170425_004414),,Print Processed RAT.",
  "offset": 548677,
  "source": "D:\\VAA\\vae0\\partitioncore5\\pq_05v2\\log\\vae0_TolCheck05_20170425_Tue_004357_943.txt",
  "type": "tolcheck"
}
2017-04-25T00:55:19Z DBG  Publish: {
  "@timestamp": "2017-04-25T00:55:10.972Z",
  "beat": {
    "hostname": "PQ_Server_A",
    "name": "PQ_Server_A",
    "version": "5.2.1"
  },
  "input_type": "log",
  "message": "2017-04-25T00:00:39.981Z,RACI,GBP10=,SGEB,CheckFID=22;Rate=1.2778;FID2=25;Rate=1.278;TS=20170425_000039;# of RAT=10;RAT#1=(DNBO 1.2778 20170425_000032);RAT#2=(COBA 1.2778 20170425_000033);RAT#3=(DDBK 1.2777 20170425_000033);RAT#4=(NEDB 1.2777 20170425_000034);RAT#5=(CMBK 1.2777 20170425_000036);RAT#6=(ALFN 1.2779 20170425_000036);RAT#7=(DDBK 1.2777 20170425_000036);RAT#8=(BARL 1.2778 20170425_000037);RAT#9=(CMBK 1.2778 20170425_000038);RAT#10=(SGEB 1.2778 20170425_000039),,Print Processed RAT.",
  "offset": 3318809,
  "source": "D:\\VAA\\vae0\\partitioncore4\\pq_05v1\\log\\vae0_TolCheck04_20170425_Tue_000000_014.txt",
  "type": "tolcheck"
}
2017-04-25T00:55:19Z DBG  Publish: {
  "@timestamp": "2017-04-25T00:55:10.972Z",
  "beat": {
    "hostname": "PQ_Server_A",
    "name": "PQ_Server_A",
    "version": "5.2.1"
  },
  "input_type": "log",
  "message": "2017-04-24T23:08:58.354Z,RACI,USD4X10F=BRKR,YMPR,CheckFID=22;Rate=1.5464;FID2=25;Rate=1.6064;TS=20170424_230858;# of RAT=4;RAT#1=(TRDT 1.534 20170424_230702);RAT#2=(PYNY 1.552 20170424_230702);RAT#3=(YMPR 1.5476 20170424_230702);RAT#4=(YMPR 1.5464 20170424_230858),,Print Processed RAT.",
  "offset": 5968983,
  "source": "D:\\VAA\\vae0\\partitioncore5\\pq_05v2\\log\\vae0_TolCheck05_20170424_Mon_230519_151.txt",
  "type": "tolcheck"
}
2017-04-25T00:55:19Z DBG  Publish: {
  "@timestamp": "2017-04-25T00:55:10.972Z",
  "beat": {
    "hostname": "PQ_Server_A",
    "name": "PQ_Server_A",
    "version": "5.2.1"
  },
  "input_type": "log",
  "message": "2017-04-24T18:23:19.433Z,RACI,USD7X10F=BRKR,TTKL,CheckFID=22;Rate=1.465;FID2=25;Rate=1.485;TS=20170424_182319;# of RAT=3;RAT#1=(TTKL 1.466 20170424_182316);RAT#2=(PYNY 1.465 20170424_182317);RAT#3=(TTKL 1.465 20170424_182319),,Print Processed RAT.",
  "offset": 23177788,
  "source": "D:\\VAA\\vae0\\partitioncore5\\pq_05v2\\log\\vae0_TolCheck05_20170424_Mon_181506_674.txt",
  "type": "tolcheck"
}
2017-04-25T00:55:19Z DBG  Publish: {
  "@timestamp": "2017-04-25T00:55:10.972Z",
  "beat": {
    "hostname": "PQ_Server_A",
    "name": "PQ_Server_A",
    "version": "5.2.1"
  },
  "input_type": "log",
  "message": "2017-04-24T23:36:47.091Z,SCF,MXN=,TTKL,FID1=22;Rate=18.824;FID2=25;Rate=18.854;SSLim=0.021,,Non-Strict Static SCF. Threshold exceed.",
  "offset": 5478556,
  "source": "D:\\VAA\\vae0\\partitioncore4\\pq_05v1\\log\\vae0_TolCheck04_20170424_Mon_233503_585.txt",
  "type": "tolcheck"
}
2017-04-25T00:55:19Z DBG  Publish: {
  "@timestamp": "2017-04-25T00:55:10.972Z",
  "beat": {
    "hostname": "PQ_Server_A",
    "name": "PQ_Server_A",
    "version": "5.2.1"
  },
  "input_type": "log",
  "message": "2017-04-24T21:29:57.301Z,RACI,EURJPY9=,BSAB,CheckFID=22;Rate=119.2;FID2=25;Rate=119.27;TS=20170424_212957;# of RAT=8;RAT#1=(PBGR 119.21 20170424_212950);RAT#2=(CIBC 119.24 20170424_212950);RAT#3=(DDBK 119.2 20170424_212950);RAT#4=(PBGR 119.2 20170424_212950);RAT#5=(PBGR 119.19 20170424_212952);RAT#6=(PBGR 119.18 20170424_212953);RAT#7=(COBA 119.18 20170424_212954);RAT#8=(BSAB 119.2 20170424_212957),,Print Processed RAT.",
  "offset": 15809697,
  "source": "D:\\VAA\\vae0\\partitioncore4\\pq_05v1\\log\\vae0_TolCheck04_20170424_Mon_212437_581.txt",
  "type": "tolcheck"
}
2017-04-25T00:55:19Z DBG  Publish: {
  "@timestamp": "2017-04-25T00:55:10.972Z",
  "beat": {
    "hostname": "PQ_Server_A",
    "name": "PQ_Server_A",
    "version": "5.2.1"
  },
  "input_type": "log",
  "message": "2017-04-24T23:33:27.962Z,RACI,GBPNOK=,R,CheckFID=22;Rate=10.9365;FID2=25;Rate=10.9455;TS=20170424_233327;# of RAT=6;RAT#1=(SWED 10.9375 20170424_233238);RAT#2=(BARL 10.9397 20170424_233240);RAT#3=(SWED 10.9378 20170424_233242);RAT#4=(NDEA 10.934 20170424_233242);RAT#5=(R 10.9374 20170424_233327);RAT#6=(R 10.9365 20170424_233327),,Print Processed RAT.",
  "offset": 4364982,
  "source": "D:\\VAA\\vae0\\partitioncore5\\pq_05v2\\log\\vae0_TolCheck05_20170424_Mon_233043_595.txt",
  "type": "tolcheck"
}

Hi @ruflin, is there any method to check whether Logstash or ES is busy?

500 KB/s is quite a limited connection. That could explain the delay.

If you look at the info log in Filebeat, you will see some stats every 30 seconds on how many events were sent, acked, and not acked. This should give you an indication of how many events were not sent, which is probably because of the network congestion. For the load on LS and ES, it is best to have a look at the stats of both systems. But with just 500 KB/s I would expect the network to be the bottleneck.
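For the Elasticsearch side, a quick way to look at those stats (a sketch, not part of the original reply; the host is taken from the Logstash output config above) is the standard cat and node-stats APIs. A growing bulk queue or bulk rejections there usually means the cluster itself is applying back-pressure:

GET http://hdcs-analytics-elasticsearch.int.com:9200/_cat/thread_pool?v
GET http://hdcs-analytics-elasticsearch.int.com:9200/_nodes/stats/thread_pool?pretty

If Elasticsearch looks idle, the limit is more likely Logstash or the 500 KB/s link.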

Thanks @ruflin, what does the Filebeat log info below mean?

2017-04-28T07:52:09Z DBG output worker: publish 50 events
2017-04-28T07:52:09Z DBG forwards msg with attempts=-1
2017-04-28T07:52:09Z DBG message forwarded

It means messages were forwarded, but there is more info every 30 seconds with a summary of the events.
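For orientation, that 30-second summary is written at INFO level and looks roughly like the line below (an illustration, not taken from this thread; the exact metric names depend on the Filebeat version). The acked vs. not-acked counters are the ones that hint at back-pressure:

2017-04-28T07:52:39Z INFO Non-zero metrics in the last 30s: libbeat.logstash.published_and_acked_events=1420 libbeat.logstash.published_but_not_acked_events=0 libbeat.logstash.publish.write_bytes=183450 libbeat.publisher.published_events=1420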

This topic was automatically closed after 21 days. New replies are no longer allowed.