Filebeat 2 hours latency

{"@timestamp":"2017-04-07T18:22:59.075Z","beat":{"hostname":"localhost","name":"localhost","version":"5.2.1"},"input_type":"log","message":"2017-04-07T15:19:29.670Z,RACI,EURAB6E15Y=QQ,BRKR,CheckFID=22;Rate=1.083;FID2=25;Rate=1.097;TS=20170407_151929;# of RAT=8;RAT#1=(RABO 1.0642 20170407_151914);RAT#2=(CALP 1.069 20170407_151916);RAT#3=(CALP 1.068 20170407_151917);RAT#4=(BSAB 1.0865 20170407_151919);RAT#5=(BRKR 1.069 20170407_151919);RAT#6=(BRKR 1.069 20170407_151919);RAT#7=(BRKR 1.082 20170407_151919);RAT#8=(BRKR 1.083 20170407_151929),,Print Processed RAT.","offset":25201218,"source":"D:\\works\\VAA\\vae0\\partitioncore4\\test_05v1\\log\\vae0_TolCheck04_20170407_Fri_151158_478.txt","type":"tolcheck"}

Above is the Filebeat output. We can see @timestamp is 2017-04-07T18:22:59.075Z but the event time in the message field is 2017-04-07T15:19:29.670Z, so there is about 3 hours of latency.
I use Logstash 2.3.0 and Filebeat 5.2.1.
Is Filebeat 5.x sensitive to back-pressure? If yes, how can I solve this issue?

This does not necessarily mean that Filebeat is slow. Filebeat can only send events at a rate that Logstash and its outputs can handle. It could therefore be your Logstash configuration or the performance of one of the outputs that is limiting throughput, resulting in delays.
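One way to check this (a sketch, not part of the original reply; the directory below is a placeholder) is to temporarily replace the Logstash output in filebeat.yml with a local file output and compare the publish times against the event times. If the gap disappears, the delay comes from downstream back-pressure rather than from Filebeat reading the files:

#----------------------------- File output (diagnostic sketch) ----------------
# Diagnostic only: write events to local disk instead of Logstash so that no
# network or downstream pipeline is involved.
output.file:
  path: "C:/Program Files/Filebeat/diag"
  filename: filebeat_events
  rotate_every_kb: 10240
  number_of_files: 7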

Thanks @Christian_Dahlqvist! Could the Elasticsearch cluster cause this latency? Or do you have any suggestions on the Logstash configuration?

It could be either. If Elasticsearch is not able to keep up with the traffic, the rest of the pipeline will apply back-pressure and Filebeat will eventually slow down its reading. The same happens if Logstash is not able to process events fast enough even though Elasticsearch could handle more throughput. What does resource usage look like on the Logstash and Elasticsearch nodes?
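A common way to tell the two apart (a sketch, not something suggested in the original reply) is to run the same filter chain with the elasticsearch output temporarily swapped for a stdout output using the dots codec, which prints one dot per processed event. If the dot rate is still low, Logstash or its filters are the limit; if it jumps, Elasticsearch is the one applying the back-pressure:

output {
  # Diagnostic only: one dot per processed event, no Elasticsearch involved.
  stdout { codec => dots }
}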

Sorry @Christian_Dahlqvist, it's my fault: the log above is Filebeat output written to a local file with no Logstash involved, which means there is no network IO.

A bit about the ELK stack deployment in our project:
There are 4 app servers in a cluster; one is live and the others are standby. We only want to collect the live app's log, so we installed 4 Filebeat instances and set up 4 Windows task schedulers, one on each server. The task scheduler monitors the app status: if the app is live, it sets the Filebeat output to Logstash; if the app is standby, it sets the Filebeat output to local files. The scheduler runs every minute.
Since the log above was output to a local file, why is there still about 3 hours of latency?

Could you share your Filebeat config and your Filebeat log? Best in debug mode, because then you can see in detail which event is created when ...

Hi @ruflin, sorry for the late reply.
Filebeat config

#=========================== Filebeat prospectors =============================
filebeat.prospectors: 
- input_type: log 
  # Paths that should be crawled and fetched. Glob based paths. 
  paths: 
    - D:/VAA/vae0/*/*/log/vae0_TolCheck*.txt 
  document_type: tolcheck 
  ### State options 
  # Removes the state for file which cannot be found on disk anymore immediately 
  clean_removed: true   
#========================= Filebeat global options ============================ 
filebeat.registry_file: "C:/Program Files/Filebeat/registry" 
#================================ Outputs =====================================
#----------------------------- Logstash output -------------------------------- 
output.logstash: 
  # The Logstash hosts 
  hosts: ["10.52.144.221:5044"] 
  index: tolreport_test 
#================================ Logging =====================================
# Sets log level. The default log level is info. 
# Available log levels are: critical, error, warning, info, debug 
logging.level: debug 
logging.to_files: true 
logging.files: 
  path: "C:/Program Files/Filebeat/log/" 
  rotateeverybytes: 10485760 # = 10MB 
  keepfiles: 100

Logstash config

input {
  beats {
    port => 5044
	congestion_threshold => 30
  }
}
filter {
  grok {
    match => {
      "message" => "%{TIMESTAMP_ISO8601:EventTime},%{GREEDYDATA:Type},%{GREEDYDATA:Item},%{GREEDYDATA:Contributer},%{GREEDYDATA:Detail},%{GREEDYDATA:ContextInfo},%{GREEDYDATA:Description}"
    }
  }
  #date {
  #  match => ["EventTime","yyyy-MM-dd HH:mm:ss.SSS"]
  #  target => "EventTime"
  #  timezone => "UTC"
  #}
  mutate {
    remove_field => ["[beat][hostname]","[beat][name]","count","fields","input_type","offset","type","beat","@version","message"]
  }
}
#logstash_sit.conf
output {
  elasticsearch {
    hosts => ["http://hdcs-analytics-elasticsearch.int.com:9200"]
    #manage_template => false
    index => "%{[@metadata][beat]}-%{+YYYY.MM.dd}"
    document_type => "%{[@metadata][type]}"
    template => "conf/template_tolreport.json"
    template_name =>"template_tolreport"
    template_overwrite => true
  }
}
Elasticsearch index template (conf/template_tolreport.json)

{
"template": "tolreport-*",
"settings": {
	"index.refresh_interval": "5s",
	"index.number_of_shards": 5,
	"number_of_replicas": 1
},
"mappings": {
	"_default_": {
		"_all": {
			"enabled": false
		},
		"dynamic_templates": [{
			"string-fields": {
				"match_mapping_type": "string",
				"match": "*",
				"mapping": {
					"type": "string",
					"index": "not_analyzed"
				}
			}
		}],
		"properties": {
			"@timestamp": {
				"type": "date",
				"format": "date_time"
			},
			"message": {
				"type": "string",
				"index": "analyzed"
			}
		}
	}
}
}

Hi @ruflin, I designed the two cases below and reproduced the latency issue.

Case 1:
Filebeat(PQ_Server_A) ---> Logstash( Logstash_Server_A, Logstash_Server_B) ---> ES cluster (remote server)

Case 2:
Filebeat(PQ_Server_B) ---> Logstash(Logstash_Server_C, Logstash_Server_D) ---> Local disk

Note:

  1. Filebeat version 5.2.1, Logstash version 2.3.0
  2. The same PQ application runs on PQ_Server A and B, which makes sure the same amount of events is generated.
  3. Logstash_Server_A/B/C/D are in the same LAN.
  4. We ran the two cases simultaneously and found about 7 minutes of latency in case 1.

Below are the test results (Filebeat debug log); I selected one event on each of PQ_Server_A and B to describe the issue.

PQ_Server_A

2017-04-12T09:28:34Z DBG  Publish: {
  "@timestamp": "2017-04-12T09:28:15.113Z",
  "beat": {
    "hostname": "PQ_Server_A",
    "name": "lmhi-ddnmvah26a",
    "version": "5.2.1"
  },
  "input_type": "log",
  "message": "2017-04-12T09:21:05.271Z,SCF,EUR5=,SSAL,FID1=22;Rate=1.06;FID2=25;Rate=1.0631;SSLim=0.00052,,Non-Strict Static SCF. Threshold exceed.",
  "offset": 1031730,
  "source": "D:\\ThomsonReuters\\VAA\\vae0\\partitioncore4\\pq_05v1\\log\\vae0_TolCheck04_20170412_Wed_092056_185.txt",
  "type": "tolcheck"

}

PQ_Server_B

2017-04-12T09:22:32Z DBG  Publish: {
  "@timestamp": "2017-04-12T09:22:32.251Z",
  "beat": {
    "hostname": "PQ_Server_B",
    "name": "lmhi-ddnmvah26b",
    "version": "5.2.1"
  },
  "input_type": "log",
  "message": "2017-04-12T09:22:32.136Z,SCF,EUR5=,SSAL,FID1=22;Rate=1.06;FID2=25;Rate=1.0631;SSLim=0.00052,,Non-Strict Static SCF. Threshold exceed.",
  "offset": 14485405,
  "source": "D:\\ThomsonReuters\\VAA\\vae0\\partitioncore4\\pq_05v1\\log\\vae0_TolCheck04_20170412_Wed_092032_024.txt",
  "type": "tolcheck"
}

Note:
The @timestamp represents the time Filebeat read the PQ raw log, and the time in the message field represents when the raw log entry was generated. We can see there is about 7 minutes of latency in case 1.

So from these cases, can we say the ES cluster in my dev environment causes the latency? If yes, how can I fix it? If no, please help figure out the root cause.

The throughput is 500k/s

I assume the latency is caused by LS or ES applying back-pressure to the Filebeat instance, so FB is throttling the sending of events. Could you share the log output from the Filebeat instance that has the lag?

What do you mean by 500k/s? 500k events per second? 500KB per second?

Thanks @ruflin! 500k/s means 500 KB per second, not 500k events.
I only have the Filebeat debug log, since I set the Filebeat output to Logstash. Do you mean I should set the output to a local file and send that file to you?

Here is the Filebeat debug log, which shows the latency; please compare the @timestamp field with the date/time in the message field:

2017-04-25T00:55:19Z DBG  Publish: {
  "@timestamp": "2017-04-25T00:55:10.972Z",
  "beat": {
    "hostname": "PQ_Server_A",
    "name": "PQ_Server_A",
    "version": "5.2.1"
  },
  "input_type": "log",
  "message": "2017-04-25T00:44:14.276Z,RACI,GBPCHF=,LMAX,CheckFID=22;Rate=1.2728;FID2=25;Rate=1.2729;TS=20170425_004414;# of RAT=6;RAT#1=(R 1.2727 20170425_004412);RAT#2=(R 1.2724 20170425_004413);RAT#3=(R 1.2726 20170425_004413);RAT#4=(DDBK 1.2727 20170425_004413);RAT#5=(R 1.2722 20170425_004413);RAT#6=(LMAX 1.2728 20170425_004414),,Print Processed RAT.",
  "offset": 548677,
  "source": "D:\\VAA\\vae0\\partitioncore5\\pq_05v2\\log\\vae0_TolCheck05_20170425_Tue_004357_943.txt",
  "type": "tolcheck"
}
2017-04-25T00:55:19Z DBG  Publish: {
  "@timestamp": "2017-04-25T00:55:10.972Z",
  "beat": {
    "hostname": "PQ_Server_A",
    "name": "PQ_Server_A",
    "version": "5.2.1"
  },
  "input_type": "log",
  "message": "2017-04-25T00:00:39.981Z,RACI,GBP10=,SGEB,CheckFID=22;Rate=1.2778;FID2=25;Rate=1.278;TS=20170425_000039;# of RAT=10;RAT#1=(DNBO 1.2778 20170425_000032);RAT#2=(COBA 1.2778 20170425_000033);RAT#3=(DDBK 1.2777 20170425_000033);RAT#4=(NEDB 1.2777 20170425_000034);RAT#5=(CMBK 1.2777 20170425_000036);RAT#6=(ALFN 1.2779 20170425_000036);RAT#7=(DDBK 1.2777 20170425_000036);RAT#8=(BARL 1.2778 20170425_000037);RAT#9=(CMBK 1.2778 20170425_000038);RAT#10=(SGEB 1.2778 20170425_000039),,Print Processed RAT.",
  "offset": 3318809,
  "source": "D:\\VAA\\vae0\\partitioncore4\\pq_05v1\\log\\vae0_TolCheck04_20170425_Tue_000000_014.txt",
  "type": "tolcheck"
}
2017-04-25T00:55:19Z DBG  Publish: {
  "@timestamp": "2017-04-25T00:55:10.972Z",
  "beat": {
    "hostname": "PQ_Server_A",
    "name": "PQ_Server_A",
    "version": "5.2.1"
  },
  "input_type": "log",
  "message": "2017-04-24T23:08:58.354Z,RACI,USD4X10F=BRKR,YMPR,CheckFID=22;Rate=1.5464;FID2=25;Rate=1.6064;TS=20170424_230858;# of RAT=4;RAT#1=(TRDT 1.534 20170424_230702);RAT#2=(PYNY 1.552 20170424_230702);RAT#3=(YMPR 1.5476 20170424_230702);RAT#4=(YMPR 1.5464 20170424_230858),,Print Processed RAT.",
  "offset": 5968983,
  "source": "D:\\VAA\\vae0\\partitioncore5\\pq_05v2\\log\\vae0_TolCheck05_20170424_Mon_230519_151.txt",
  "type": "tolcheck"
}
2017-04-25T00:55:19Z DBG  Publish: {
  "@timestamp": "2017-04-25T00:55:10.972Z",
  "beat": {
    "hostname": "PQ_Server_A",
    "name": "PQ_Server_A",
    "version": "5.2.1"
  },
  "input_type": "log",
  "message": "2017-04-24T18:23:19.433Z,RACI,USD7X10F=BRKR,TTKL,CheckFID=22;Rate=1.465;FID2=25;Rate=1.485;TS=20170424_182319;# of RAT=3;RAT#1=(TTKL 1.466 20170424_182316);RAT#2=(PYNY 1.465 20170424_182317);RAT#3=(TTKL 1.465 20170424_182319),,Print Processed RAT.",
  "offset": 23177788,
  "source": "D:\\VAA\\vae0\\partitioncore5\\pq_05v2\\log\\vae0_TolCheck05_20170424_Mon_181506_674.txt",
  "type": "tolcheck"
}
2017-04-25T00:55:19Z DBG  Publish: {
  "@timestamp": "2017-04-25T00:55:10.972Z",
  "beat": {
    "hostname": "PQ_Server_A",
    "name": "PQ_Server_A",
    "version": "5.2.1"
  },
  "input_type": "log",
  "message": "2017-04-24T23:36:47.091Z,SCF,MXN=,TTKL,FID1=22;Rate=18.824;FID2=25;Rate=18.854;SSLim=0.021,,Non-Strict Static SCF. Threshold exceed.",
  "offset": 5478556,
  "source": "D:\\VAA\\vae0\\partitioncore4\\pq_05v1\\log\\vae0_TolCheck04_20170424_Mon_233503_585.txt",
  "type": "tolcheck"
}
2017-04-25T00:55:19Z DBG  Publish: {
  "@timestamp": "2017-04-25T00:55:10.972Z",
  "beat": {
    "hostname": "PQ_Server_A",
    "name": "PQ_Server_A",
    "version": "5.2.1"
  },
  "input_type": "log",
  "message": "2017-04-24T21:29:57.301Z,RACI,EURJPY9=,BSAB,CheckFID=22;Rate=119.2;FID2=25;Rate=119.27;TS=20170424_212957;# of RAT=8;RAT#1=(PBGR 119.21 20170424_212950);RAT#2=(CIBC 119.24 20170424_212950);RAT#3=(DDBK 119.2 20170424_212950);RAT#4=(PBGR 119.2 20170424_212950);RAT#5=(PBGR 119.19 20170424_212952);RAT#6=(PBGR 119.18 20170424_212953);RAT#7=(COBA 119.18 20170424_212954);RAT#8=(BSAB 119.2 20170424_212957),,Print Processed RAT.",
  "offset": 15809697,
  "source": "D:\\VAA\\vae0\\partitioncore4\\pq_05v1\\log\\vae0_TolCheck04_20170424_Mon_212437_581.txt",
  "type": "tolcheck"
}
2017-04-25T00:55:19Z DBG  Publish: {
  "@timestamp": "2017-04-25T00:55:10.972Z",
  "beat": {
    "hostname": "PQ_Server_A",
    "name": "PQ_Server_A",
    "version": "5.2.1"
  },
  "input_type": "log",
  "message": "2017-04-24T23:33:27.962Z,RACI,GBPNOK=,R,CheckFID=22;Rate=10.9365;FID2=25;Rate=10.9455;TS=20170424_233327;# of RAT=6;RAT#1=(SWED 10.9375 20170424_233238);RAT#2=(BARL 10.9397 20170424_233240);RAT#3=(SWED 10.9378 20170424_233242);RAT#4=(NDEA 10.934 20170424_233242);RAT#5=(R 10.9374 20170424_233327);RAT#6=(R 10.9365 20170424_233327),,Print Processed RAT.",
  "offset": 4364982,
  "source": "D:\\VAA\\vae0\\partitioncore5\\pq_05v2\\log\\vae0_TolCheck05_20170424_Mon_233043_595.txt",
  "type": "tolcheck"
}

Hi @ruflin, is there any method to check whether Logstash or ES is busy?

500 KB/s is quite a limited connection. That could explain the delay.

If you look at the info log in Filebeat, you will see some stats every 30 seconds on how many events were sent, acked, and not acked. This should give you an indication of how many events were not sent, which is probably because of the network congestion. For the load on LS and ES, it is best to have a look at the stats of both systems. But with just 500 KB/s I would expect the network to be the bottleneck.
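For the Elasticsearch side, a quick way to look at those stats (a sketch, not part of the original reply; the host is taken from the Logstash output config above) is the standard cat and node-stats APIs. A growing bulk queue or bulk rejections there usually means the cluster itself is applying back-pressure:

GET http://hdcs-analytics-elasticsearch.int.com:9200/_cat/thread_pool?v
GET http://hdcs-analytics-elasticsearch.int.com:9200/_nodes/stats/thread_pool?pretty

If Elasticsearch looks idle, the limit is more likely Logstash or the 500 KB/s link.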

Thanks @ruflin, what does the Filebeat log info below mean?

2017-04-28T07:52:09Z DBG output worker: publish 50 events
2017-04-28T07:52:09Z DBG forwards msg with attempts=-1
2017-04-28T07:52:09Z DBG message forwarded

It means messages were forwarded, but there is more info every 30 seconds with a summary of the events.
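For orientation, that 30-second summary is written at INFO level and looks roughly like the line below (an illustration, not taken from this thread; the exact metric names depend on the Filebeat version). The acked vs. not-acked counters are the ones that hint at back-pressure:

2017-04-28T07:52:39Z INFO Non-zero metrics in the last 30s: libbeat.logstash.published_and_acked_events=1420 libbeat.logstash.published_but_not_acked_events=0 libbeat.logstash.publish.write_bytes=183450 libbeat.publisher.published_events=1420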

This topic was automatically closed after 21 days. New replies are no longer allowed.