Hi guys, first let me explain my setup.
We're using Filebeat, Kibana and Elasticsearch 6.3.x:
Windows Server (Filebeat with the IIS module, plus Metricbeat) -> Logstash servers -> Elasticsearch <- Kibana
Configs:
Filebeat.yml:

#=========================== Filebeat inputs =============================
filebeat.inputs:
- type: log
  enabled: true
  # Paths that should be crawled and fetched. Glob based paths.
  paths:
    - L:\Logs\L*\*\*.log
    - L:\Logs\L*\*\*\*.log
  multiline.pattern: '^[0-9]{4}-[0-9]{2}-[0-9]{2}\s[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}\s\['
  multiline.negate: true
  multiline.match: after
  exclude_files: ['^_Error-(.*)+']
  tags: ["platform-logs"]

#============================= Filebeat modules ===============================
filebeat.config.modules:
  path: ${path.config}/modules.d/*.yml
  reload.enabled: true
  reload.period: 60s

#==================== Elasticsearch template setting ==========================
setup.template.settings:
  index.number_of_shards: 1
  index.codec: best_compression
  #_source.enabled: false

#================================ General =====================================
name: APP-01-LON

#================================ Outputs =====================================
# Configure what output to use when sending the data collected by the beat.

#----------------------------- Logstash output --------------------------------
output.logstash:
  hosts: ["log1:5001", "log2:5001", "log3:5001"]
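One thing I noticed while writing this up: in Filebeat 6.x, listing three hosts does not load-balance by default; Filebeat picks one host and only fails over on error. A hypothetical tuning sketch of the output section (the values are guesses, not our current config):

```
#----------------------------- Logstash output --------------------------------
output.logstash:
  hosts: ["log1:5001", "log2:5001", "log3:5001"]
  # defaults to false: without it, Filebeat sends everything to a single
  # host from the list and only moves on when that host errors
  loadbalance: true
  # publisher workers per host (default 1)
  worker: 2
  # events per bulk request to Logstash (default 2048)
  bulk_max_size: 2048
```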
Logstash pipeline conf:

input {
  beats {
    port => 5001
  }
}

filter {
  # platform logs (.NET)
  if "platform-logs" in [tags] {
    mutate {
      copy => { "@timestamp" => "beat_timestamp" }
    }
    grok {
      match => {
        "message" => [
          "%{GREEDYDATA:time} \[%{DATA:if.platform}\] \[%{DATA:if.app}\] \[%{DATA:if.correlationid}\] \[%{DATA:if.threadid}\] \[%{DATA:if.loglevel}\] \[%{DATA:if.logsource}\] - %{GREEDYDATA:if.logmessage}"
        ]
      }
      remove_field => [ "message" ]
    }
    date {
      match => [ "time", "yyyy-MM-dd HH:mm:ss,SSS" ]
      remove_field => [ "time" ]
    }
  }
  else if "syslog-logs" in [tags] {
    grok {
      match => { "message" => "%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}" }
      add_field => {
        "received_at"   => "%{@timestamp}"
        "received_from" => "%{host}"
      }
    }
    mutate {
      add_field => { "read_timestamp" => "%{@timestamp}" }
    }
    date {
      match => [ "syslog_timestamp", "MMM  d HH:mm:ss", "MMM dd HH:mm:ss" ]
    }
  }
  # nodejs logs (gel)
  else if "nodejs-logs" in [tags] {
    mutate {
      copy   => { "@timestamp" => "beat_timestamp" }
      rename => {
        "correlationID" => "if.correlationid"
        "threadID"      => "if.threadid"
        "platformKey"   => "if.platform"
        "level"         => "if.loglevel"
        "message"       => "if.logmessage"
        "source"        => "if.logsource"
        "app"           => "if.app"
      }
    }
    date {
      match => [ "timestamp", "ISO8601" ]
      remove_field => [ "timestamp" ]
    }
  }
  # metric logs (.NET)
  else if "metricbeat" in [tags] {
  }
}

output {
  if "nodejs-logs" in [tags] {
    elasticsearch {
      hosts => ["els-ip1", "els-ip2", "els-ip3"]
      index => "nodejs-%{+YYYY.MM.dd}"
      manage_template => false
      user => "logstash_writer"
      password => ''
    }
  }
  else if "iis-logs" in [tags] {
    elasticsearch {
      hosts => ["els-ip1", "els-ip2", "els-ip3"]
      index => "filebeat-%{+YYYY.MM.dd}"
      manage_template => false
      user => "logstash_writer"
      password => ''
    }
  }
  else if "platform-logs" in [tags] {
    elasticsearch {
      hosts => ["els-ip1", "els-ip2", "els-ip3"]
      index => "platform-%{+YYYY.MM.dd}"
      manage_template => false
      user => "logstash_writer"
      password => ''
    }
  }
  else if "syslog-logs" in [tags] {
    elasticsearch {
      hosts => ["els-ip1", "els-ip2", "els-ip3"]
      index => "syslog-%{+YYYY.MM.dd}"
      manage_template => false
      user => "logstash_writer"
      password => ''
    }
  }
  else if "metricbeat" in [tags] {
    elasticsearch {
      hosts => ["els-ip1", "els-ip2", "els-ip3"]
      index => "metricbeat-%{+YYYY.MM.dd}"
      manage_template => false
      user => "logstash_writer"
      password => ''
    }
  }
}
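For reference, a made-up log line of the shape that both the multiline pattern and the platform-logs grok pattern above expect (all values are invented for illustration):

```
2018-07-10 14:23:01,123 [PLAT1] [OrderService] [3f9c0a1b] [42] [INFO] [OrderController] - Order created successfully
```

The leading `yyyy-MM-dd HH:mm:ss,SSS [` prefix is what the multiline pattern keys on, and each bracketed field maps to one of the `if.*` fields in the grok match.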
All machines are on a VMware hypervisor, with the following specs:
Windows app server: 8 vCPUs, 8 GB RAM (Windows Server 2016)
Logstash servers: 8 vCPUs, 16 GB RAM, 8 GB JVM heap (CentOS 7.5)
ES servers: 12 vCPUs, 64 GB RAM, 32 GB JVM heap (CentOS 7.5)
All machines connect to a SAN (Dell Compellent) over fibre/10-gig NICs.
We're using X-Pack with monitoring and security.
We currently have 155 indices, 846 million documents indexed, 719 GB disk usage, 347 primary shards and 348 replica shards.
Indices are created daily: filebeat-%{+YYYY.MM.dd}, platform-%{+YYYY.MM.dd}, etc.
3 Logstash nodes, 3 ES nodes (with the Kibana server acting as a coordinating-only node to load-balance requests to the ES cluster; it's not a data, ingest or master node).
There are 43 active beats running Metricbeat and Filebeat, and we're using persistent queues in Logstash.
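On the persistent queues: if a queue fills up, Logstash stops reading from the beats input and the back-pressure lands on Filebeat, which would show up exactly as pickup delay. A hypothetical logstash.yml snippet (the sizing here is a guess, not our actual config):

```
# sketch only -- values are illustrative, not our current settings
queue.type: persisted
# default max_bytes is 1gb; once the queue is full, Logstash stops
# accepting events from Filebeat until it drains
queue.max_bytes: 4gb
```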
Phew, I think I've given a good overview.
Now to the problem: the consultant Elastic sent to help us set things up suggested we add a beat_timestamp field to record when Filebeat actually picked up a log entry, while the @timestamp field holds the log entry's own time, so we could see if there was any 'drift' between the log being written to disk on the Windows server and it being picked up by Filebeat.
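To make that drift easier to chart, something like the following ruby filter could be added to the pipeline (this is a hypothetical addition, not part of our current config; `drift_seconds` is a field name I made up). It works because beat_timestamp is copied before the date filter rewrites @timestamp:

```
# hypothetical filter -- computes pickup delay per event
ruby {
  code => "
    if event.get('beat_timestamp')
      event.set('drift_seconds',
                event.get('beat_timestamp').to_f - event.get('@timestamp').to_f)
    end
  "
}
```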
Today I noticed that this drift can be as much as 45 minutes, but I'm not sure why. I believe we have enough 'grunt', but I'm unsure whether there are Filebeat settings I could change to fix this.
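For what it's worth, these are the input-side knobs I've been looking at, with the 6.x defaults noted; the values below are guesses for discussion, not recommendations:

```
filebeat.inputs:
- type: log
  # how often Filebeat checks for new/updated files (default 10s)
  scan_frequency: 5s
  # per-harvester read buffer in bytes (default 16384)
  harvester_buffer_size: 65536

# Filebeat's internal memory queue (default 4096 events)
queue.mem:
  events: 8192
```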
Our indices are set to 1 primary shard and 1 replica shard, as per the consultant's suggestion.
Example of the issue (I've had to obfuscate some fields, but it should hopefully be enough to show the problem):
Let me know if you need any more information, but hopefully I've covered everything. I'm seeing this drift across the board, but it's bigger on the busiest Windows servers (the ones writing more log entries per second).