Thanks for looking into this.
Server Specs & Memory:
Sorry I missed the total memory in the first post.
CPU: 32 Cores
RAM: 1 TB (Yes, 1 Terabyte physical RAM)
Elasticsearch Heap: 31GB (to stay under the compressed oops threshold).
Logstash Heap: 8GB.
OS Cache: The rest (~900GB) is left to the OS for filesystem caching, which Lucene relies on.
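For completeness, the Elasticsearch heap is pinned with equal min/max so it never resizes (this is the drop-in override style available on ES 7.7+; on older versions the same two lines go directly in jvm.options):

```
# /etc/elasticsearch/jvm.options.d/heap.options  (path assumes a package install)
-Xms31g
-Xmx31g
```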
Since I have a high volume of logs and a single Logstash instance with many cores, I configured multiple listeners to avoid a single TCP input thread becoming a bottleneck at the input stage.
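The listener side is just one beats input per port, roughly like this (a sketch of the pipeline input; the actual filter/output stages are omitted, and the ports mirror the Filebeat output below):

```
# conf.d/00-inputs.conf -- four beats listeners, one per port (sketch)
input {
  beats { port => 4000 }
  beats { port => 4001 }
  beats { port => 4002 }
  beats { port => 4003 }
}
```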
filebeat.yml (the config is almost identical on every beat host):
filebeat.config.modules:
  path: ${path.config}/modules.d/*.yml
  reload.enabled: false

queue.mem:
  events: 2048
  flush.min_events: 512

filebeat.inputs:
- paths:
    - /apache-tomcat-ws*/logs/catalina.out
    - /var/log/tomcats/*/catalina.out
  exclude_lines: ['.*MemcachedConnection.*','.*INFO: Added.*','.*Shut down memcached client.*','.*Could not redistribute to another node.*','.*Conexiones activas.*','.*ZV - EvaluarCondicion.*','.*evaluacion de resultado.*','.*Hay nodos.*','.*Cache hit.*']
  document_type: _doc
  fields:
    log_type: "w"
  fields_under_root: true
  scan_frequency: 10s
  close_inactive: 5m
  clean_inactive: 15m
  ignore_older: 10m
  close_removed: true
  tail_files: true
  close_timeout: 15m
  close_renamed: true
  force_close_files: false
  harvester_buffer_size: 16384
  max_bytes: 10485760
  tags: ["wap"]
- paths:
    - /var/log/smpp*.log
  exclude_lines: ['.*MemcachedConnection.*','.*INFO: Added.*','.*Shut down memcached client.*','.*Could not redistribute to another node.*','.*Conexiones activas.*','.*ZV - EvaluarCondicion.*','.*evaluacion de resultado.*','.*Hay nodos.*','.*Cache hit.*']
  document_type: _doc
  fields:
    log_type: "colas-smpp"
  fields_under_root: true
  close_inactive: 1m
  clean_inactive: 2h
  ignore_older: 1h
  close_removed: true
  tail_files: true
  close_timeout: 2h
  close_renamed: true
  force_close_files: false
- paths:
    - /var/log/receptor*.log
  exclude_lines: ['.*MemcachedConnection.*','.*INFO: Added.*','.*Shut down memcached client.*','.*Could not redistribute to another node.*','.*Conexiones activas.*','.*ZV - EvaluarCondicion.*','.*evaluacion de resultado.*','.*Hay nodos.*','.*Cache hit.*']
  document_type: _doc
  fields:
    log_type: "receptores-sms"
  fields_under_root: true
  close_inactive: 1m
  clean_inactive: 5m
  ignore_older: 2m
  close_removed: true
  tail_files: true
  close_timeout: 2h
  close_renamed: true
  force_close_files: true

output.logstash:
  hosts: ["LOGSTASHIP:4000","LOGSTASHIP:4001","LOGSTASHIP:4002","LOGSTASHIP:4003"]
  loadbalance: true
  worker: 4
  pipelining: 5

logging.level: info
logging.to_files: true
logging.to_syslog: false
logging.files:
  path: /var/log/filebeat
  name: filebeat.log
  keepfiles: 7
  permissions: 0600
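One thing I am now second-guessing in the config above: the Logstash output runs worker: 4 per host with pipelining: 5, and if I read the docs right each worker pulls batches of up to bulk_max_size events (which defaults to 2048 for the Logstash output), so my 2048-event memory queue holds only a single batch and may be starving the outputs. A roomier queue might look like this (the numbers are my guess, not something I have tested):

```yaml
queue.mem:
  events: 32768          # several full output batches; guesswork, not benchmarked
  flush.min_events: 2048 # hand a full batch to the output when possible...
  flush.timeout: 1s      # ...but never hold events longer than a second
```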
logstash.yml (aside from the path locations, these are the only settings I changed):
# ------------ Pipeline Settings --------------
#
# The ID of the pipeline.
#
# pipeline.id: main
#
# Set the number of workers that will, in parallel, execute the filters+outputs
# stage of the pipeline.
#
# This defaults to the number of the host's CPU cores.
#
pipeline.workers: 12
#
# How many events to retrieve from inputs before sending to filters+workers
#
pipeline.batch.size: 2048
#
# How long to wait in milliseconds while polling for the next event
# before dispatching an undersized batch to filters+outputs
#
# pipeline.batch.delay: 50
#
# Force Logstash to exit during shutdown even if there are still inflight
# events in memory. By default, logstash will refuse to quit until all
# received events have been pushed to the outputs.
#
# WARNING: Enabling this can lead to data loss during shutdown
#
# pipeline.unsafe_shutdown: false
#
# Set the pipeline event ordering. Options are "auto" (the default), "true" or "false".
# "auto" automatically enables ordering if the 'pipeline.workers' setting
# is also set to '1', and disables otherwise.
# "true" enforces ordering on the pipeline and prevents logstash from starting
# if there are multiple workers.
# "false" disables any extra processing necessary for preserving ordering.
#
# pipeline.ordered: auto
#
# Sets the pipeline's default value for `ecs_compatibility`, a setting that is
# available to plugins that implement an ECS Compatibility mode for use with
# the Elastic Common Schema.
# Possible values are:
# - disabled
# - v1
# - v8 (default)
# Pipelines defined before Logstash 8 operated without ECS in mind. To ensure a
# migrated pipeline continues to operate as it did before your upgrade, opt-OUT
# of ECS for the individual pipeline in its `pipelines.yml` definition. Setting
# it here will set the default for _all_ pipelines, including new ones.
#
# pipeline.ecs_compatibility: v8
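For what it's worth, my understanding is that the in-flight ceiling is pipeline.workers × pipeline.batch.size, so the settings above allow 12 × 2048 = 24,576 events in memory at once. The same tuning expressed per-pipeline in pipelines.yml would be (the config path here is illustrative, not my actual one):

```yaml
# pipelines.yml -- same settings, scoped to one pipeline (sketch)
- pipeline.id: main
  path.config: "/etc/logstash/conf.d/*.conf"  # illustrative path
  pipeline.workers: 12       # in-flight ceiling: 12 * 2048 = 24576 events
  pipeline.batch.size: 2048
```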
Yes, the delay varies between 1 and 2 hours, sometimes 12. It is not tied to a specific application type or to server hardware; the delay hits logs of the same type indiscriminately. It is not the case that "Log Type A" arrives fast and "Log Type B" arrives slow.
If I filter in Kibana for a single log type (e.g., just the catalina.out logs), I see a mix of timestamps: some records arrive in real time, while others from the exact same source/type arrive with a huge delay.