Hi all,
We would like to seek your advice on an issue we are having with Logstash. We currently receive data from 18 servers/clients, each with 9 check metrics that create 9 indices in ES per day. The event data from each client are sent to Logstash in parallel every 10 minutes via TCP (at times there are thousands of records for a single client). The issue is that, now that we are in Production, we are seeing inconsistencies and missing data in Elasticsearch even though the data should have been received and processed by Logstash.
We cannot find any relevant message (warning/critical) in the Logstash logs even with debug logging enabled. The health status of both Logstash and Elasticsearch shows green in the monitoring dashboard. Average CPU utilization is minimal at around 15%, and JVM heap usage is about 456 MB.
We replayed some of the data that should have been processed into our DEV environment to see whether we could reproduce the issue, but everything worked fine there. The one thing we cannot reproduce is the volume of traffic coming into the system, since that is Production data. So we suspect this may be related to the event load in Production compared to Development. Our config is fairly close to the defaults, so we're not sure if we missed anything.
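For reference, the data flow described above corresponds to a pipeline of roughly this shape (simplified sketch only; the port, codec, and index pattern below are placeholders, not our exact config):

```
# Illustrative sketch - actual port, codec, and index naming differ
input {
  tcp {
    port  => 5000
    codec => json_lines
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    # one index per check metric per day; "metric" is a placeholder field name
    index => "%{metric}-%{+YYYY.MM.dd}"
  }
}
```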
Elasticsearch: 6.3.0
Logstash: 6.3.0
Kibana: 6.3.0
# ------------ Pipeline Settings --------------
# The ID of the pipeline.
pipeline.id: main
# Set the number of workers that will, in parallel, execute the filters+outputs
# stage of the pipeline.
# This defaults to the number of the host's CPU cores.
pipeline.workers: 2
# How many events to retrieve from inputs before sending to filters+workers
pipeline.batch.size: 125
# How long to wait in milliseconds while polling for the next event
# before dispatching an undersized batch to filters+outputs
pipeline.batch.delay: 50
# Force Logstash to exit during shutdown even if there are still inflight
# events in memory. By default, logstash will refuse to quit until all
# received events have been pushed to the outputs.
# WARNING: enabling this can lead to data loss during shutdown
pipeline.unsafe_shutdown: false
# ------------ Queuing Settings --------------
# Internal queuing model, "memory" for legacy in-memory based queuing and
# "persisted" for disk-based acked queueing. Defaults is memory
queue.type: memory
# If using queue.type: persisted, the directory path where the data files will be stored.
# Default is path.data/queue
path.queue:
# If using queue.type: persisted, the page data files size. The queue data consists of
# append-only data files separated into pages. Default is 64mb
queue.page_capacity: 64mb
# If using queue.type: persisted, the maximum number of unread events in the queue.
# Default is 0 (unlimited)
queue.max_events: 0
# If using queue.type: persisted, the total capacity of the queue in number of bytes.
# If you would like more unacked events to be buffered in Logstash, you can increase the
# capacity using this setting. Please make sure your disk drive has capacity greater than
# the size specified here. If both max_bytes and max_events are specified, Logstash will pick
# whichever criteria is reached first
# Default is 1024mb or 1gb
queue.max_bytes: 1024mb
# If using queue.type: persisted, the maximum number of acked events before forcing a checkpoint
# Default is 1024, 0 for unlimited
queue.checkpoint.acks: 1024
# If using queue.type: persisted, the maximum number of written events before forcing a checkpoint
# Default is 1024, 0 for unlimited
queue.checkpoint.writes: 1024
# If using queue.type: persisted, the interval in milliseconds when a checkpoint is forced on the head page
# Default is 1000, 0 for no periodic checkpoint.
queue.checkpoint.interval: 1000
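For completeness, this is our reading of how the disk-based acked queue described in the comments above would be enabled in logstash.yml, in case it is relevant to the missing data (the values shown are just the documented defaults, not tuned recommendations):

```
# Sketch only: switching from the in-memory queue to the persisted (disk-based, acked) queue
queue.type: persisted
# path.queue left unset, so it falls back to path.data/queue
queue.max_bytes: 1024mb        # documented default
queue.checkpoint.writes: 1024  # documented default
```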