We would like to seek your advice regarding an issue we are having with Logstash. We currently receive data from 18 servers/clients, each reporting 9 check metrics that create 9 indices in Elasticsearch per day. Each client sends its event data to Logstash in parallel every 10 minutes via TCP (occasionally there are thousands of records for a single client). The issue: now that we are in production, we observe inconsistencies and missing data in Elasticsearch, even though the data should have been received and processed by Logstash.
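For context, our input side is essentially a plain TCP input along the lines of the following (the port and codec here are illustrative, not our exact values):

```conf
input {
  tcp {
    port  => 5000        # illustrative port; each client connects here every 10 minutes
    codec => json_lines  # assumes clients send newline-delimited JSON events
  }
}
```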
We cannot find any relevant message (warning/error) in the Logstash logs, even with debug logging enabled. The health status of both Logstash and Elasticsearch shows green on the monitoring dashboard. Average CPU utilization is minimal at around 15%, and the used JVM heap is 456 MB.
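Beyond the dashboard, we plan to compare events in versus events out directly from the Logstash node stats API (9600 is the default API port), to check whether events are arriving at the input but not reaching the output:

```shell
# Query pipeline event counters (in / filtered / out) from the monitoring API
curl -s 'http://localhost:9600/_node/stats/pipelines?pretty'
```

If `events.in` is consistently higher than `events.out`, events are being lost somewhere inside the pipeline rather than before it.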
We replayed some of the data that should have been processed into our DEV environment to see if we could reproduce the issue, but everything worked fine there. The only thing we cannot reproduce is the volume of traffic coming into the system, since that is production data. We therefore suspect this may be related to the event load in production compared to development. Our configuration is fairly default, so we are not sure if we missed anything.
```yaml
# ------------ Pipeline Settings --------------
# The ID of the pipeline.
pipeline.id: main
# Set the number of workers that will, in parallel, execute the filters+outputs
# stage of the pipeline. This defaults to the number of the host's CPU cores.
pipeline.workers: 2
# How many events to retrieve from inputs before sending to filters+workers
pipeline.batch.size: 125
# How long to wait in milliseconds while polling for the next event before
# dispatching an undersized batch to filters+outputs
pipeline.batch.delay: 50
# Force Logstash to exit during shutdown even if there are still inflight events
# in memory. By default, logstash will refuse to quit until all received events
# have been pushed to the outputs.
# WARNING: enabling this can lead to data loss during shutdown
pipeline.unsafe_shutdown: false
# ------------ Queuing Settings --------------
# Internal queuing model, "memory" for legacy in-memory based queuing and
# "persisted" for disk-based acked queueing. Defaults is memory
queue.type: memory
# If using queue.type: persisted, the directory path where the data files will
# be stored. Default is path.data/queue
path.queue:
# If using queue.type: persisted, the page data files size. The queue data
# consists of append-only data files separated into pages. Default is 64mb
queue.page_capacity: 64mb
# If using queue.type: persisted, the maximum number of unread events in the
# queue. Default is 0 (unlimited)
queue.max_events: 0
# If using queue.type: persisted, the total capacity of the queue in number of
# bytes. If you would like more unacked events to be buffered in Logstash, you
# can increase the capacity using this setting. Please make sure your disk
# drive has capacity greater than the size specified here.
# If both max_bytes and max_events are specified, Logstash will pick whichever
# criteria is reached first. Default is 1024mb or 1gb
queue.max_bytes: 1024mb
# If using queue.type: persisted, the maximum number of acked events before
# forcing a checkpoint. Default is 1024, 0 for unlimited
queue.checkpoint.acks: 1024
# If using queue.type: persisted, the maximum number of written events before
# forcing a checkpoint. Default is 1024, 0 for unlimited
queue.checkpoint.writes: 1024
# If using queue.type: persisted, the interval in milliseconds when a
# checkpoint is forced on the head page.
# Default is 1000, 0 for no periodic checkpoint.
queue.checkpoint.interval: 1000
```
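One change we are considering, based on the documented settings above, is switching from the in-memory queue to the persisted queue, so events are acknowledged to disk instead of being held only in memory (where they can be dropped under pressure or on restart). A minimal sketch of that change in `logstash.yml` would be:

```yaml
# Switch to the disk-based, acknowledged queue
queue.type: persisted
# Total on-disk capacity of the queue; the disk must have at least this much free
queue.max_bytes: 1024mb
```

Would this be the right direction for our situation, or is there another setting we should look at first?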