Heavy Traffic on Logstash causing data inconsistencies

Hi all,

We would like to seek your advice on an issue we are having with Logstash. We currently receive data from 18 servers/clients, each with 9 check metrics that create 9 indices in ES per day. The event data for each client is sent to Logstash in parallel every 10 minutes via TCP (occasionally thousands of records for a single client). The issue is that, now that we are in production, we have observed inconsistencies and missing data in Elasticsearch, even though the data should have been received and processed by Logstash.
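For reference, our pipeline is essentially a TCP input feeding an Elasticsearch output, roughly like the sketch below (the port, codec, and index pattern are placeholders, not our exact values):

```
# Sketch of the pipeline config (port, codec, and index name are placeholders)
input {
  tcp {
    port => 5044
    codec => json_lines
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "metrics-%{+YYYY.MM.dd}"
  }
}
```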

We cannot find any relevant warning or error messages in the Logstash logs, even with debug mode enabled. The health status of both Logstash and Elasticsearch shows green on the monitoring dashboard. Average CPU utilization is minimal at around 15%, and JVM heap usage is 456 MB.

We replicated some of the data that should have been processed into our DEV environment to see if we could reproduce the issue, but everything worked fine there. The one thing we cannot reproduce is the volume of traffic coming into the system, since that is production data. We therefore suspect this may be related to the event load in production compared to development. Our configuration is fairly default, so we are not sure if we have missed anything.

Elasticsearch: 6.3.0
Logstash: 6.3.0
Kibana: 6.3.0

```yaml
# ------------ Pipeline Settings --------------

# The ID of the pipeline.
pipeline.id: main

# Set the number of workers that will, in parallel, execute the filters+outputs
# stage of the pipeline.
# This defaults to the number of the host's CPU cores.
pipeline.workers: 2

# How many events to retrieve from inputs before sending to filters+workers
pipeline.batch.size: 125

# How long to wait in milliseconds while polling for the next event
# before dispatching an undersized batch to filters+outputs
pipeline.batch.delay: 50

# Force Logstash to exit during shutdown even if there are still inflight
# events in memory. By default, logstash will refuse to quit until all
# received events have been pushed to the outputs.
# WARNING: enabling this can lead to data loss during shutdown
pipeline.unsafe_shutdown: false

# ------------ Queuing Settings --------------

# Internal queuing model, "memory" for legacy in-memory based queuing and
# "persisted" for disk-based acked queueing. Default is memory
queue.type: memory

# If using queue.type: persisted, the directory path where the data files will be stored.
# Default is path.data/queue

# If using queue.type: persisted, the page data files size. The queue data consists of
# append-only data files separated into pages. Default is 64mb
queue.page_capacity: 64mb

# If using queue.type: persisted, the maximum number of unread events in the queue.
# Default is 0 (unlimited)
queue.max_events: 0

# If using queue.type: persisted, the total capacity of the queue in number of bytes.
# If you would like more unacked events to be buffered in Logstash, you can increase the
# capacity using this setting. Please make sure your disk drive has capacity greater than
# the size specified here. If both max_bytes and max_events are specified, Logstash will
# pick whichever criterion is reached first.
# Default is 1024mb or 1gb
queue.max_bytes: 1024mb

# If using queue.type: persisted, the maximum number of acked events before forcing a checkpoint
# Default is 1024, 0 for unlimited
queue.checkpoint.acks: 1024

# If using queue.type: persisted, the maximum number of written events before forcing a checkpoint
# Default is 1024, 0 for unlimited
queue.checkpoint.writes: 1024

# If using queue.type: persisted, the interval in milliseconds when a checkpoint is forced on the head page
# Default is 1000, 0 for no periodic checkpoint.
queue.checkpoint.interval: 1000
```
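One option we are considering, based on the comments in the settings above: switching from the in-memory queue to the disk-based acked queue, so events are buffered on disk under load instead of being held only in memory. A sketch of the change we would try (the size is an untuned guess, not a recommended value):

```yaml
# Sketch: switch to the persisted (disk-based, acked) queue
queue.type: persisted
# Allow more unacked events to be buffered before inputs are blocked
# (4gb is an illustrative guess; disk must have at least this much free space)
queue.max_bytes: 4gb
```

Would this be the right direction for our load pattern, or is there another setting we should look at first?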
