Identifying Pipeline Bottlenecks (Lost Events)

First time poster, long time lurker.

I am using the Elastic Stack to process and analyze XML documents which are sent to me over HTTP. I currently have the following pipeline set up to realise this behaviour:

  1. Node.JS (receives the document over HTTP and does some processing)
  2. Logstash formats the XML as JSON and does further processing on some fields (a rough sketch of this configuration follows the list)
  3. Elastic indexes the documents
    (4. Kibana for visualisation)
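
For reference, the Logstash side of this (steps 2 and 3) looks roughly like the sketch below. The port, field names, hosts and index pattern are placeholders rather than my actual config:

```
input {
  tcp {
    port => 5000            # Node.JS forwards the raw XML here, one event per line
  }
}

filter {
  xml {
    source => "message"     # the raw XML string
    target => "doc"         # parsed into a nested JSON object
  }
  # ... further processing on some fields ...
}

output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
    index => "events-%{+YYYY.MM.dd}"
  }
}
```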

This works great on our live system (Windows server + 3 x CentOS Elastic cluster); however, I am migrating to a containerized solution on our test system, to be eventually rolled out to live.

Many events (approximately 55%, or 5,135 events over a 15-minute period) are being lost on the test system and I do not know where. I know this because I can find event ids which exist on the live system but not on the test system (they share a data feed). Does anyone have any ideas on how I could go about identifying which part of this pipeline is the bottleneck causing events to be missed? Any help would be much appreciated.
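
In case it helps, this is the kind of ingress-side audit logging I am thinking of adding to Node.JS so I can diff the received ids against what ends up in the index. It is only a rough sketch using Node's built-in modules; the `<eventId>` element and the log path are placeholders for my real setup:

```js
// Rough sketch: log every received event id at the Node.JS ingress so the file
// can later be diffed against the ids that actually reach the Elasticsearch index.
// The <eventId> element and the log path are placeholders.
const http = require('http');
const fs = require('fs');

const audit = fs.createWriteStream('ingress-audit.log', { flags: 'a' });

http.createServer((req, res) => {
  let body = '';
  req.on('data', chunk => { body += chunk; });
  req.on('end', () => {
    // Pull the event id out of the incoming XML (placeholder element name).
    const match = body.match(/<eventId>([^<]+)<\/eventId>/);
    if (match) audit.write(`${new Date().toISOString()} ${match[1]}\n`);

    // ... existing processing and forwarding to Logstash goes here ...

    res.writeHead(204);
    res.end();
  });
}).listen(8080);
```

The ids (or just the count) recorded over a window can then be compared against what a date-range query on the index returns for the same window, which should narrow down where the drop happens.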

Talking to myself here, for anyone who is using a similar setup. Logstash-to-Elastic uses backpressure, signalled through HTTP status codes (429 responses when the cluster is too busy to accept), to buffer events in memory. This means that, in theory, events cannot be lost between these two components.
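
One caveat on that: because the buffer is in memory by default, a Logstash crash or restart can still drop whatever is queued at that moment. Logstash's persistent queue feature moves the buffer to disk; enabling it is a couple of lines in logstash.yml (the size below is illustrative):

```
# logstash.yml -- illustrative values
queue.type: persisted      # default is "memory"
queue.max_bytes: 1gb       # cap on disk usage for the queue
```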

In my previous setup, Node.JS was sending events to Logstash over TCP and, if Logstash was too busy to accept, the send would time out. My solution had a timeout of 3 seconds and no retry, which meant those events were likely missed.
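
To illustrate, the forwarding code was doing something along these lines (simplified; host, port and framing are placeholders), and you can see that a timeout simply throws the event away:

```js
// Simplified sketch of a TCP forwarder that can silently drop events:
// if Logstash is applying backpressure and nothing happens on the socket
// within the timeout, the event is logged and discarded rather than retried.
const net = require('net');

function sendToLogstash(jsonLine) {
  const socket = net.connect({ host: 'logstash', port: 5000 });
  socket.setTimeout(3000); // the 3 second timeout mentioned above

  socket.on('connect', () => {
    socket.end(jsonLine + '\n'); // write the event and close
  });

  socket.on('timeout', () => {
    // No retry and no buffer here -- the event is lost.
    console.error('Logstash timed out, dropping event');
    socket.destroy();
  });

  socket.on('error', err => {
    console.error('Logstash connection error, dropping event:', err.message);
  });
}
```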

The solution is to put a message queue, such as Apache Kafka, between Node.JS and Logstash; then, for any events which are still lost, you can be fairly certain the loss happened within Node.JS.
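
As a rough sketch of what the Node.JS side looks like with Kafka in between (using the kafkajs client here, but any client will do; the broker address and topic name are placeholders):

```js
// Rough sketch: publish each received XML document to Kafka instead of writing
// straight to Logstash. kafkajs is just one client choice; broker and topic
// names are placeholders.
const { Kafka } = require('kafkajs');

const kafka = new Kafka({ clientId: 'xml-receiver', brokers: ['kafka:9092'] });
const producer = kafka.producer();

// Connect once at startup; a failure here is loud and visible rather than
// an event silently disappearing.
producer.connect().catch(err => {
  console.error('Could not connect to Kafka:', err.message);
  process.exit(1);
});

async function publishEvent(xml) {
  await producer.send({
    topic: 'xml-events',
    messages: [{ value: xml }],
  });
}

module.exports = { publishEvent };
```

Logstash then reads from the same topic with its kafka input plugin, and because the topic is durable, anything Logstash cannot keep up with simply waits in Kafka instead of being dropped.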
