Logstash HTTP Output Stuck/Hanging

I have the weirdest problem.

We use Logstash deployed via Octopus on AWS EC2 instances to process an assortment of local files (mostly IIS logs, some application logs, performance counters, etc.). All of these EC2 instances are Windows-based, specifically Windows Server 2012 R2.

I recently pushed 5.4.1 out to all of our deployment locations (going from 5.0.2) and switched the outputs from TCP to HTTP (still going to the same host, which is a Logstash broker layer that we use for aggregation into our ELK stack).
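
For context, the output section is essentially just the stock http output pointed at the broker. Stripped down, it looks something like this (the URL below is a placeholder, not our real broker address):

output {
  http {
    id => "send-events-to-elk-broker-over-http"
    # placeholder URL; the real config points at our broker layer
    url => "http://elk-broker.example.com:8080"
    http_method => "post"
    format => "json"
  }
}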

Most of the instances were fine and continued to output events as expected.

Two specific sets of instances stopped outputting events.

In trying to diagnose why this was the case, I started digging through the Logstash logs (which are pretty amazing) and querying its local monitoring API for stats and metrics, but I couldn't find anything that pointed towards the failure.

Each time it stops outputting events, the /stats/pipeline output looks like this:

{
  "host" : "a-hostname",
  "version" : "5.4.1",
  "http_address" : "127.0.0.1:9600",
  "id" : "a82b92de-d402-4b99-a569-a859c8e00d2b",
  "name" : "a-name",
  "pipeline" : {
    "events" : {
      "duration_in_millis" : 11293,
      "in" : 377,
      "filtered" : 375,
      "out" : 250,
      "queue_push_duration_in_millis" : 28027
    },
    "plugins" : {
      "inputs" : [ {
        "id" : "get-events-from-api-iis-logs",
        "events" : {
          "out" : 129,
          "queue_push_duration_in_millis" : 14075
        },
        "name" : "file"
      }, {
        "id" : "get-events-from-system-stats",
        "events" : {
          "out" : 248,
          "queue_push_duration_in_millis" : 13952
        },
        "name" : "file"
      }, {
        "id" : "get-events-from-api-application-logs",
        "events" : {
          "out" : 0,
          "queue_push_duration_in_millis" : 0
        },
        "name" : "file"
      } ],
      "filters" : [ {
        "id" : "break-iis-fields-down-via-csv",
        "events" : {
          "duration_in_millis" : 1135,
          "in" : 128,
          "out" : 128
        },
        "name" : "csv"
      }, {
        "id" : "break-system-stats-fields-down-via-csv",
        "events" : {
          "duration_in_millis" : 1041,
          "in" : 247,
          "out" : 247
        },
        "name" : "csv"
      }, {
        "id" : "timestamp-system-stats-from-event-time",
        "events" : {
          "duration_in_millis" : 295,
          "in" : 247,
          "out" : 247
        },
        "matches" : 247,
        "name" : "date"
      }, {
        "id" : "set-api-image-flag-to-true",
        "events" : {
          "duration_in_millis" : 610,
          "in" : 119,
          "out" : 119
        },
        "name" : "mutate"
      }, {
        "id" : "set-api-image-flag-to-false",
        "events" : {
          "duration_in_millis" : 0,
          "in" : 9,
          "out" : 9
        },
        "name" : "mutate"
      }, {
        "id" : "drop-header-row-from-system-stats",
        "name" : "drop"
      }, {
        "id" : "set-type-to-logs",
        "events" : {
          "duration_in_millis" : 224,
          "in" : 375,
          "out" : 375
        },
        "name" : "mutate"
      }, {
        "id" : "timestamp-application-event-from-event-time",
        "name" : "date"
      }, {
        "id" : "80e1264da701ce77080d05bcbc5e7b371dcd7df9-4",
        "name" : "drop"
      }, {
        "id" : "timestamp-iis-event-from-event-time",
        "events" : {
          "duration_in_millis" : 401,
          "in" : 128,
          "out" : 128
        },
        "matches" : 128,
        "name" : "date"
      }, {
        "id" : "parse-fields-from-api-path",
        "events" : {
          "duration_in_millis" : 496,
          "in" : 128,
          "out" : 128
        },
        "matches" : 128,
        "patterns_per_field" : {
          "Path" : 5
        },
        "name" : "grok"
      }, {
        "id" : "convert-iis-fields-to-actual-numbers-and-dates",
        "events" : {
          "duration_in_millis" : 177,
          "in" : 128,
          "out" : 128
        },
        "name" : "mutate"
      }, {
        "id" : "add-fields-common-to-all-events",
        "events" : {
          "duration_in_millis" : 735,
          "in" : 375,
          "out" : 375
        },
        "name" : "mutate"
      }, {
        "id" : "convert-system-stats-fields-to-actual-numbers",
        "events" : {
          "duration_in_millis" : 847,
          "in" : 247,
          "out" : 247
        },
        "name" : "mutate"
      } ],
      "outputs" : [ {
        "id" : "send-events-to-elk-broker-over-http",
        "events" : {
          "duration_in_millis" : 5441,
          "in" : 375,
          "out" : 250
        },
        "name" : "http"
      } ]
    },
    "reloads" : {
      "last_error" : null,
      "successes" : 0,
      "last_success_timestamp" : null,
      "last_failure_timestamp" : null,
      "failures" : 0
    },
    "queue" : {
      "type" : "memory"
    },
    "id" : "main"
  }
}

Looking at the output above, it looks like something is wrong with the HTTP output (in = 375, out = 250) and that's creating back pressure?

Honestly at this point I have no idea.

I also wondered whether some leftover state from the previous version of Logstash was causing problems.

I spun up some brand new EC2 instances that have only ever had Logstash 5.4.1 on them. They output log events for a little while and then stop, with the same symptoms as the others.

It really does feel like something is going wrong in the HTTP output and it's creating back pressure, but there are no errors, so I've got no indication what that something is :frowning:

The fact it only occurs on this particular subset of machines (but consistently across all machines in the environment) is also very strange.
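
One way to confirm that theory would be to temporarily swap the http output for a purely local sink and see whether events keep flowing; something like this (sketch only, the path is hypothetical):

output {
  # temporary stand-in for the http output, just to check whether
  # events keep flowing once the broker is out of the picture
  file {
    path => "C:/logs/logstash-output-test.log"
  }
}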

I rescind my previous comment about it working while TRACE logging is on; it just happens to work sometimes and I was getting (un)lucky.

I've got detailed TRACE output now, and it is entirely unhelpful. For example, this appears to be exactly the point where the HTTP output stops doing its job:

[2017-06-19T15:40:46,114][DEBUG][logstash.pipeline        ] output received {"event"=>{...event was here...}}
[2017-06-19T15:40:50,052][DEBUG][logstash.pipeline        ] Pushing flush onto pipeline
[2017-06-19T15:40:59,083][DEBUG][logstash.inputs.file     ] _globbed_files: C:/logs/{name}/*.log: glob is: ["C:/logs/{name}/20170619-00001.log"]
[2017-06-19T15:41:14,521][DEBUG][logstash.inputs.file     ] _globbed_files: C:/logs/{name}/*.log: glob is: ["C:/logs/{name}/20170619-00001.log"]

The pipeline does a flush and then nothing else. Those glob lines continue pretty much uninterrupted after that.

The files on disk continue to grow and change, but Logstash is just not picking anything up anymore.

Typically there are exactly BATCH_SIZE (i.e. 125, the default pipeline.batch_size) events sitting uncompleted, and they can be accounted for by the difference between the in and out counters in the http output plugin stats (375 in - 250 out = 125 in the output above).
