Logstash HTTP Output Stuck/Hanging

I have the weirdest problem.

We use Logstash deployed via Octopus on AWS EC2 instances to process an assortment of local files (mostly IIS logs, some application logs, performance counters, etc.). All of these EC2 instances are Windows-based, specifically Windows Server 2012 R2.

I recently pushed 5.4.1 out to all of our deployment locations (going from 5.0.2) and switched the outputs from TCP to HTTP (still going to the same host, which is a Logstash broker layer that we use for aggregation into our ELK stack).
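
For context, the output section is essentially just the stock http output pointed at the broker. Stripped down, it looks something like this (the URL below is a placeholder, not our real broker address):

output {
  http {
    id => "send-events-to-elk-broker-over-http"
    # placeholder URL; the real config points at our broker layer
    url => "http://elk-broker.example.com:8080"
    http_method => "post"
    format => "json"
  }
}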

Most of the instances were fine and continued to output events as expected.

Two specific sets of instances stopped outputting events.

In trying to diagnose why this was the case, I started digging through the Logstash logs (which are pretty amazing) and querying its local monitoring API for stats and metrics, but I couldn't find anything that pointed towards the failure.

Each time it stops outputting events, the /stats/pipeline output looks like this:

{
  "host" : "a-hostname",
  "version" : "5.4.1",
  "http_address" : "127.0.0.1:9600",
  "id" : "a82b92de-d402-4b99-a569-a859c8e00d2b",
  "name" : "a-name",
  "pipeline" : {
    "events" : {
      "duration_in_millis" : 11293,
      "in" : 377,
      "filtered" : 375,
      "out" : 250,
      "queue_push_duration_in_millis" : 28027
    },
    "plugins" : {
      "inputs" : [ {
        "id" : "get-events-from-api-iis-logs",
        "events" : {
          "out" : 129,
          "queue_push_duration_in_millis" : 14075
        },
        "name" : "file"
      }, {
        "id" : "get-events-from-system-stats",
        "events" : {
          "out" : 248,
          "queue_push_duration_in_millis" : 13952
        },
        "name" : "file"
      }, {
        "id" : "get-events-from-api-application-logs",
        "events" : {
          "out" : 0,
          "queue_push_duration_in_millis" : 0
        },
        "name" : "file"
      } ],
      "filters" : [ {
        "id" : "break-iis-fields-down-via-csv",
        "events" : {
          "duration_in_millis" : 1135,
          "in" : 128,
          "out" : 128
        },
        "name" : "csv"
      }, {
        "id" : "break-system-stats-fields-down-via-csv",
        "events" : {
          "duration_in_millis" : 1041,
          "in" : 247,
          "out" : 247
        },
        "name" : "csv"
      }, {
        "id" : "timestamp-system-stats-from-event-time",
        "events" : {
          "duration_in_millis" : 295,
          "in" : 247,
          "out" : 247
        },
        "matches" : 247,
        "name" : "date"
      }, {
        "id" : "set-api-image-flag-to-true",
        "events" : {
          "duration_in_millis" : 610,
          "in" : 119,
          "out" : 119
        },
        "name" : "mutate"
      }, {
        "id" : "set-api-image-flag-to-false",
        "events" : {
          "duration_in_millis" : 0,
          "in" : 9,
          "out" : 9
        },
        "name" : "mutate"
      }, {
        "id" : "drop-header-row-from-system-stats",
        "name" : "drop"
      }, {
        "id" : "set-type-to-logs",
        "events" : {
          "duration_in_millis" : 224,
          "in" : 375,
          "out" : 375
        },
        "name" : "mutate"
      }, {
        "id" : "timestamp-application-event-from-event-time",
        "name" : "date"
      }, {
        "id" : "80e1264da701ce77080d05bcbc5e7b371dcd7df9-4",
        "name" : "drop"
      }, {
        "id" : "timestamp-iis-event-from-event-time",
        "events" : {
          "duration_in_millis" : 401,
          "in" : 128,
          "out" : 128
        },
        "matches" : 128,
        "name" : "date"
      }, {
        "id" : "parse-fields-from-api-path",
        "events" : {
          "duration_in_millis" : 496,
          "in" : 128,
          "out" : 128
        },
        "matches" : 128,
        "patterns_per_field" : {
          "Path" : 5
        },
        "name" : "grok"
      }, {
        "id" : "convert-iis-fields-to-actual-numbers-and-dates",
        "events" : {
          "duration_in_millis" : 177,
          "in" : 128,
          "out" : 128
        },
        "name" : "mutate"
      }, {
        "id" : "add-fields-common-to-all-events",
        "events" : {
          "duration_in_millis" : 735,
          "in" : 375,
          "out" : 375
        },
        "name" : "mutate"
      }, {
        "id" : "convert-system-stats-fields-to-actual-numbers",
        "events" : {
          "duration_in_millis" : 847,
          "in" : 247,
          "out" : 247
        },
        "name" : "mutate"
      } ],
      "outputs" : [ {
        "id" : "send-events-to-elk-broker-over-http",
        "events" : {
          "duration_in_millis" : 5441,
          "in" : 375,
          "out" : 250
        },
        "name" : "http"
      } ]
    },
    "reloads" : {
      "last_error" : null,
      "successes" : 0,
      "last_success_timestamp" : null,
      "last_failure_timestamp" : null,
      "failures" : 0
    },
    "queue" : {
      "type" : "memory"
    },
    "id" : "main"
  }
}

Looking at the output above, it looks like something is wrong with the HTTP output (in = 375, out = 250) and that's creating back pressure?

Honestly at this point I have no idea.

I also wondered whether some leftover state from the previous version of Logstash was causing problems.

I spun up some brand new EC2 instances that have only ever had Logstash 5.4.1 on them. They output log events for a little while and then stop, with the same symptoms as the others.

It really does feel like something is going wrong in the HTTP output and it's creating back pressure, but there are no errors, so I've got no indication what that something is :frowning:

The fact it only occurs on this particular subset of machines (but consistently across all machines in the environment) is also very strange.
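
One way to confirm that theory would be to temporarily swap the http output for a purely local sink and see whether events keep flowing; something like this (sketch only, the path is hypothetical):

output {
  # temporary stand-in for the http output, just to check whether
  # events keep flowing once the broker is out of the picture
  file {
    path => "C:/logs/logstash-output-test.log"
  }
}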

I rescind my previous comment about it working while TRACE logging is on; it just happens to work sometimes and I was getting (un)lucky.

I've got detailed TRACE output now, and it is entirely unhelpful. For example, this appears to be exactly the point where the HTTP output stops doing its job:

[2017-06-19T15:40:46,114][DEBUG][logstash.pipeline        ] output received {"event"=>{...event was here...}}
[2017-06-19T15:40:50,052][DEBUG][logstash.pipeline        ] Pushing flush onto pipeline
[2017-06-19T15:40:59,083][DEBUG][logstash.inputs.file     ] _globbed_files: C:/logs/{name}/*.log: glob is: ["C:/logs/{name}/20170619-00001.log"]
[2017-06-19T15:41:14,521][DEBUG][logstash.inputs.file     ] _globbed_files: C:/logs/{name}/*.log: glob is: ["C:/logs/{name}/20170619-00001.log"]

The pipeline does a flush and then nothing else. Those glob lines continue pretty much uninterrupted after that.

The files on disk continue to grow and change, but Logstash is just not picking anything up anymore.

Typically there are exactly BATCH_SIZE (i.e. 125, the default pipeline.batch_size) events sitting uncompleted, and they can be accounted for by the difference between the in and out counters in the http output plugin stats (375 in - 250 out = 125 in the output above).
