I have the weirdest problem.
We use Logstash deployed via Octopus on AWS EC2 Instances to process an assortment of local files (mostly IIS logs, some application logs, performance counters, etc). All of these EC2 instances are Windows based, specifically Windows Server 2012 R2.
I recently pushed 5.4.1 out to all of our deployment locations (going from 5.0.2) and switch the outputs from TCP to HTTP (still going to the same host, which is a Logstash Broker layer that we use for aggregation into our ELK stack).
Most of the instances were fine and continued to output events as expected.
Two specific sets of instances stopped outputting events.
In trying to diagnose why this was the case, I started using the Logstash logs (which are pretty amazing) and its local API for stats and metrics and stuff, but I couldn't find anything that would point towards the failure.
Each time it stops outputting events, the /stats/pipeline output looks like this:
{
"host" : "a-hostname",
"version" : "5.4.1",
"http_address" : "127.0.0.1:9600",
"id" : "a82b92de-d402-4b99-a569-a859c8e00d2b",
"name" : "a-name",
"pipeline" : {
"events" : {
"duration_in_millis" : 11293,
"in" : 377,
"filtered" : 375,
"out" : 250,
"queue_push_duration_in_millis" : 28027
},
"plugins" : {
"inputs" : [ {
"id" : "get-events-from-api-iis-logs",
"events" : {
"out" : 129,
"queue_push_duration_in_millis" : 14075
},
"name" : "file"
}, {
"id" : "get-events-from-system-stats",
"events" : {
"out" : 248,
"queue_push_duration_in_millis" : 13952
},
"name" : "file"
}, {
"id" : "get-events-from-api-application-logs",
"events" : {
"out" : 0,
"queue_push_duration_in_millis" : 0
},
"name" : "file"
} ],
"filters" : [ {
"id" : "break-iis-fields-down-via-csv",
"events" : {
"duration_in_millis" : 1135,
"in" : 128,
"out" : 128
},
"name" : "csv"
}, {
"id" : "break-system-stats-fields-down-via-csv",
"events" : {
"duration_in_millis" : 1041,
"in" : 247,
"out" : 247
},
"name" : "csv"
}, {
"id" : "timestamp-system-stats-from-event-time",
"events" : {
"duration_in_millis" : 295,
"in" : 247,
"out" : 247
},
"matches" : 247,
"name" : "date"
}, {
"id" : "set-api-image-flag-to-true",
"events" : {
"duration_in_millis" : 610,
"in" : 119,
"out" : 119
},
"name" : "mutate"
}, {
"id" : "set-api-image-flag-to-false",
"events" : {
"duration_in_millis" : 0,
"in" : 9,
"out" : 9
},
"name" : "mutate"
}, {
"id" : "drop-header-row-from-system-stats",
"name" : "drop"
}, {
"id" : "set-type-to-logs",
"events" : {
"duration_in_millis" : 224,
"in" : 375,
"out" : 375
},
"name" : "mutate"
}, {
"id" : "timestamp-application-event-from-event-time",
"name" : "date"
}, {
"id" : "80e1264da701ce77080d05bcbc5e7b371dcd7df9-4",
"name" : "drop"
}, {
"id" : "timestamp-iis-event-from-event-time",
"events" : {
"duration_in_millis" : 401,
"in" : 128,
"out" : 128
},
"matches" : 128,
"name" : "date"
}, {
"id" : "parse-fields-from-api-path",
"events" : {
"duration_in_millis" : 496,
"in" : 128,
"out" : 128
},
"matches" : 128,
"patterns_per_field" : {
"Path" : 5
},
"name" : "grok"
}, {
"id" : "convert-iis-fields-to-actual-numbers-and-dates",
"events" : {
"duration_in_millis" : 177,
"in" : 128,
"out" : 128
},
"name" : "mutate"
}, {
"id" : "add-fields-common-to-all-events",
"events" : {
"duration_in_millis" : 735,
"in" : 375,
"out" : 375
},
"name" : "mutate"
}, {
"id" : "convert-system-stats-fields-to-actual-numbers",
"events" : {
"duration_in_millis" : 847,
"in" : 247,
"out" : 247
},
"name" : "mutate"
} ],
"outputs" : [ {
"id" : "send-events-to-elk-broker-over-http",
"events" : {
"duration_in_millis" : 5441,
"in" : 375,
"out" : 250
},
"name" : "http"
} ]
},
"reloads" : {
"last_error" : null,
"successes" : 0,
"last_success_timestamp" : null,
"last_failure_timestamp" : null,
"failures" : 0
},
"queue" : {
"type" : "memory"
},
"id" : "main"
}
}
Looking at the output above, it looks like something is wrong with the HTTP output and that's causing pushback?
Honestly at this point I have no idea.