Problem:
LogStash is going OOM and restarting itself, causing data loss. Our average document size is in the 1-2KB range. We have a few large* 175MB JSON documents which, for some reason, chew up almost 5GB of RAM going through LogStash, so we upped the heap to 8GB. Despite this we still see LogStash OOM 10+ times per day, even when large* documents are not coming through; probably because it sees ElasticSearch as being unresponsive.
Every few seconds LogStash shows "Marking URL as dead" because of a connection reset, followed by a few failed bulk operations and then a connection restore. Eventually this leads to an OOM from LogStash and data loss.
*Large documents: For a variety of unfortunate reasons some of the documents we push through are close to 175MB. Most range from a few KB to a few tens of MB. In the large documents a single field, named "data" in the mapping, takes up most of the space. We have increased the max HTTP size in ElasticSearch to 512MB just to have plenty of headroom for ingesting these. Longer term we know we need a different solution for large data blobs.
Question
Does anyone have any ideas about why LogStash is going OOM? ElasticSearch, LogStash, and system monitoring all show everything being relatively idle (disk, CPU, etc.), even while the LogStash heap grows until it hits the 8GB limit and causes the OOM.
Architecture
We have a 3 node ELK cluster running 6.4.3, with 2 data+master nodes and 1 master-only node. There are 540 indices, 2074 shards, 490M documents, and 390GB of data. Index rate is 500/s on primary shards (1000/s total). Search rate is very low (sub 10/s average) with the biggest spike being 100/s. The indices are split into a few types:
- System logs in general
- Internal application specific data
- Parse failures
The data nodes are Dell R740XDs, each running 2 x Xeon 6126 with 12 cores / 24 threads at 2.6GHz each, for a total of 48 threads per server. There is a total of 192GB of RAM and 12 x 4TB 7200RPM HDDs, each mounted independently and set up as a separate data directory for ES. Both data nodes are also running LogStash and Kibana.
For the data flow, everything goes through LogStash, rolling a new index every day. Documents for the internal application specific data hit a filter chain with an "if/else if" chain that falls down to:
mutate {
    remove_field => [ "port" ]
}
date {
    match => ["timestamp", "UNIX_MS"]
    remove_field => ["timestamp"]
}
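For readers unfamiliar with the layout described above, here is a rough sketch of how an "if/else if" chain and a daily index can fit together in one pipeline. It is an illustration only, not our actual config: the `[type]` field, its values, the `[@metadata][index_prefix]` field, the index prefixes, and the host URL are all placeholders made up for the example.

```
filter {
  # Placeholder routing field; the real chain may branch on something else.
  if [type] == "syslog" {
    mutate { add_field => { "[@metadata][index_prefix]" => "syslog" } }
  } else if [type] == "app" {
    # The branch shown above: drop "port" and parse the epoch-millis timestamp.
    mutate {
      remove_field => [ "port" ]
      add_field => { "[@metadata][index_prefix]" => "app" }
    }
    date {
      match => ["timestamp", "UNIX_MS"]
      remove_field => ["timestamp"]
    }
  } else {
    # Anything unmatched goes to a catch-all index.
    mutate { add_field => { "[@metadata][index_prefix]" => "unmatched" } }
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]   # placeholder host
    # The date pattern in the index name is what rolls a new index each day.
    index => "%{[@metadata][index_prefix]}-%{+YYYY.MM.dd}"
  }
}
```

Fields under `[@metadata]` are not sent to ElasticSearch, so the routing prefix does not end up in the stored documents.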
We are running OpenJDK 1.8.0_181 on CentOS 7.5.