- Elasticsearch 1.7.0: 2×6-core Intel Xeon 2.0 GHz with HT (24 cores total), 30 GB heap, 10×1 TB partitions.
  java version "1.7.0_79", OpenJDK Runtime Environment (IcedTea 2.5.5) (7u79-2.5.5-1~deb8u1), OpenJDK 64-Bit Server VM (build 24.79-b02, mixed mode)
- Our cluster consists of one data+master node and one search load balancer node (master=false, data=false).
- Logstash 1.5.2: OpenVZ container, Debian 7, 8 GB memory, 3 GB heap, 64 filter threads.
  java version "1.7.0_75", OpenJDK Runtime Environment (IcedTea 2.5.4) (7u75-2.5.4-1~deb7u1), OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)
- The lumberjack input plugin is the latest version and includes the circuit-breaker mechanism.
- The Logstash configuration is in the second reply (and in the GitHub issue), due to the post body size limit.
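For context, here is a minimal sketch of the *shape* of our input section — this is illustrative only (ports, certificate paths, and type names here are placeholders, not our actual config, which is in the second reply):

```
input {
  tcp {
    port => 5514                 # CDN logs — no circuit breaker on this input
    type => "cdn"
  }
  lumberjack {
    port => 5043                 # nginx / rails / postfix via logstash-forwarder
    ssl_certificate => "/etc/logstash/forwarder.crt"
    ssl_key         => "/etc/logstash/forwarder.key"
    type => "app"
  }
}
```

The point is simply that one pipeline is fed by both a circuit-breaker-equipped input (lumberjack) and one without (tcp).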
In times of low throughput (outside peak input hours), the event flow appears balanced between CDN logs and all other log types.
During peak input hours, we see serious degradation on the lumberjack inputs (Nginx / Rails / Postfix), whereas TCP input (CDN log) throughput stays roughly constant.
These images show the discrepancy between TCP logs and other (lumberjack) logs for a given time period:
Lumberjack (Nginx) records:
Nginx event count, by production host
NB: from around 21:00 the TCP logs drop off completely for some time, and the lumberjack input rate is much lower from then on. This is detailed in the last screenshot above, where each colour is a production app server (we have four high-volume production app servers on lumberjack inputs, as well as our low-volume staging servers; each one ships both nginx and rails logs via logstash-forwarder). It is interesting, and possibly relevant, that prior to the hiccough all four servers were emitting roughly equal volumes of log data, but afterwards the proportions seemed to change.
Here’s what we assume might be happening, based on the behaviour described above:
- The TCP input, which has no circuit breaker, hammers the pipeline and blocks (or throttles) the lumberjack input from pushing events. When the lumberjack input can't push events to its filter event queue, it refuses new connections.
- The lumberjack circuit breaker limits shippers one by one, starting with the heavy hitters. The logic is equivalent, or at least similar, to TCP congestion control: repeatedly halve the throughput of the top abusers until things stabilize, then slowly raise the limits again.
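To make the second hypothesis concrete, here is a toy Ruby sketch of the multiplicative-decrease / slow-raise behaviour we are imagining — this is our guess at the shape of the logic, not the actual logstash circuit-breaker implementation:

```ruby
# Hypothetical per-shipper throttle: halve the rate limit on back-pressure,
# then raise it slowly while things are stable. Illustration only — not
# taken from the logstash-input-lumberjack source.
class Throttle
  attr_reader :limit

  def initialize(max_rate)
    @max = max_rate
    @limit = max_rate
  end

  # Pipeline signalled back-pressure: cut the current limit in half.
  def trip
    @limit = [@limit / 2, 1].max
  end

  # Called periodically while stable: raise the limit a small step at a time.
  def relax(step = 10)
    @limit = [@limit + step, @max].min
  end
end

t = Throttle.new(1000)
t.trip   # limit drops to 500
t.trip   # limit drops to 250
t.relax  # limit creeps back up to 260
```

If something like this is in play, a shipper that keeps tripping the breaker would be held at a low rate for a long time, which could explain the changed proportions between app servers after the hiccough.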
Out of all this come the following questions:
- Is there any basis for our assumptions above?
- Is our use case sane? Is such a setup (mixing circuit-breaker and non-circuit-breaker inputs) sane?
- Can any logstash developers confirm or deny our hypotheses above?
What do we have to do to enable our Logstash instance to cope with this? We're happy to provide more detail on our setup, and to try any suggestions the community may have about how to move forward.
Yarden and DevOps Team