Latency with 2 Elasticsearch systems

I have the requirement of collecting logs and then sending some (based on source log file path name) to one Elasticsearch server and the rest to another one.

In order to accomplish this, I'm using Filebeat to send to a Logstash server. The Logstash server is collecting, filtering (grok, tagging, etc.) and outputting logs to the two Elasticsearch servers.

On Logstash, in the filtering stage, I'm tagging certain logs that need to go to the first ES server and then using an if statement in the output to send tagged logs one way and non-tagged ones the other way.
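Roughly, what I'm doing looks like this; the hosts, path, tag name, and source-path field are placeholders (the real config also does grok and other filtering):

filter {
  # Tag events whose source file path marks them for the second ES server.
  # The field name depends on the Beats version; "source" and the path are placeholders.
  if [source] =~ /\/var\/log\/special\// {
    mutate { add_tag => ["second_es"] }
  }
}
output {
  if "second_es" in [tags] {
    elasticsearch {
      hosts => ["https://second-es.example.com:9200"]  # remote, reached over the internet
    }
  } else {
    elasticsearch {
      hosts => ["http://primary-es.internal:9200"]     # same network as Logstash
    }
  }
}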

The primary (untagged) ES server is on the same network as the Logstash server, as is a test ES server. When I'm sending logs to these two servers, everything stays in sync and up to date.

The real second (tagged) ES server is accessed over the internet. When I change the output in Logstash from the test ES server to the real one over the internet, the logs start getting very delayed and never catch up.

First of all, if I'm doing this a dumb way, let me know a better way.

Secondly, it seems to me that the problem is the real ES server across the internet, but the owners of that server say they have the resources to handle what I'm sending. Is there anything on my end that could be contributing to this problem?

I wondered if the problem was related to changing the ES server in the output: is Logstash trying to "catch up" the real ES server after switching away from the test one, or does Logstash only send logs to the output from the time the output was changed?

I have X-Pack monitoring working, so below are graphs that look relevant to me. If there are better ones that would help, let me know. The large jump in the graphs at 15:19 is when I changed the output from the test ES server to the real tagged one.

Even though ~15ms is still relatively short, the added latency may be affecting you adversely, and there are a few things you can tune.

Optimising batch size

The Elasticsearch output plugin sends events to Elasticsearch's bulk API, so the cost of added latency scales directly with the number of bulk requests that are sent. If we increase our batch size, we can reduce the amortised effect of latency per event.
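As a rough sketch, the batch size is set in logstash.yml (or per pipeline in pipelines.yml); the value below is only illustrative, not a recommendation:

# logstash.yml (illustrative value, not a recommendation)
# Default is 125 events per worker; a larger batch means fewer,
# larger bulk requests, so less per-event latency overhead.
pipeline.batch.size: 250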

Maximising batch utilisation

The pipeline handles events in batches (default: 125 events per worker), and passes the subset of events matching an if clause to the enclosed plugin(s); the plugins run in sequence, each handling only its subset of the batch.

If we have two outputs that behave differently based on tags, and we can generalise that difference in the filter phase so that the outputs can share a single output block, we can maximise the use of each batch.

For example, in the following each output would handle only a subset of the batch; the outputs run sequentially, so we pay the latency overhead twice:

output {
  if "foo" in [tags] {
    elasticsearch {
      index => "foo"
    }
  } else {
    elasticsearch {
      index => "bar"
    }
  }
}

The following does the same thing, but because it uses a single output we only incur the latency penalty once (fields under [@metadata] are not sent to the output, so the routing value never appears in the indexed documents):

filter {
  if "foo" in [tags] {
    mutate {
      add_field => { "[@metadata][index]" => "foo" }
    }
  } else {
    mutate {
      add_field => { "[@metadata][index]" => "foo" }
    }
  }
}
output {
  elasticsearch {
    index => "%{[@metadata][index]}"
  }
}

Optimising worker count

Additionally, if we find that the output is being negatively affected by latency, we can increase the number of workers for the pipeline, which would enable us to have more batches in flight simultaneously.
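As a sketch, the worker count lives alongside the batch size setting shown earlier in logstash.yml; the default is one worker per CPU core, and the value below is only illustrative:

# logstash.yml (illustrative value, not a recommendation)
# Default is one worker per CPU core; more workers allow more
# batches to be in flight at the same time.
pipeline.workers: 4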

Wow, thank you for that really detailed response. I appreciate the additional understanding I have now.

It turned out that the second ES server was underpowered. They replaced it, and the logs had no trouble keeping up. I also adjusted the batch size on the Logstash server to 250, which seemed to lower the load on it a little.
