Large amounts of bulk data sent from Logstash to Elasticsearch after using pipelines

I recently changed my configuration from a single logstash.config file to multiple pipelines to make it easier to manage.

My pipelines.yml (actual names replaced with letters; main re-routes data to the correct pipeline based on a tag):

- pipeline.id: main
  path.config: "/usr/share/logstash/pipeline/main.config"

- pipeline.id: a
  path.config: "/usr/share/logstash/pipeline/a.config"

- pipeline.id: b
  path.config: "/usr/share/logstash/pipeline/b.config"

- pipeline.id: c
  path.config: "/usr/share/logstash/pipeline/c.config"

- pipeline.id: d
  path.config: "/usr/share/logstash/pipeline/d.config"

- pipeline.id: e
  path.config: "/usr/share/logstash/pipeline/e.config"

- pipeline.id: f
  path.config: "/usr/share/logstash/pipeline/f.config"

In Elasticsearch I see a large amount of data being indexed daily (it was usually 5 MB at most; now it is 1 GB).
I also see this error in the Logstash logs:

[ERROR][logstash.outputs.elasticsearch][main][6bdcb4726a198461b0a3bc504bd116ed5ae4dc3a4e92f278a77b790bc12a0ceb] Attempted to send a bulk request but there are no living connections in the pool (perhaps Elasticsearch is unreachable or down?) {:message=>"No Available connections", :exception=>LogStash::Outputs::ElasticSearch::HttpClient::Pool::NoConnectionAvailableError, :will_retry_in_seconds=>16}
[WARN ][logstash.outputs.elasticsearch][main][39c5e157a8fc0ce37f379032b7514bc216a85707441f8b16bfdf1757bb7fd6a6] Marking url as dead. Last error: [LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError] Elasticsearch Unreachable: [http://elasticsearch:9200/][Manticore::ClientProtocolException] elasticsearch:9200 failed to respond {:url=>http://elasticsearch:9200/, :error_message=>"Elasticsearch Unreachable: [http://elasticsearch:9200/][Manticore::ClientProtocolException] elasticsearch:9200 failed to respond", :error_class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError"}

I also cannot query this data in Kibana and get this error:

Error: Batch request failed with status 503

I'm a little confused as to what is going on here and why so much data is being sent.

Looking at the data in Elasticsearch (using elasticsearch-head), I see the value of one field duplicated about 60 times for each log. Here's an example of my configuration, as I assume the issue is there:

main.config

input {
    beats { 
        port => 5044
        host => "0.0.0.0"
        ssl => false
    }
}
output {
    if [fields][log_type] == "a" {
        pipeline { send_to => a }
    } 
    else if [fields][log_type] == "b" {
        pipeline { send_to => b }
    }
    else if [fields][log_type] == "c" {
        pipeline { send_to => c }
    }
    else if [fields][log_type] == "d" {
        pipeline { send_to => d }
    }
    else if [fields][log_type] == "e" {
        pipeline { send_to => e }
    }
    else if [fields][log_type] == "f" {
        pipeline { send_to => f }
    }
}
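Side note: the pipeline-to-pipeline documentation writes send_to with an array, like this (same virtual address as above):

pipeline { send_to => ["a"] }

Logstash accepts a single bare value for list-type settings, so the form I used should work as well.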

a.config

input {
    pipeline {
        address => "a"
    }
}
filter {
  if [fields][log_type] == "a" {
    grok {
      # (pattern redacted, like the pipeline names)
    }
    date {
      match => ["logdate", "YYYY-MM-dd HH:mm:ss,SSS"]
      target => "logdate"
    }
  }
}
output {
  if [fields][log_type] == "a" {
    elasticsearch {
      hosts => ["elasticsearch:9200"]
      index => "a-logdata-%{+YYYY.MM.dd}"
    }
  }
}

logstash.yml

http.host: "0.0.0.0"
path.config: /usr/share/logstash/pipeline

The issue was the path.config: /usr/share/logstash/pipeline setting in logstash.yml.

I'm not sure why this caused the data fields to be duplicated, but everything is working fine now that it's deleted 🙂
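With that line removed, my logstash.yml is now just:

http.host: "0.0.0.0"

and pipelines.yml is picked up as expected.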

Setting path.config prevents pipelines.yml from being used. If path.config points to a directory then all the files in it are concatenated into a single pipeline. Events are read from all of the inputs, processed by all of the filters, and then all of the events are sent to all of the outputs. New users very often misunderstand this and think each configuration file stands alone.
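As a minimal sketch of that behaviour, suppose path.config points at a directory containing these two (hypothetical) files:

# one.conf
input { stdin {} }
output { stdout {} }

# two.conf
input { generator { count => 1 } }
output { file { path => "/tmp/concatenated-demo.log" } }

Logstash concatenates them into a single pipeline, so the generator event is written both to stdout and to the file, and anything typed on stdin is too. The file boundaries have no meaning.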

For events with [fields][log_type] == "a", each event will reach the Elasticsearch output shown above twice: once directly from the beats input, and once after going through the pipeline output/input pair for a.

If you have any outputs which are not wrapped in a test of [fields][log_type], they will receive each event once from beats, plus duplicates from pipelines a, b, c, d, e, and f.
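For example, a hypothetical unconditional output like this would receive all of those copies:

output {
  # No test of [fields][log_type], so every event in the concatenated
  # pipeline arrives here, once per path it travelled.
  stdout { codec => rubydebug }
}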
