Duplication of data due to Logstash configuration

Hello,

I hope you and your loved ones are safe and healthy.

I suspect that multiple copies of the same data are being created due to a faulty logstash configuration. Here is the ingestion pipeline:

  1. Log files in JSON format are generated on remote hosts.
  2. Filebeat is configured to monitor the folder for new *.json files.
  3. Filebeat reads each file and sends it to a particular IP and port (7999); a sketch of this Filebeat side follows the list.
  4. This IP hosts logstash (version 7.13) running on a Raspberry Pi with Ubuntu 20.04.2 LTS.
  5. There are 4 different configurations for logstash, all running on different ports.
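
A minimal sketch of what the Filebeat side looks like, for context; the log path and Logstash hostname below are placeholders, not my real values:

filebeat.inputs:
  - type: log
    # hypothetical path where the remote host writes its *.json logs
    paths:
      - /var/log/cowrie/*.json
    # decode each line as JSON so the fields arrive structured
    json.keys_under_root: true

output.logstash:
  # hypothetical hostname for the Raspberry Pi running logstash
  hosts: ["logstash-host:7999"]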

Here is the configuration that I suspect is causing the duplication (or maybe triplication) of logs:

Input

input {
    # filebeat
    beats {
        port => 7999
        type => "cowrie"
    }
}


and the output:

output {
    if [type] == "cowrie" {
        elasticsearch {
            hosts => ["ES-Node-1","ES-Node-2"]
            ssl => true
            user => 'redacted'
            password => 'redacted'
            cacert => 'certificate path'
            ssl_certificate_verification => false
            ilm_enabled => auto
            ilm_rollover_alias => "cowrie-logstash"
        }
        #file {
        #    path => "/tmp/cowrie-logstash.log"
        #    codec => json
        #}
        #stdout {
            #codec => rubydebug
        #}
    }
}

This data is visible in the Discover tab under the index pattern Cowrie-*, and the documents have _index: cowrie-logstash-2020.08.09-000001.

Important note: I can see the type field having the value Cowrie in the data under filebeat too:

[screenshot: Filebeat documents in Discover showing type: Cowrie]

Here are my concerns:

  1. There are no logstash configurations where the output is a "filebeat" index, yet I am seeing indexing within Filebeat, at a rate suspiciously similar to Cowrie's.

  2. I queried the exact same time period for both index patterns and found that the document counts are vastly different:

[screenshot: document count for Filebeat-*]

[screenshot: document count for Cowrie-*]

The data itself, however, seems similar.

Edit 1:

I've added index => "cowrie-logstash-%{+yyyy.MM.dd}" to the output, making the entire configuration:

output {
    if [type] == "cowrie" {
        elasticsearch {
            hosts => ["ES-Node-1","ES-Node-2"]
            # data_stream => true  # Causes errors. Added while diagnosing cowrie ingestion causing data duplication,
            # after reading: https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html#plugins-outputs-elasticsearch-data_stream
            index => "cowrie-logstash-%{+yyyy.MM.dd}"
            ssl => true
            user => 'redacted'
            password => 'redacted'
            cacert => 'certificate path'
            ssl_certificate_verification => false
            ilm_enabled => auto
            ilm_rollover_alias => "cowrie-logstash"
        }
        #file {
        #    path => "/tmp/cowrie-logstash.log"
        #    codec => json
        #}
        #stdout {
            #codec => rubydebug
        #}
    }
}

However, I am still seeing indexing in both filebeat and cowrie, same as earlier.

The rate of ingestion is reduced because I've stopped the other hosts.

How do I:

  1. Stop the duplication of data and write only to the cowrie-* index?

  2. Merge only the unique documents from Filebeat into Cowrie? (A rough sketch of one idea I'm considering follows this list.)
    A. I suspect trouble here too, since the index pattern for Cowrie has 560 fields but Filebeat has 6035 fields.
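
For question 2, here is the fingerprint-based deduplication sketch I mentioned. It assumes the raw message field uniquely identifies an event, which I have not verified, and as far as I understand it this only prevents duplicates within the same target index:

filter {
    fingerprint {
        # hash the original event payload into a metadata field
        source => ["message"]
        target => "[@metadata][fingerprint]"
        method => "SHA256"
    }
}
output {
    elasticsearch {
        # identical events get the same _id, so later copies overwrite
        # earlier ones instead of being indexed again
        document_id => "%{[@metadata][fingerprint]}"
        # ... rest of the elasticsearch output settings as above ...
    }
}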

Until writing this post there was no index pattern for Filebeat, and I had never suspected data duplication. It was only because I was losing disk space that I went through these checks.

Thank you very much.

How are you running logstash? What is the content of your pipelines.yml if you are running it as a service?

If you have 4 configurations, but your pipelines.yml points to a directory containing those files instead of defining one pipeline for each of your configurations, logstash will merge your files and you will have one big configuration with multiple inputs and outputs.

If this is the case, it doesn't matter that you have different ports on the inputs; the filter and output blocks will be applied to every event that enters the pipeline.

Share your logstash.yml and your pipelines.yml if possible.


Hello, thank you very much for replying. Please find the uncommented portions of the configuration files below; I'm giving selective portions for brevity. If you need the entire files, please let me know:

logstash.yml (uncommented section only)

# ------------ Data path ------------------
#
# Which directory should be used by logstash and its plugins
# for any persistent needs. Defaults to LOGSTASH_HOME/data
#
path.data: /var/lib/logstash
#
# ------------ Pipeline Settings --------------
#
# The ID of the pipeline.
#
# pipeline.id: main
#
# Set the number of workers that will, in parallel, execute the filters+outputs
# stage of the pipeline.
#
# This defaults to the number of the host's CPU cores.
#
# pipeline.workers: 2
#
# How many events to retrieve from inputs before sending to filters+workers
#
# pipeline.batch.size: 125
#
# How long to wait in milliseconds while polling for the next event
# before dispatching an undersized batch to filters+outputs
#
# pipeline.batch.delay: 50
#
# Force Logstash to exit during shutdown even if there are still inflight
# events in memory. By default, logstash will refuse to quit until all
# received events have been pushed to the outputs.
#
# WARNING: enabling this can lead to data loss during shutdown
#
# pipeline.unsafe_shutdown: false
#
# Set the pipeline event ordering. Options are "auto" (the default), "true" or "false".
# "auto" will  automatically enable ordering if the 'pipeline.workers' setting
# is also set to '1'.
# "true" will enforce ordering on the pipeline and prevent logstash from starting
# if there are multiple workers.
# "false" will disable any extra processing necessary for preserving ordering.
#
pipeline.ordered: auto

# ------------ Debugging Settings --------------
#
# Options for log.level:
#   * fatal
#   * error
#   * warn
#   * info (default)
#   * debug
#   * trace
#
# log.level: info
path.logs: /var/log/logstash

pipelines.yml (entire file)

# This file is where you define your pipelines. You can define multiple.
# For more information on multiple pipelines, see the documentation:
#   https://www.elastic.co/guide/en/logstash/current/multiple-pipelines.html

- pipeline.id: main
  path.config: "/etc/logstash/conf.d/*.conf"

This configuration here could lead to data duplication:

- pipeline.id: main
  path.config: "/etc/logstash/conf.d/*.conf"

With this, you do not have 4 different configurations; you have 1 pipeline called main, composed of your 4 configuration files.

Consider this example, where you have the files 1.conf and 2.conf in the directory /etc/logstash/conf.d/:

1.conf

input {
  beats {
    port => 5001
  }
}
filter {
  # some filters A
}
output {
  elasticsearch {
    hosts => ["http://es-hosts:9200"]
    index => "indexA"
  }
}

2.conf

input {
  beats {
    port => 5002
  }
}
filter {
  # some filters B
}
output {
  elasticsearch {
    hosts => ["http://es-hosts:9200"]
    index => "indexB"
  }
}

When you start logstash, it will merge the files and you will have one pipeline with the following configuration:

input {
  beats {
    port => 5001
  }
  beats {
    port => 5002
  }
}
filter {
  # some filters A
  # some filters B
}
output {
  elasticsearch {
    hosts => ["http://es-hosts:9200"]
    index => "indexA"
  }
  elasticsearch {
    hosts => ["http://es-hosts:9200"]
    index => "indexB"
  }
}

So, if you do not use conditionals in your filters and outputs, every event received from any of the inputs will pass through all filters and will be sent to every output.
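
For example, here is a sketch of how conditionals could isolate events inside a single merged pipeline. The tags are hypothetical; your own config does the same thing with type => "cowrie":

input {
  beats {
    port => 5001
    tags => ["appA"]   # mark every event from this input
  }
  beats {
    port => 5002
    tags => ["appB"]
  }
}
output {
  if "appA" in [tags] {
    elasticsearch {
      hosts => ["http://es-hosts:9200"]
      index => "indexA"
    }
  }
  if "appB" in [tags] {
    elasticsearch {
      hosts => ["http://es-hosts:9200"]
      index => "indexB"
    }
  }
}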

If you want to completely separate your configuration files and avoid having to use conditionals, you need to change your pipelines.yml.

- pipeline.id: pipeline-1
  path.config: "/etc/logstash/conf.d/1.conf"

- pipeline.id: pipeline-2
  path.config: "/etc/logstash/conf.d/2.conf"

This way you have 2 different and isolated pipelines, and the events do not mix up.
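
If you want to confirm the split worked, the Logstash node API (assuming the default API port of 9600) should then list each pipeline separately:

curl -XGET 'http://localhost:9600/_node/pipelines?pretty'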


Thank you very much for this.

I will try this and come back. I had added id => "honeypot_ingest" inside each of the configurations, but commented it out because cluster monitoring then stopped giving details about the pipelines (the entire cluster is being monitored via Metricbeat). Should I use this id field in pipelines.yml to separate the configurations?

Hence my pipelines.yml would be:

# This file is where you define your pipelines. You can define multiple.
# For more information on multiple pipelines, see the documentation:
#   https://www.elastic.co/guide/en/logstash/current/multiple-pipelines.html

- pipeline.id: honeypot_ingest
  path.config: "/etc/logstash/conf.d/cowrie.conf"

- pipeline.id: beats_ingest
  path.config: "/etc/logstash/conf.d/beats.conf"

- pipeline.id: packetbeat_ingest
  path.config: "/etc/logstash/conf.d/packetbeat.conf"

To separate your pipelines you just need to change your pipelines.yml; there is no id field in pipelines.yml, just pipeline.id.

The id setting you are talking about is the one used in inputs, filters, and outputs to get metrics about them; it does not separate anything, and there is no need to use it if you do not want to.
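
For illustration, a sketch of where that id setting lives (the value here is just an example); it only names the plugin instance in the stats and monitoring APIs:

input {
  beats {
    port => 7999
    # appears as this plugin instance's name in pipeline metrics;
    # it has no effect on event routing
    id => "honeypot_ingest"
  }
}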

Separate your pipelines like this:

- pipeline.id: pipeline-1
  path.config: "/etc/logstash/conf.d/1.conf"

- pipeline.id: pipeline-2
  path.config: "/etc/logstash/conf.d/2.conf"

- pipeline.id: pipeline-3
  path.config: "/etc/logstash/conf.d/3.conf"

- pipeline.id: pipeline-4
  path.config: "/etc/logstash/conf.d/4.conf"

You can change pipeline-X to honeypot_ingest, for example; you choose the name of the pipeline. Only letters, numbers, - and _ are allowed, if I'm not wrong.

You can read more about running multiple pipelines in the documentation.


Thank you very much @leandrojmp -- you are my savior for today. It is working perfectly; I can see the three pipelines :slight_smile:

and I can see indexing only in cowrie and the others (like metricbeat, auditbeat), but none in filebeat.

Thank you very much once again. :pray: :pray: :pray: :pray: :pray: :pray:

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.