Issue importing CSV into Elasticsearch with Logstash

Hello ELK community!

I use the ELK stack to parse CSV files with Logstash and send them to Elasticsearch.

Unfortunately, I have a problem:

When I drop my files into the directory watched by the "input" of my Logstash pipeline, the records are duplicated, or even tripled, without my asking for anything...

This is what my pipeline looks like:

input {
  file {
    path => "/home/XXX/report/*.csv"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}
filter {
  csv {
    separator => ";"
    columns => ["Name", "Status", "Category", "Type", "EndPoint", "Group", "Policy", "Scanned At", "Reported At", "Affected Application"]
  }
}
output {
  elasticsearch {
    hosts => "http://localhost:9200"
    index => "malwarebytes-report"
  }
  stdout {}
}

When I drop my first file, containing 28 records, into "/home/XXX/report/", this is what Elasticsearch reports:

[root@lrtstfpe1 confd]# curl -XGET 'localhost:9200/_cat/indices?v&pretty'
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open malwarebytes-report PO4g6rKRTb6yuMDb7i-6sg 5 1 28 0 25.3kb 25.3kb

So far it's OK, but when I send my second file of 150 records...:

[root@lrtstfpe1 confd]# curl -XGET 'localhost:9200/_cat/indices?v&pretty'
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open malwarebytes-report PO4g6rKRTb6yuMDb7i-6sg 5 1 328 0 263.3kb 263.3kb

The 150 records have been doubled and added to the first 28...

What's going on?

I have been stuck on this problem for several days; I really need your help.

I believe your issue is with the Logstash file input plugin configuration. This config

    file {
      path => "/home/XXX/report/*.csv"
      start_position => "beginning"
      sincedb_path => "/dev/null"
    }

tells Logstash to read every file matching /home/XXX/report/*.csv from the beginning every time it sees a new file.

sincedb_path => "/dev/null" says do not keep track of where we last stopped reading the file and treat all files as new every time the folder is read.
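
If you want Logstash to remember how far it has read across restarts and pipeline reloads, point sincedb_path at a real, writable file instead of /dev/null. A minimal sketch (the sincedb location below is just an example path, not something taken from your setup):

    file {
      path => "/home/XXX/report/*.csv"
      start_position => "beginning"
      # example location; any file the logstash user can write to will do
      sincedb_path => "/var/lib/logstash/sincedb_report"
    }

With a persistent sincedb, start_position => "beginning" only applies to files Logstash has never seen before; files already read are only picked up again from where it left off.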

More detailed information is available in the documentation at
https://www.elastic.co/guide/en/logstash/current/plugins-inputs-file.html#plugins-inputs-file-sincedb_path

A more detailed description of how sincedb works is here
https://www.elastic.co/guide/en/logstash/current/plugins-inputs-file.html#_tracking_of_current_position_in_watched_files
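
If lines still end up being processed more than once, a common workaround (a general pattern, not something specific to your report files) is to make indexing idempotent: derive the Elasticsearch document ID from each line's content with the fingerprint filter, so a re-read line overwrites its earlier copy instead of creating a duplicate. A rough sketch:

    filter {
      fingerprint {
        source => "message"                    # hash the raw CSV line
        target => "[@metadata][fingerprint]"
        method => "SHA256"
        key    => "any-fixed-string"           # used as the HMAC key; the value itself is arbitrary
      }
    }
    output {
      elasticsearch {
        hosts       => "http://localhost:9200"
        index       => "malwarebytes-report"
        document_id => "%{[@metadata][fingerprint]}"   # same line => same _id => update, not a new doc
      }
    }

The trade-off is that two genuinely identical lines, even from different files, collapse into a single document, so this only helps if that is acceptable for your reports.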

I just tried deleting those two lines: same problem... What am I supposed to do so that older files are not re-read when new files arrive in the directory?

When I use Filebeat to monitor my directory instead of the file input section of Logstash, the behaviour is the same...

My pipeline when using Filebeat:

input {
  beats {
    port => 5044
  }
}
filter {
  csv {
    separator => ";"
    columns => ["Name","Status","Category","Type","EndPoint","Group","Policy","Scanned At","Reported At","Affected Application"]
  }
}
output {
  elasticsearch {
    hosts => "http://localhost:9200"
    index => "malwarebytes-report"
  }
}

My Filebeat config:

filebeat.prospectors:
- type: log
  enabled: true
  paths:
    - /home/D_NT_UEM/petotadm/rapport/*.csv
filebeat.config.modules:
  path: ${path.config}/modules.d/*.yml
  reload.enabled: false
setup.template.settings:
  index.number_of_shards: 3
setup.kibana:
output.logstash:
  hosts: ["localhost:5044"]

Moreover, every CSV config that I found on the Internet looks like

> start_position => "beginning"
> sincedb_path => "/dev/null"

I just completely uninstalled the ELK stack (rpm -e elasticsearch kibana logstash filebeat) as well as any leftover ELK files (rm -rf /var/lib/ELK /var/log/ELK /etc/default/ELK /usr/share/ELK ...), so there is nothing left anywhere.

Then I reinstalled everything:

rpm -ivh elasticsearch-6.2.3.rpm
rpm -ivh kibana-6.2.3-x86_64.rpm
rpm -ivh logstash-6.2.3.rpm

And started the services: service elasticsearch restart, service kibana restart, service logstash restart.

Then, in terms of configuration:
/etc/elasticsearch/elasticsearch.yml is completely default.
/etc/kibana/kibana.yml is completely default.
/etc/logstash/logstash.yml is completely default.

Then I put my one and ONLY pipeline, named "pip.conf", in /etc/logstash/conf.d/.
Its configuration:

input {
  file {
    path => "/home/report/*.csv"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}
filter {
  csv {
    separator => ";"
    columns => ["Name","Status","Category","Type","EndPoint","Group","Policy","Scanned At","Reported At","Affected Application"]
  }
}
output {
  elasticsearch {
    hosts => "http://localhost:9200"
    index => "malwarebytes-report"
  }
  stdout {}
}

And finally, I launch my pipeline:
I go into /usr/share/logstash and execute:

bin/logstash -f /etc/logstash/conf.d/pip.conf

After a few seconds, my pipeline is listening, and I put file1.csv and file2.csv into /home/report/.

file1.csv contains 28 records and file2.csv contains 150 records.

But now, when I check my index with curl -XGET 'localhost:9200/_cat/indices?v&pretty',
my index "malwarebytes-report" contains 357 records... (150x2 + 28x2...)

I don't understand ANYTHING...

Thanks for your help.
