Regarding CSV file import into Elasticsearch using Logstash

Hello Support team,

I have a query regarding CSV file upload into Elasticsearch using Logstash.

With the help of this forum, I was able to import our CSV files into Elasticsearch successfully. However, I have noticed that if I rerun the Logstash command, the data from my CSV file is uploaded again, creating duplicate entries. Is there any way to skip the data already existing in Elasticsearch?

Is deleting the file after each upload completes the only possible way? Please suggest the best approach.

hi @mohanss08

this is a better question for our #elastic-stack:logstash channel. Generally I can advise you, if your data has a unique ID, to also use this ID when you insert data into Elasticsearch via Logstash. Then you would replace documents instead of inserting them again. I think you have to configure the Logstash Elasticsearch output to use a custom document_id. Certainly the #elastic-stack:logstash channel could help here if you're stuck.
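A minimal sketch of what that output could look like (the index name, host, and the field name `my_unique_id` are placeholders, not taken from your setup):

```
output {
  elasticsearch {
    hosts => "http://localhost:9200"
    index => "my-index"
    # Reuse the data's own unique ID as the Elasticsearch _id, so
    # re-running the pipeline updates documents instead of duplicating them.
    document_id => "%{my_unique_id}"
  }
}
```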


Hi Matthias, thanks for the reply. OK, I have moved this ticket to the #elastic-stack:logstash channel.

Logstash support group,

Below is my conf file. In my CSV file the metric_id column has unique numbers; with that I tried to import my file through Logstash, but it still uploads duplicate entries. Please correct me so I can fix this problem.


input {
  file {
    path => "/home/user/elk/logstash/report-file.csv"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}

filter {
  csv {
    separator => ","
    columns => ["start_time", "date", "requester", "full-name", "metric_id", "config", "status"]
  }
}

output {
  elasticsearch {
    action => "index"
    hosts => "http://elasticsearch:9200"
    index => "project-info"
    document_id => "%{metric_id}"
  }
  stdout {}
}

document_id => "%{metric_id}" seems the right entry to avoid duplicates.


with document_id, you will hit an issue if the same id appears in the future, as it will update the existing record instead of creating a new one (unless this is what you want)

sincedb_path is the problem. Logstash keeps track of its progress in the sincedb file. You're sending sincedb to /dev/null, which effectively tells Logstash to reread the file every time it starts.

Point sincedb_path to a file which Logstash can write to, and your issue should be solved without document_id.
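For example, the input block could point sincedb at a persistent file (the path below is only an illustration; pick any location the logstash user can write to):

```
input {
  file {
    path => "/home/user/elk/logstash/report-file.csv"
    start_position => "beginning"
    # Persist read progress so a restart does not reread the whole file.
    sincedb_path => "/usr/share/logstash/data/sincedb-report.txt"
  }
}
```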

Okay, sure. I'll remove the document_id, change the sincedb_path to write to a file, and check. Thanks.

@ptamba, I have added a sincedb_path writing to a txt file in the input section of my conf file, and I have updated my docker-compose file as below.

  - './logstash/pipeline:/usr/share/logstash/pipeline'
  - './logstash/sincedb:/usr/share/logstash/sincedb:rw'

My ./logstash/pipeline/logstash_csv_report.conf file was updated with the below.

sincedb_path => "/usr/share/logstash/sincedb/sincedb.txt"

On my native host, the ./logstash/sincedb/sincedb.txt file, and likewise /usr/share/logstash/sincedb/sincedb.txt inside the Logstash container, doesn't have any entries; it shows zero KB in size.

I have pushed the same CSV data to Elasticsearch and I can still see the duplicate entries. Please let me know what I have configured wrongly.

Thanks in advance for the support.

unfortunately I'm not really sure how it works in Docker; it seems like you're mounting the ./logstash/sincedb directory.

according to this,

“ Bind-mounted configuration files will retain the same permissions and ownership within the container that they have on the host system. Be sure to set permissions such that the files will be readable and, ideally, not writeable by the container’s logstash user (UID 1000).”

anything in the log files indicating an error? My suspicion is that the logstash user can't write to that file
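One way to check and fix this on the host, assuming the bind mount shown above and that the container's logstash user has UID 1000 (as the quoted docs state), would be something like:

```
# On the host: make the mounted sincedb directory writable by UID 1000,
# the logstash user inside the official container.
chown -R 1000:1000 ./logstash/sincedb
# Then verify from inside the container that the logstash user owns it:
docker compose exec logstash ls -ln /usr/share/logstash/sincedb
```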

@ptamba, you are right. The file I mounted from the host system into Docker had root ownership. I have fixed that; thanks a lot for the pointer. The sincedb_path is now writing to a file with logstash ownership.

But I am still able to upload duplicates.

For example, if I upload a file and then upload the same file again, I can see the same data go into Elasticsearch.

sincedb keeps track of watched files, according to this.

If your use case involves manually re-uploading files, then document_id is probably the best solution. I'm not really sure how Logstash tracks files with the same filename; according to those docs, it identifies a file by an identifier made up of the inode, the major device number, and the minor device number, so a file that is deleted and then copied back in will usually get a new inode and be read as a brand-new file.

One thing you can try is to upload the file, let Logstash process it, and then look at the contents of the sincedb file.

I'm not sure whether it should be skipping new files that reuse the same filename.

My use case is that daily I download a CSV file from our internal site and insert it into ELK for metrics purposes. In order to evaluate how ELK can avoid duplicate entries, I have been trying to find the best way to prevent duplicates from going into Elasticsearch through Logstash.

My current evaluation is as follows.

  1. Upload a CSV file which has 3400 rows.
  2. Delete the CSV file from the Logstash import location.
  3. Manually place the same CSV file back in the Logstash location. Here it uploads the same entries already present in Elasticsearch.

Along with the already existing 3400 documents it creates duplicates, becoming 6800. The sincedb.txt file does have some entries.

I have tried adding document_id => "%{metric_id}"; with that, only one duplicate still gets through. I think document_id is the best solution in our case. Can you guide me on how to avoid even a single duplicate going into Elasticsearch?

if your metric_id is unique, and you want the id to be retained, then document_id => "%{metric_id}" should be the best option. You can combine it with action create to avoid the document being overwritten if the document_id already exists.

your existing config shows action => "index". This will overwrite/update the document if the same document_id exists. It shouldn't upload the same entries, unless you have indexed the documents before without specifying an id, which causes Elasticsearch to automatically generate one.
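Putting the two suggestions together, the output block might look like this (a sketch based on the config shown earlier in the thread):

```
output {
  elasticsearch {
    # "create" fails with a conflict instead of overwriting when the _id
    # already exists, so re-importing the same file cannot produce
    # duplicates or silent updates.
    action => "create"
    hosts => "http://elasticsearch:9200"
    index => "project-info"
    document_id => "%{metric_id}"
  }
}
```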

Okay Thanks for all the information. Appreciate your support.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.