Question on ingesting multiple CSV files with repeated data

Hi, does Elasticsearch automatically remove duplicated data when I ingest multiple CSV files with repeated rows in them? Or is there a way to enable that?

No.

Only if you are using the same _id, in which case the duplicated row will overwrite the previous one.

Ohhh, I see. Thanks for the prompt reply @dadoonet :bowing_man:
How do I go about setting the id for the ingest? The data I have is something like:

1.csv
-------
2020-04-19,22,2899,2888,339,429,11,6588
2020-04-20,23,3398,3782,351,449,11,8014
2020-04-21,27,3566,4682,371,468,11,9125
2020-04-22,25,4209,4999,413,483,12,10141
--------
2.csv
--------
2020-04-20,23,1364,5824,351,441,11,8014
2020-04-21,27,1381,6875,371,460,11,9125
2020-04-22,25,1571,7645,413,475,12,10141
2020-04-23,26,1342,8874,434,490,12,11178

The way I am ingesting the data right now looks something like this, but per your suggestion, I am not sure where to set the id or how to go about doing it. And if the amount of data (in rows) is inconsistent, will the id still let Elasticsearch detect that a particular row is repeated?

import csv
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

for x in sorted(files):
    print(x)
    with open(x) as f:
        reader = csv.DictReader(f)
        helpers.bulk(es, reader, index='testindex')

I just found an article on enrich policies with the ingest processor:

How to enrich logs and metrics using an Elasticsearch ingest node | Elastic Blog

Is it something like this?

If you don't define an _id, then Elasticsearch will generate one for you.
If the unique id in your case is the date of the event, then you can think of using the date as the _id.

So CSV1 will be like:

PUT index/_doc/2020-04-19
{ "date": "2020-04-19", ... }
PUT index/_doc/2020-04-20
{ "date": "2020-04-20", ... }

Then the second CSV will be something like:

PUT index/_doc/2020-04-20
{ "date": "2020-04-20", ... }
PUT index/_doc/2020-04-21
{ "date": "2020-04-21", ... }

Which means that at the end, you will have the following records:

{ "date": "2020-04-19", ... }
{ "date": "2020-04-20", ... }
{ "date": "2020-04-21", ... }

Thanks for the information! I think I've already got the solution for the _id in my case.

# p here is an ingest client, e.g. IngestClient(es) from elasticsearch-py
p.put_pipeline(id='attachment', body={
    'description': 'setting press release date to _id',
    'processors': [
        {
            # copy the CSV's date column into the document _id
            "set": {
                "field": "_id",
                "value": "{{Press release date}}"
            }
        }
    ]
})
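
I assume the bulk call then needs to reference the pipeline for it to actually run, something like this (extra keyword arguments to helpers.bulk are forwarded to the bulk API; same imports as above):

for x in sorted(files):
    with open(x) as f:
        reader = csv.DictReader(f)
        # pipeline is forwarded to the bulk request, so every row goes
        # through the 'attachment' pipeline and gets its _id set
        helpers.bulk(es, reader, index='testindex', pipeline='attachment')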

Yes. That will work.

Out of curiosity, why not provide the correct _id in your Python script instead of having to reprocess the documents in an ingest pipeline?
It will be faster if you do that on the Python side.
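
Something like this, for example (a rough sketch, assuming the same "Press release date" column you used in the pipeline):

import csv
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

def actions(path):
    with open(path) as f:
        for row in csv.DictReader(f):
            # use the date as the _id so a repeated row overwrites
            # the previous one instead of creating a duplicate
            yield {'_id': row['Press release date'], '_source': row}

for x in sorted(files):
    helpers.bulk(es, actions(x), index='testindex')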

Haha, I guess there will always be a better way to work things out, but I think for now I will leave it as it is and update it if there is a need to. Thanks anyway!
