Question on ingesting multiple CSV files with repeated data

Hi, does Elasticsearch automatically remove duplicated data when I ingest multiple CSV files with repeated rows in them? Or is there a way to enable that?

No.

Only if you are using the same _id, in which case the duplicated row will overwrite the previous one.

Ohhh, I see. Thanks for the prompt reply @dadoonet :bowing_man:
How do I go about setting the id for the ingest? The data I have is something like:

1.csv
-------
2020-04-19,22,2899,2888,339,429,11,6588
2020-04-20,23,3398,3782,351,449,11,8014
2020-04-21,27,3566,4682,371,468,11,9125
2020-04-22,25,4209,4999,413,483,12,10141
--------
2.csv
--------
2020-04-20,23,1364,5824,351,441,11,8014
2020-04-21,27,1381,6875,371,460,11,9125
2020-04-22,25,1571,7645,413,475,12,10141
2020-04-23,26,1342,8874,434,490,12,11178

The way I am ingesting the data right now looks something like this, but per your suggestion, I am not sure where to set the id or how to go about doing it. And if the amount of data (in rows) is inconsistent, will the id still let Elasticsearch detect that a particular row is repeated?

import csv
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

for x in sorted(files):
    print(x)
    with open(x) as f:
        reader = csv.DictReader(f)
        helpers.bulk(es, reader, index='testindex')

I just found an article on enrich policies with the ingest processor:

How to enrich logs and metrics using an Elasticsearch ingest node | Elastic Blog

Is it something like this?

If you don't define an _id, then Elasticsearch will generate one for you.
If the unique id in your case is the date of the event, then you can think of using the date as the _id.

So CSV1 will be like:

PUT index/_doc/2020-04-19
{ "date": "2020-04-19", ... }
PUT index/_doc/2020-04-20
{ "date": "2020-04-20", ... }

Then the second CSV will be something like:

PUT index/_doc/2020-04-20
{ "date": "2020-04-20", ... }
PUT index/_doc/2020-04-21
{ "date": "2020-04-21", ... }

Which means that at the end, you will have the following records:

{ "date": "2020-04-19", ... }
{ "date": "2020-04-20", ... }
{ "date": "2020-04-21", ... }

Thanks for the information! I think I've already got the solution for the _id in my case.

# p here is an ingest client, e.g. IngestClient(es) from elasticsearch-py
p.put_pipeline(id='attachment', body={
    'description': 'setting press release date to _id',
    'processors': [
        {
            # copy the CSV's date column into the document _id
            "set": {
                "field": "_id",
                "value": "{{Press release date}}"
            }
        }
    ]
})
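
I assume the bulk call then needs to reference the pipeline for it to actually run, something like this (extra keyword arguments to helpers.bulk are forwarded to the bulk API; same imports as above):

for x in sorted(files):
    with open(x) as f:
        reader = csv.DictReader(f)
        # pipeline is forwarded to the bulk request, so every row goes
        # through the 'attachment' pipeline and gets its _id set
        helpers.bulk(es, reader, index='testindex', pipeline='attachment')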

Yes. That will work.

Out of curiosity, why not provide the correct _id in your Python script instead of having to reprocess the documents in an ingest pipeline?
It will be faster if you do that on the Python side.
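
Something like this, for example (a rough sketch, assuming the same "Press release date" column you used in the pipeline):

import csv
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

def actions(path):
    with open(path) as f:
        for row in csv.DictReader(f):
            # use the date as the _id so a repeated row overwrites
            # the previous one instead of creating a duplicate
            yield {'_id': row['Press release date'], '_source': row}

for x in sorted(files):
    helpers.bulk(es, actions(x), index='testindex')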

Haha, I guess there will always be a better way to work things out, but I think for now I will leave it as it is and update it if there is a need to. Thanks anyway!
