Load multiple CSV files and update an existing index

Hi, I am trying to create a "master table" in Elasticsearch, using Logstash, where many CSV files will be merged on a common column named 'A'. Given the index mapping, can I load each CSV and update the column that CSV corresponds to (col1, col2, col3, etc.), grouping on the common column 'A'?

In more detail, I want each CSV to append its data to its corresponding (already existing) column: if a value of the grouping column 'A' does not exist yet, it should be inserted and the corresponding column updated; otherwise, just the corresponding column should be updated for that value of 'A'.

Any ideas? I have already tried merging the CSV files with Python, but ran out of memory (64 GB).

It sounds like you want to use an elasticsearch output with doc_as_upsert set, and use column 'A' as the document id.

Yes, exactly!
I have tried loading the first CSV with the given mapping, and then inserting a second one like this:

output {
  elasticsearch {
    hosts         => ["localhost"]
    index         => "mock"
    action        => "update"
    doc_as_upsert => true
  }
}

but I don't know how to use column 'A' as the document id.

Use the document_id option and a sprintf reference to a field

document_id => "%{[A]}"

will use the contents of field A as the document id.
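
Putting the pieces together, the output block from above would look something like this (index name mock and field name A taken from your example):

output {
  elasticsearch {
    hosts         => ["localhost"]
    index         => "mock"
    action        => "update"
    doc_as_upsert => true
    # use the value of column 'A' as the document id, so rows from different
    # CSV files that share the same 'A' value update the same document
    document_id   => "%{[A]}"
  }
}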

Thank you! That was really helpful!

One last question though... given that a CSV has multiple rows of data for a single value of column A, can I append all of them somehow, or will the last one overwrite the previous ones, so that I end up with only one value per value of column A for each CSV?

All CSV files have the same number of columns (2). For example, let's say csv_1 looks like this:

column_A | Value
1        | 10
1        | 20
2        | 5
3        | 4

Would both values 10 and 20 be written for column_A = 1, or would only one of them survive, using doc_as_upsert => true and action => "update"?

Only one would survive.

Another possible approach is shown here, which is to have logstash write out a file that contains document updates and then curl that into elasticsearch.

Yet another option would be to use a variant of this: if an elasticsearch output does not do quite what you want, you can use an http filter to POST into elasticsearch.
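
As a rough, untested sketch of that http filter route: Elasticsearch's _update API supports a scripted upsert, which can append each row's value to an array instead of overwriting it. The field names column_A and Value come from the example above, the index mock and localhost host from the earlier output block, and the target array name values is made up here. Whether the filter interpolates %{...} references inside a hash body, and the exact _update URL, depend on your plugin and Elasticsearch versions, so treat this as a starting point rather than working config.

filter {
  http {
    # POST a scripted update for the document whose id is this row's column_A
    url         => "http://localhost:9200/mock/_update/%{[column_A]}"
    verb        => "POST"
    body_format => "json"
    body        => {
      # run the script even when the document is being created by the upsert
      "scripted_upsert" => true
      "upsert"          => {}
      "script" => {
        # append this row's Value (inserted as a string here) to a 'values' array
        "source" => "if (ctx._source.values == null) { ctx._source.values = [params.v] } else { ctx._source.values.add(params.v) }"
        "params" => { "v" => "%{[Value]}" }
      }
    }
  }
}

With this approach the write happens in the filter, so the elasticsearch output is no longer needed for these events; the HTTP response gets added to each event, which you may want to drop.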
