Hi, I am trying to create a "master table" in Elasticsearch, using Logstash, where many CSV files are merged on a common column named 'A'. Given the mapping of the index, can I ingest each CSV and have it update only the column that CSV is responsible for (col1, col2, col3, etc.), grouping documents by the common column 'A'?
In more detail, I want each CSV to append its info to its own (already existing) column: if a value of the grouping column 'A' does not exist yet, insert a new document and set the corresponding column; otherwise just update the corresponding column of the document for that value of 'A'.
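Just to make it concrete (the field names and values below are only placeholders for illustration), what I am after is one document per value of 'A' that every CSV keeps enriching with its own column, something like:

```
{ "A": 1, "col1": 10, "col2": "x", "col3": 7.5 }
```

where col1 came from csv_1, col2 from csv_2, and so on.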
Any ideas? I have already tried merging the CSV files with Python, but I ran out of memory (64 GB).
Yes, exactly!
I have tried inserting the first CSV into the index with the given mapping, and then inserting a second one with an output like this:

```
output {
  elasticsearch {
    hosts         => ["localhost"]
    index         => "mock"
    action        => "update"
    doc_as_upsert => true
  }
}
```

but I don't know how to use column 'A' as the document id.
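Is document_id the option I should be setting for that? My guess (assuming the csv filter names the common column "column_A"; I have not verified this) would be:

```
output {
  elasticsearch {
    hosts         => ["localhost"]
    index         => "mock"
    action        => "update"
    doc_as_upsert => true
    # guess: use the value of the common column as the Elasticsearch document id,
    # so every CSV creates/updates the same document per value of 'A'
    document_id   => "%{column_A}"
  }
}
```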
One last question though... given that a CSV has multiple rows for a single value of column A, can I somehow append all of them, or will the last one overwrite the previous ones, so that I end up with only one value per value of column A for each CSV?
All CSV files have the same number of columns (2). For example, let's say csv_1 looks like this:
column_A | Value
-------- | -----
1        | 10
1        | 20
2        | 5
3        | 4
Would both values 10 and 20 be written for column_A = 1, or would only the last one survive, using doc_as_upsert => true and action => "update"?
Another possible approach is shown here, which is to have logstash write out a file that contains document updates and then curl that into elasticsearch.
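For example (just a sketch, untested; I am reusing your index "mock", using column_A as the document id, and inventing a "col1" field name for the first CSV's value), the file Logstash writes would contain bulk-API pairs like:

```
{ "update": { "_index": "mock", "_id": "1" } }
{ "doc": { "col1": 10 }, "doc_as_upsert": true }
{ "update": { "_index": "mock", "_id": "2" } }
{ "doc": { "col1": 5 }, "doc_as_upsert": true }
```

which you can then load with something along the lines of:

```
curl -s -H 'Content-Type: application/x-ndjson' -XPOST 'http://localhost:9200/_bulk' --data-binary @updates.ndjson
```

(On older Elasticsearch versions the action line may also need a "_type".)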
Yet another option would be to use a variant of this. If an elasticsearch output does not do quite what you want then you can use an http filter to POST into elasticsearch.
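A rough sketch of that (untested, and several things here are assumptions: the "values" array field, the _update endpoint path for your Elasticsearch version, and whether the http filter expands %{...} references inside url and body — if it does not, you may need to build those strings with a mutate or ruby filter first):

```
filter {
  http {
    # POST a scripted update per event: the script appends the value to an
    # array instead of overwriting it, and the upsert creates the document
    # if no document exists yet for this value of column_A.
    url         => "http://localhost:9200/mock/_update/%{column_A}"
    verb        => "POST"
    headers     => { "Content-Type" => "application/json" }
    body_format => "json"
    body        => {
      "script" => {
        "source" => "if (ctx._source.values == null) { ctx._source.values = [params.v] } else { ctx._source.values.add(params.v) }"
        "params" => { "v" => "%{Value}" }
      }
      "upsert" => { "column_A" => "%{column_A}", "values" => [] }
    }
  }
}
```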