Is there any way to keep duplicate rows as they are, and maybe include a number that differentiates them? The duplicate records aren't really duplicates so much as multiple instances of the same item, and they are all needed. This is for CSV files.
Well, the thing is: if no document_id is set, Elasticsearch generates one, but the next time those Logstash config files run, the data just gets duplicated because every row is assigned another new ID. The duplicates that already exist are valid, since they are just multiple instances of the same object, but there is no field/column that differentiates them. So would there be a way to include some kind of field in the ID that separates the multiple instances, and then update those IDs if there are more or fewer of those objects the next time?
Or what if I could add a field that records the number of occurrences of those duplicates? Then, when the Logstash file runs again, could that number simply be overwritten if the number of duplicates increases or decreases?
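One common pattern for this (a sketch, not something from your pipeline) is to build a deterministic fingerprint of each row's content and then append a per-fingerprint occurrence counter, so the first, second, and third identical rows get IDs like `<hash>-1`, `<hash>-2`, `<hash>-3`. Re-running the same file then overwrites the same documents instead of inserting new ones. The field names (`[@metadata][fp]`, `[@metadata][fp_seq]`) and the index name are made up for illustration; note the counter is only reliable with a single pipeline worker (`pipeline.workers: 1`), since the ruby filter keeps state in memory:

```
filter {
  # Hash the whole raw line so identical rows produce identical fingerprints
  fingerprint {
    source => ["message"]
    target => "[@metadata][fp]"
    method => "SHA256"
  }
  # Count how many times this fingerprint has been seen in this run,
  # so repeated rows get sequence numbers 1, 2, 3, ...
  ruby {
    init => "@seen = Hash.new(0)"
    code => "
      fp = event.get('[@metadata][fp]')
      @seen[fp] += 1
      event.set('[@metadata][fp_seq]', @seen[fp])
    "
  }
}
output {
  elasticsearch {
    index       => "my-csv-index"   # hypothetical index name
    document_id => "%{[@metadata][fp]}-%{[@metadata][fp_seq]}"
  }
}
```

With this scheme, if a later run contains fewer instances of a row, the higher-numbered documents from the previous run are not overwritten and would linger, so you would still need to delete stale documents (for example by reindexing into a fresh index per run) if the count can shrink.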