Using an aggregation, I am able to query out doc_count: 272152 duplicate instances in my Elasticsearch database.
The problem now is that if I simply run a _delete_by_query, it will delete everything, including the originals.
What effective strategy can I use to retain the original document in each group of duplicates?
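Roughly, the aggregation I am referring to looks like this (a sketch only; duplicate_key is a placeholder for whatever field the duplicates share):

POST testindex/_search
{
  "size": 0,
  "aggs": {
    "duplicates": {
      "terms": {
        "field": "duplicate_key",
        "min_doc_count": 2,
        "size": 10000
      }
    }
  }
}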
From what I've read online, one possible solution is to first run a request with a min aggregation to get the minimum value of the timestamp field, and then exclude that value in the search body of the delete request.
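As a sketch of that idea (the field names are assumptions: created_date for the timestamp and duplicate_key for the shared field), the two requests would look something like:

POST testindex/_search
{
  "size": 0,
  "query": { "term": { "duplicate_key": "some-duplicated-value" } },
  "aggs": {
    "oldest": { "min": { "field": "created_date" } }
  }
}

POST testindex/_delete_by_query
{
  "query": {
    "bool": {
      "must": [
        { "term": { "duplicate_key": "some-duplicated-value" } }
      ],
      "must_not": [
        { "term": { "created_date": "<min value returned by the first request>" } }
      ]
    }
  }
}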
Thank you for the idea. However, as mentioned, there are at least 200k instances, which means I would have to manually eyeball 200k IDs for that method to work (it took me 2 hours to sift through 100 transactions).
Is there a way to quickly extract all the 'wrong' IDs?
When you say extract them all to a spreadsheet, how do I do that?
From what I know, the Elasticsearch plugin's structured query can download CSV files, but it does not allow the aggregation query itself to be executed.
If you don't mind, could you advise how, from my existing query, I can effectively extract the data to an Excel spreadsheet?
I tried running:

POST testindex/_sql?format=csv
{"query":"SELECT * FROM testindex WHERE created_date < '2020-12-16'"}
And it threw this error:
* "type": "illegal_argument_exception",
* "reason": "Rejecting mapping update to [testindex] as the final mapping would have more than 1 type: [_doc, _sql]"
Then I changed the script to:
POST testindex/_doc?format=csv
{"query":"SELECT * FROM testindex WHERE created_date < '2020-12-16'"}