Hi Everyone,
Using an aggregation, I can find 272,152 duplicate instances (doc_count: 272152) in my Elasticsearch index.
The problem is that if I simply run a _delete_by_query, it will delete everything, including the originals.
What strategy can I use to retain the original documents?
Reading online, I've seen one possible solution: run a first request with a min aggregation to get the minimum value of the timestamp field, then exclude that value in the search body of the delete request.
Please advise.
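A minimal sketch of that two-step idea, written as Python dicts so the request bodies are easy to see. The "createdDate" field matches my mapping below; the "TXN-001" key and the timestamp value are hypothetical placeholders, and the bodies would be sent with whatever client you use:

```python
# Step 1: ask Elasticsearch for the earliest timestamp among one set of duplicates
# (hypothetical duplicate key "TXN-001"; field names mirror the query below).
min_request = {
    "size": 0,
    "query": {"terms": {"Txn_Ref.keyword": ["TXN-001"]}},
    "aggs": {"oldest": {"min": {"field": "createdDate"}}},
}

def build_delete_body(duplicate_key, oldest_timestamp):
    """Step 2: build a _delete_by_query body that spares the oldest
    document by excluding its timestamp from the match."""
    return {
        "query": {
            "bool": {
                "filter": [{"term": {"Txn_Ref.keyword": duplicate_key}}],
                "must_not": [{"term": {"createdDate": oldest_timestamp}}],
            }
        }
    }

body = build_delete_body("TXN-001", "2020-06-01T00:00:00Z")
```

One caveat with this approach: if two duplicates share the exact same timestamp, the must_not clause would spare both of them, so it assumes timestamps are unique within a duplicate group.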
{
  "size": 0,
  "aggs": {
    "duplicateDocs": {
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "createdDate": {
                  "from": "2020-06-01",
                  "to": null,
                  "include_lower": true,
                  "include_upper": true,
                  "boost": 1
                }
              }
            },
            {
              "bool": {
                "should": [
                  {
                    "terms": {
                      "messageType": [
                        "short",
                        "long",
                        "superlong"
                      ]
                    }
                  },
                  {
                    "prefix": {
                      "messageType": "veryshort"
                    }
                  }
                ],
                "must_not": [
                  {
                    "terms": {
                      "matchingType": [
                        "six",
                        "one"
                      ]
                    }
                  }
                ]
              }
            }
          ]
        }
      },
      "aggs": {
        "duplicateCount": {
          "terms": {
            "field": "Txn_Ref.keyword",
            "min_doc_count": 2,
            "size": 1000
          }
        }
      }
    }
  }
}
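Since the duplicates are grouped per Txn_Ref, another option I'm considering is doing the "keep the oldest" selection client-side: page through the matching hits (e.g. with scroll or search_after), group them by Txn_Ref, and delete everything except the earliest document in each group. A minimal sketch of that grouping logic, with hypothetical ids and dates:

```python
from collections import defaultdict

def ids_to_delete(docs):
    """Given hits as (doc_id, txn_ref, created_date) tuples, keep the
    earliest document per Txn_Ref and return the ids of the rest."""
    groups = defaultdict(list)
    for doc_id, txn_ref, created in docs:
        groups[txn_ref].append((created, doc_id))
    doomed = []
    for members in groups.values():
        members.sort()  # earliest createdDate first
        doomed.extend(doc_id for _, doc_id in members[1:])
    return doomed

hits = [
    ("a1", "TXN-001", "2020-06-02"),
    ("a2", "TXN-001", "2020-06-01"),  # original, kept
    ("b1", "TXN-002", "2020-06-03"),  # no duplicate, kept
]
print(ids_to_delete(hits))  # ['a1']
```

The returned ids could then be fed to a bulk delete. This assumes createdDate values sort lexicographically (true for ISO-8601 date strings) and that fetching all duplicate hits is feasible.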