Hi everyone,
We recently ran into a problem with Elasticsearch. We are trying to detect records that share a duplicate value (a hash) and patch them with a flag, so that we can iterate over all records as we go. (There are 27 million records in total, of which 6 million have the hash populated.)
This workflow worked fine while the index sat on a single node. Eventually the index grew, and we moved it to multiple nodes to retain performance.
After the move, the aggregation result was no longer accurate: some hashes that we know for certain have a doc count greater than 1 are not returned. I also ran the query with a min_doc_count of 1, and the aggregation still does not return some of the values (hashes) that it should.
In the query below, removing the must_not and adding a different condition that should match the missing hashes does not work either. If we add an explicit equality condition, key = value, for a hash that we know is missing, the aggregation returns the correct result and count; otherwise the key is missing altogether.
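One possible mechanism that would match these symptoms, sketched as a toy model and not verified against this cluster: in a terms aggregation each shard reports only its top shard_size terms, and the coordinating node merges those partial lists. A hash whose two documents land on different shards has a local count of 1 on each shard and competes with every other single-occurrence hash for a place in that shard's list. The function `terms_agg`, the sample hashes, and the tie-break by key are illustrative assumptions.

```python
from collections import Counter


def terms_agg(shards, size, shard_size, min_doc_count):
    """Toy model of a terms aggregation merged across shards.

    `shards` is a list of lists; each inner list holds the hash values
    of the documents stored on one shard.
    """
    merged = Counter()
    for docs in shards:
        local = Counter(docs)
        # Each shard reports only its top `shard_size` terms, ordered by
        # local doc count (descending). Ties are broken by key here,
        # which is an assumption of this sketch.
        top = sorted(local.items(), key=lambda kv: (-kv[1], kv[0]))[:shard_size]
        for key, count in top:
            merged[key] += count
    # The coordinating step only sees what the shards reported.
    kept = [(k, c) for k, c in merged.items() if c >= min_doc_count]
    kept.sort(key=lambda kv: (-kv[1], kv[0]))
    return dict(kept[:size])


# The same eight documents, first on one shard, then split across two.
single_shard = [["a1", "a2", "a3", "dup", "b1", "b2", "b3", "dup"]]
two_shards = [["a1", "a2", "a3", "dup"], ["b1", "b2", "b3", "dup"]]

print(terms_agg(single_shard, size=2, shard_size=2, min_doc_count=2))
print(terms_agg(two_shards, size=2, shard_size=2, min_doc_count=2))

# Filtering to the one known hash first leaves a single term per shard,
# so it always fits inside shard_size.
only_dup = [[h for h in docs if h == "dup"] for docs in two_shards]
print(terms_agg(only_dup, size=2, shard_size=2, min_doc_count=2))
```

In this model the single-shard run and the explicitly filtered run both find the duplicate, while the unfiltered two-shard run does not, because "dup" never makes it into either shard's top two terms.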
I am hoping a few of you can explain what is going on, whether we are doing something wrong, and whether there is a way to rectify the situation.
{
"query": {
"bool": {
"must_not": [
{
"bool": {
"filter": [
{
"exists": {
"field": "$type",
"boost": 1
}
},
{
"term": {
"kmeta:Misc": {
"value": "KBXD-R-1611392400028",
"boost": 1
}
}
}
],
"adjust_pure_negative": true,
"boost": 1
}
}
],
"adjust_pure_negative": true,
"boost": 1
}
},
"aggregations": {
"kmeta:fileHash": {
"terms": {
"field": "kmeta:fileHash",
"size": 10000,
"shard_size": 10000,
"min_doc_count": 2,
"shard_min_doc_count": 0
}
}
}
}
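If exact counts for every hash are needed, one option that might be worth trying is a composite aggregation, which pages through all buckets instead of returning only the top terms. The sketch below is untested against a real cluster and rests on several assumptions: `search` is a hypothetical callable that takes a request body and returns the parsed JSON response, the names `find_duplicate_hashes`, `by_hash` and `hash` are made up for illustration, and the minimum doc count is applied on the client side.

```python
def find_duplicate_hashes(search, field="kmeta:fileHash", query=None, page_size=1000):
    """Yield (hash, doc_count) for every hash with at least two documents.

    `search` is a hypothetical callable: request body (dict) in,
    parsed JSON response (dict) out.
    """
    after = None
    while True:
        composite = {
            "size": page_size,
            "sources": [{"hash": {"terms": {"field": field}}}],
        }
        if after is not None:
            # Resume from the last bucket of the previous page.
            composite["after"] = after
        body = {"size": 0, "aggs": {"by_hash": {"composite": composite}}}
        if query is not None:
            body["query"] = query

        agg = search(body)["aggregations"]["by_hash"]
        for bucket in agg["buckets"]:
            # Keep only duplicated hashes (filtered client-side).
            if bucket["doc_count"] >= 2:
                yield bucket["key"]["hash"], bucket["doc_count"]

        after = agg.get("after_key")
        if not agg["buckets"] or after is None:
            break
```

The bool query from the request above could be passed as `query` to keep the same filter. With around 6 million distinct hashes this means thousands of round trips at the default page size, so it trades speed for completeness.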