Hi,
I am using Elasticsearch 7.4, and I am seeing unexpected results during data aggregation and data curation.
Let me explain the scenario. I have 100 user_ids, among them AA, AB, AC, and AD. When I aggregate on those four user_ids, I also get extra ones, such as BA and BB.
Please note that these user_ids start with A and B, and they live in Cluster 1. The issue arose when I started migrating all data starting with A to Cluster 2.
The query I was using:
GET sturdent_data-2022*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "user_id": "AA"
          }
        },
        {
          "match": {
            "user_id": "AB"
          }
        },
        {
          "match": {
            "user_id": "AC"
          }
        },
        {
          "match": {
            "user_id": "AD"
          }
        }
      ]
    }
  },
  "aggs": {
    "NAME": {
      "terms": {
        "field": "user_id.keyword",
        "size": 100
      }
    }
  }
}
The output of the query is:
"aggregations" : {
  "NAME" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 0,
    "buckets" : [
      {
        "key" : "AA",
        "doc_count" : 98
      },
      {
        "key" : "AB",
        "doc_count" : 74
      },
      {
        "key" : "BA",
        "doc_count" : 68
      },
      {
        "key" : "AD",
        "doc_count" : 54
      },
      {
        "key" : "AC",
        "doc_count" : 35
      },
      {
        "key" : "BB",
        "doc_count" : 12
      }
    ]
  }
}
In the output, user_id BA and BB appear as unexpected extra data.
The same thing happens during curation using _delete_by_query.
I found a workaround for the aggregation: use the keyword sub-field when searching.
{
  "match": {
    "user_id.keyword": "AD"
  }
}
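For completeness, here is a sketch of the full search with the keyword workaround applied to all four ids. It assumes user_id.keyword is the default keyword sub-field created by dynamic mapping, and it swaps the four match clauses for a single terms query in filter context, which is a substitution for illustration rather than the query shown above.

```
GET sturdent_data-2022*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "user_id.keyword": ["AA", "AB", "AC", "AD"]
          }
        }
      ]
    }
  },
  "aggs": {
    "NAME": {
      "terms": {
        "field": "user_id.keyword",
        "size": 100
      }
    }
  }
}
```

The terms query is not analyzed, so it only matches documents whose user_id.keyword equals one of the listed values exactly.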
But this does not work during curation: it removes all the data.
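As a sketch only, the keyword workaround applied to curation would be expected to take roughly the shape below. It assumes the same index pattern and a user_id.keyword sub-field, and it is not confirmed to avoid the "removes all the data" behaviour described here. Running the identical query through _count first shows how many documents the delete would touch.

```
# Dry run: count the documents the delete would match
GET sturdent_data-2022*/_count
{
  "query": {
    "terms": {
      "user_id.keyword": ["AA", "AB", "AC", "AD"]
    }
  }
}

# Only if the count looks right, run the delete with the same query
POST sturdent_data-2022*/_delete_by_query
{
  "query": {
    "terms": {
      "user_id.keyword": ["AA", "AB", "AC", "AD"]
    }
  }
}
```

If the count is far larger than the sum of the expected doc_count values for those ids, the query is matching more than intended and the delete should not be run.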
Now I need to get rid of this unexpected data during both searching and data curation.