I have two indices that should be identical, but have a slightly different document count. I'd like to determine which documents only exist in one of the indexes.
What I initially tried failed and I'll get to that in a moment. After the failure I tried to see if I could even find duplicates in each individual index:
{
"size": 0,
"aggs": {
"type_count": {
"cardinality": {
"field": "db_filename",
"precision_threshold": 40000
}
}
}
}
This returns 1967988. If I visit the _count
endpoint I get the number: 1972559. This number is larger, so I'm assuming some duplicates. However neither of these queries yield any results.
{
"size": 0,
"aggs": {
"group_doc_by_id": {
"terms": {
"field": "db_filename"
},
"aggs": {
"select_group_with_count_1": {
"bucket_selector": {
"buckets_path": {
"the_doc_count": "_count"
},
"script": "params.the_doc_count > 1"
}
}
}
}
}
}
OR
{
"size": 0,
"query": {
"match_all": {}
},
"aggregations": {
"doc_duplicates": {
"terms": {
"field": "db_filename",
"min_doc_count": 2
}
}
}
}
This makes me think I have little chance of identifying the non duplicated documents across two indices with the first query above, or the following
{
"size": 0,
"aggs": {
"group_by_uid": {
"terms": {
"field": "db_filename"
},
"aggs": {
"count_indices": {
"cardinality": {
"field": "_index"
}
},
"values_bucket_filter_by_index_count": {
"bucket_selector": {
"buckets_path": {
"count": "count_indices"
},
"script": "params.count < 2"
}
}
}
}
}
}
Neither of these queries show any results across the two indices. These queries are running against an index across 6 nodes, so the doc_count_error_upper_bound
is reported as 6 when running on a single index and 12 when running across two indices. Are these queries failing due to the limitations of the approximation algorithm used? The two indices I'm comparing have around two million documents each with a discrepancy in count of around 2000, so at least that many documents should be non-duplicated and therefore returned by these queries.
Is there another query I should use, or some tool that checks consistency across multiple indices and reports which docs are different?