How do I determine the difference between two indices?


(Brett Anderson) #1

I have two indices that should be identical, but have a slightly different document count. I'd like to determine which documents only exist in one of the indexes.

What I initially tried failed and I'll get to that in a moment. After the failure I tried to see if I could even find duplicates in each individual index:

{
  "size": 0,
  "aggs": {
    "type_count": {
      "cardinality": {
        "field": "db_filename",
        "precision_threshold": 40000
      }
    }
  }
}

This returns 1967988. If I visit the _count endpoint I get the number: 1972559. This number is larger, so I'm assuming some duplicates. However neither of these queries yield any results.

{
  "size": 0,
  "aggs": {
    "group_doc_by_id": {
      "terms": {
        "field": "db_filename"
      },
      "aggs": {
        "select_group_with_count_1": {
          "bucket_selector": {
            "buckets_path": {
              "the_doc_count": "_count"
            },
            "script": "params.the_doc_count > 1"
          }
        }
      }
    }
  }
}

OR

{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggregations": {
    "doc_duplicates": {
      "terms": {
        "field": "db_filename",
        "min_doc_count": 2
      }
    }
  }
}

This makes me think I have little chance of identifying the non duplicated documents across two indices with the first query above, or the following

{
  "size": 0,
  "aggs": {
    "group_by_uid": {
      "terms": {
        "field": "db_filename"
      },
      "aggs": {
        "count_indices": {
          "cardinality": {
            "field": "_index"
          }
        },
        "values_bucket_filter_by_index_count": {
          "bucket_selector": {
            "buckets_path": {
              "count": "count_indices"
            },
            "script": "params.count < 2"
          }
        }
      }
    }
  }
}

Neither of these queries show any results across the two indices. These queries are running against an index across 6 nodes, so the doc_count_error_upper_bound is reported as 6 when running on a single index and 12 when running across two indices. Are these queries failing due to the limitations of the approximation algorithm used? The two indices I'm comparing have around two million documents each with a discrepancy in count of around 2000, so at least that many documents should be non-duplicated and therefore returned by these queries.

Is there another query I should use, or some tool that checks consistency across multiple indices and reports which docs are different?


(Christoph) #2

The Cardinality Aggregation is an approximate count, so if it does not match the exact "_count" value, thats expected.

About finding differences of two indices, can you re-formulate the question? I assume at first you are interested in documents that appear in one index but not in the other, and not about detecting differences in document with the same "_id", since the latter probably requires iterating over all documents. Do the documents have a unique field? Is the "_id" supposed to be the same in both indices?


(system) #3

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.