Finding all documents with duplicate properties

I have to find every document in Elasticsearch that has duplicate properties. My mapping looks something like this:

        "type": {
            "properties": {
                "thisProperty": {
                    "properties": {
                        "id": {
                            "type": "keyword"
                        },
                        "other_id": {
                            "type": "keyword"
                        }
                    }
                }
            }
        }

The documents I have to find have a pattern like this:

    "thisProperty": [
    {
        "other_id": "123",
        "id": "456"
    },
    {
        "other_id": "123",
        "id": "456"
    },
    {
        "other_id": "4545",
        "id": "789"
    }]

So I need to find every document of this type that has repeated property values. I also cannot search by term, because I do not know what the value of either id field is. So far the API hasn't shown me a clear way to do this via a query. Is it possible? If so, how?

The first thing you'd need to do is define thisProperty to be a nested type instead of a regular object. Without defining thisProperty as nested, the relationship between the id and other_id pairs is lost. The docs go into more details on that topic. The resulting mapping would look like this:

{
  "mappings": {
    "type": {
      "properties": {
        "thisProperty": {
          "type": "nested", 
          "properties": {
            "id": {
              "type": "keyword"
            },
            "other_id": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}

Next, you could use a nested aggregation to aggregate on the nested objects. The goal here is to find duplicate objects, which is something you could achieve by running a scripted terms aggregation that concatenates the document's _id, the value of id and of other_id. If we find any duplicates of the resulting concatenated field, we know that this document has a repeating set of properties. Adding "min_doc_count": 2 to the terms aggregation will allow you to see just those duplicates.

Putting all of this together would look like this:

GET test/_search
{
  "size": 0,
  "aggs": {
    "my_nested": {
      "nested": {
        "path": "thisProperty"
      },
      "aggs": {
        "dupes": {
          "terms": {
            "script": """return doc['_id'].value + "_" + doc['thisProperty.id'].value + "_" + doc['thisProperty.other_id'].value;""",
            "size": 100,
            "min_doc_count": 2
          }
        }
      }
    }
  }
}

If your original document had an _id of 1, the resulting output of this aggregation would return:

        "buckets": [
          {
            "key": "1_456_123",
            "doc_count": 2
          }
        ]

... telling you that document 1 has a duplicate set of properties where id is 456 and other_id is 123.
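To make the logic of the aggregation above easier to follow, here is a rough Python sketch that mimics it for a single document on the client side. It is only an illustration of the idea (concatenate `_id`, `id` and `other_id`, count, keep counts of 2 or more), not how Elasticsearch executes it; the function name `duplicate_buckets` and its parameters are made up for this example.

```python
from collections import Counter


def duplicate_buckets(doc_id, source, min_doc_count=2):
    """Mimic the scripted terms aggregation for one document.

    Builds the same "<_id>_<id>_<other_id>" key as the script, counts how
    often each key occurs among the nested objects, and keeps only keys
    that occur at least `min_doc_count` times.
    """
    counts = Counter(
        f"{doc_id}_{obj['id']}_{obj['other_id']}"
        for obj in source.get("thisProperty", [])
    )
    return {key: count for key, count in counts.items() if count >= min_doc_count}


# The example document from the question, assumed to have _id "1".
source = {
    "thisProperty": [
        {"other_id": "123", "id": "456"},
        {"other_id": "123", "id": "456"},
        {"other_id": "4545", "id": "789"},
    ]
}

print(duplicate_buckets("1", source))
```

Each nested object contributes one key, so a key counted twice means the same id/other_id pair appears twice in that document.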

Hi Abdon,
Thank you for your thoughtful reply! One question: I do not need to preserve the relationship between id and other_id (I removed the nested type for this reason). Can this same query be written without the nested type?

If you do not keep the relationship between the id and other_id, what does it mean for properties to be duplicates? Are you just looking for multiple occurrences of id or of other_id in a single document?

Without setting up a nested type, it's not possible to distinguish at search time between your document and this one (notice that none of the objects in this document are actual duplicates):

{
  "thisProperty": [
    {
        "other_id": "123",
        "id": "456"
    },
    {
        "other_id": "123",
        "id": "789"
    },
    {
        "other_id": "4545",
        "id": "456"
    }]
}

or even this one:

{
  "thisProperty.id": ["456", "456", "789"],
  "thisProperty.other_id": ["123", "123", "4545"]
}
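A small Python sketch may help show why these documents look the same to a query. It models, in a simplified way, how an array of plain (non-nested) objects ends up as one list of values per field. The helper name `flatten` is hypothetical, and sorting the values is my assumption to reflect that the position of a value within the array is not something you can rely on at search time.

```python
def flatten(source, field="thisProperty"):
    """Collapse an array of objects into one sorted value list per sub-field.

    This is a simplified model of how a non-nested object array is indexed:
    the values are kept per field, but which values belonged to the same
    object is lost.
    """
    flat = {}
    for obj in source.get(field, []):
        for name, value in obj.items():
            flat.setdefault(f"{field}.{name}", []).append(value)
    return {name: sorted(values) for name, values in flat.items()}


# The document from the question: two identical objects.
doc_with_dupes = {
    "thisProperty": [
        {"other_id": "123", "id": "456"},
        {"other_id": "123", "id": "456"},
        {"other_id": "4545", "id": "789"},
    ]
}

# The counter-example: same values overall, but no two objects are identical.
doc_without_dupes = {
    "thisProperty": [
        {"other_id": "123", "id": "456"},
        {"other_id": "123", "id": "789"},
        {"other_id": "4545", "id": "456"},
    ]
}

print(flatten(doc_with_dupes))
print(flatten(doc_with_dupes) == flatten(doc_without_dupes))
```

Both documents collapse to the same per-field values, which is why a query against the flattened fields has nothing to tell them apart.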

Hi Abdon,
We are looking for multiple occurrences of id and other_id in a single document whose values repeat (i.e., id has the same value more than once). I'm not sure I follow your nested explanation for this particular issue.

Sorry about that. I was trying to make the point that I don’t think you can do what you’re trying to do without setting up the repeating property as a nested type.
