Trying to find duplicates in a nested array


(John Harding) #1

I have a document type "foo" with an array of nested documents called "properties". I want to find any "foo"s that have duplicate entries in the "properties" array.

e.g. given two "foos":

"foo1": { "properties", [ { "display": "prop1"}, {"display": "prop2"} ]

and

"foo2": { "properties", [ { "display": "prop1"}, {"display": "prop1"} ]

I'm trying to construct a search that returns "foo2" but not "foo1" (foo2 has two properties with a display field of "prop1").

I know how to get counts of all values across all documents. But I can't figure out how to create a buckets array per top level document.

i.e. when I issue this query

GET /7/_search
{
    "size": 0,
    "query": {
        "match_all": {}
    },
    "aggs": {
        "prop_counts": {
            "nested" : {
                "path": "properties"            
            },
            "aggs": {
                "inner_prop_counts": {
                    "terms": {
                        "field": "properties.display",
                    "size": 1000
                    
                    }
                }
            }            
        }    
    }
}

I get results across all the top level documents:

{
   "took": 24,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 764,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "prop_counts": {
         "doc_count": 35754,
         "inner_prop_counts": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
               {
                  "key": "name",
                  "doc_count": 3788
               },
               {
                  "key": "type",
                  "doc_count": 3192
               },
               {
                  "key": "family",
                  "doc_count": 2228
               },
               {
                  "key": "offset",
                  "doc_count": 2202
               },
              ....

How can I get a set of results that is limited to each top level document? And then how do I filter that result to just give back results where doc_count > 2?

Any help greatly appreciated.

Thanks,
John H


(Dan Tuffery) #2

You can use the min_doc_count property to only return results where doc count is > 1. It is not possible to aggregate on the _id field but it is possible to aggregate on the _uid field which is a combination of the type and id:

{
  "size":0,
  "query":{
      "match_all":{}
  },
  "aggs":{
	"prop_counts":{
	  "nested":{
		"path":"properties"
	  },
	  "aggs":{
		"inner_prop_counts":{
		  "terms":{
			"field":"properties.display",
			"min_doc_count":2
		  },
		  "aggs":{
			"doc_id":{
			  "terms":{
				"field":"_uid",
				"min_doc_count":2
			   }
		       }
		   }
		}
	    }
	}
    }
}

This returns the following response:

"aggregations": {
    "prop_counts": {
        "doc_count": 4,
        "inner_prop_counts": {
            "doc_count_error_upper_bound": 0,
            {
                "sum_other_doc_count": 0,
                "buckets": [
                    {
                        "key": "prop1",
                        "doc_count": 3,
                        "doc_id": {
                            "doc_count_error_upper_bound": 0,
                            "sum_other_doc_count": 0,
                            "buckets": [
                                {
                                    "key": "foo#2",
                                    "doc_count": 2
                                }
                            ]
                        }
                    }
                ]
            }
        }
    }
}

If you want to return the actual documents you can use the top hits aggregation.


(John Harding) #3

Many thanks!


(system) #4