Query to return documents where the list field in each document contains duplicate item values

I am wondering how to construct a query that can return documents where each contains a list field, but with duplicate values in that field. Here's an example setup:

PUT nestedtest
{
  "mappings": {
    "type": {
      "properties": {
        "history": {
          "type": "nested"
        }
      }
    }
  }
}

PUT nestedtest/type/1
{
    "name" : "Bob",
    "history" : [ 
    { 
        "date" : "2016-01-01", 
        "status" : "None" , 
        "color" : "red" 
    }, 
    { 
        "date" : "2016-01-01",
        "status" : "None" , 
        "color" : "blue" 
    },
    { 
        "date" : "2016-01-02", 
        "status" : "None", 
        "color" : "green" 
    } 
    ]
}

PUT nestedtest/type/2
{
    "name" : "Jane",
    "history" : [ 
    { 
        "date" : "2016-01-01", 
        "status" : "None",
         "color" : "red" 
    }, 
    {
         "date" : "2016-01-01",
         "status" : "Done", 
        "color" : "blue" 
    },
    { 
        "date" : "2016-01-02", 
        "status" : "None", 
        "color" : "green"
    } 
    ]
}

My interest is in querying the "history" field, which has been mapped as a nested field, in each document and checking for cases where the "date" field matches and where the "status"=="None" field match. So in this case, the query will return the first document with "_id" : 1, because the first and second item of it's "history" field contains the same value for "date" and "status"=="None" respectively.

I think that it is not possible to do this with your current mapping, see also the discussion here: https://github.com/elastic/elasticsearch/issues/16380
Similar to the suggestion on the ticket, you could duplicate the _id into the nested docs and then use a multi-fields term aggregation (scripted terms agg concatenating date, status, and the id of the parent) with min_doc_count = 2.

Hi @ywelsch, thanks for the link. I'm looking to avoid duplicating the _id field into the list as I don't really have much control over the mapping of fields in the documents since we stream data from our production MongoDB database into Elasticsearch; the example I have shown here is a much simplified version of the problem I'm dealing with in my production data. I figure this query could be implemented via the use of scripted metric aggregations and Groovy scripts. Some ideas or suggestions along these lines would be greatly appreciated.

Here's a Groovy scripted query that exactly solved my problem! It's a bit slow but does the job:

POST nestedtest/_search
{
  "query": {
    "script": {
      "script": {
        "inline": "_source.history.findAll {it.status == 'None'}.countBy {it.date}.find {it.value > 1} != null"
      }
    }
  }
}

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.