Aggregation still returning duplicates

mrbushid0 · January 15, 2018, 12:01am

Hi guys,

I'm trying to get all unique values for a field that is not analyzed, but I am still getting duplicates:

Mapping:

{
  "County": {
    "type": "string",
    "fielddata": true,
    "fields": {
      "raw": {
        "type": "string",
        "index": "not_analyzed"
      }
    }
  }
}

Query:

{
  "size": 999,
  "_source": ["County"],
  "aggs": {
    "group_by_name": {
      "terms": { "field": "County.raw" },
      "aggs": {
        "remove_dups": {
          "top_hits": {
            "size": 1,
            "_source": false
          }
        }
      }
    }
  }
}

Results:

{"_index":"properties","_type":"industrial","_id":"645","_score":1.0,"_source":{"County":"Oakland"}},{"_index":"properties","_type":"industrial","_id":"646","_score":1.0,"_source":{"County":"Oakland"}}

Any ideas as to why the aggregation is being ignored?

dadoonet · January 15, 2018, 2:24am

Is it with an old version of elasticsearch?

mrbushid0 · January 15, 2018, 2:36am

I'm currently running version 5.6.5

dadoonet · January 15, 2018, 7:40am

I tried this:

DELETE test
PUT test
{
  "mappings": {
    "doc": {
      "properties": {
        "County": {
          "type": "text",
          "fields": {
            "raw": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}
PUT test/doc/1
{
  "County": "Oakland"
}
PUT test/doc/2
{
  "County": "Oakland"
}
GET test/_search
{
  "size": 999,
  "_source": [
    "County"
  ],
  "aggs": {
    "group_by_name": {
      "terms": {
        "field": "County.raw"
      },
      "aggs": {
        "remove_dups": {
          "top_hits": {
            "size": 1,
            "_source": false
          }
        }
      }
    }
  }
}

And I'm getting this:

{
  "took": 12,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "test",
        "_type": "doc",
        "_id": "2",
        "_score": 1,
        "_source": {
          "County": "Oakland"
        }
      },
      {
        "_index": "test",
        "_type": "doc",
        "_id": "1",
        "_score": 1,
        "_source": {
          "County": "Oakland"
        }
      }
    ]
  },
  "aggregations": {
    "group_by_name": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "Oakland",
          "doc_count": 2,
          "remove_dups": {
            "hits": {
              "total": 2,
              "max_score": 1,
              "hits": [
                {
                  "_index": "test",
                  "_type": "doc",
                  "_id": "2",
                  "_score": 1
                }
              ]
            }
          }
        }
      ]
    }
  }
}

mrbushid0 · January 15, 2018, 3:04pm

Shouldn't the query return only a single "Oakland", or am I mistaken?

val · January 15, 2018, 3:07pm

The hits section returns all matching documents, two in your case, which is correct.
The aggregations section returns the unique counties, one in your case, which is also correct.
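
If only the distinct values are wanted, one option is to set the top-level "size" to 0 so that no documents come back in the hits section, and then read the bucket keys from the aggregations section. Here is a minimal sketch of a request body for GET test/_search, assuming the test index from the example above. The terms "size" of 1000 is an arbitrary illustrative cap (the terms aggregation returns 10 buckets by default), and the top_hits sub-aggregation is left out because it is not needed just to list the unique values:

```json
{
  "size": 0,
  "aggs": {
    "group_by_name": {
      "terms": {
        "field": "County.raw",
        "size": 1000
      }
    }
  }
}
```

With the two documents indexed above, this should leave hits.hits empty and return a single bucket with key "Oakland" and doc_count 2.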

system · February 12, 2018, 3:08pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
