Searching for duplicate terms in an index by ID: the ID gets split on the delimiter "-"

Hello, I'm trying to find duplicates based on the term "TDID" in an index.
A TDID looks like this: "00000-00000-0000000-00000"

GET ttd-2019-04*/_search
{
  "size": 0,
  "_source": ["TDID"],
  "query": {
    "term": {
      "eventName": "impressions"
    }
  },
  "aggs": {
    "duplicate_aggs": {
      "terms": {
        "field": "TDID",
        "min_doc_count": 2
      }
    }
  }
}

I got this result:
"aggregations" : {
  "duplicate_aggs" : {
    "doc_count_error_upper_bound" : 5448,
    "sum_other_doc_count" : 77814191,
    "buckets" : [
      {
        "key" : "0000",
        "doc_count" : 261766
      },
      {
        "key" : "00000000",
        "doc_count" : 261542
      },
      {
        "key" : "000000000000",
        "doc_count" : 261542
      },
      {
        "key" : "4cc0",
        "doc_count" : 3540
      }
    ]
  }
}

As you can see, the TDID was split into pieces before the aggregation was run, so the buckets are fragments of IDs rather than whole IDs. Could you please help me with this? Thanks for the support.

Vinothkumar,

Are you using any special mappings for these indexes? Which version of Elasticsearch are you running?

My suspicion is that your TDID field is mapped as a text datatype. A text field's contents will be split into tokens for searching. The mapping you want for an identifier like this is probably a keyword datatype, which stores exact values like email addresses or phone numbers efficiently for searching.
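
One way to check this, sketched under assumptions: the first request below asks for the mapping of the TDID field on the indices from your query, and the second runs your sample value through the built-in standard analyzer. The standard analyzer is what a text field uses unless the mapping names a different one, so treating it as your field's analyzer is an assumption until the mapping response confirms it.

```
# Show how TDID is mapped (look for "type": "text" versus "type": "keyword")
GET ttd-2019-04*/_mapping/field/TDID

# List the tokens the standard analyzer produces for the sample TDID
GET _analyze
{
  "analyzer": "standard",
  "text": "00000-00000-0000000-00000"
}
```

If the second response lists several tokens rather than one, that would explain why the aggregation buckets are fragments of the ID.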

With the default dynamic mapping, you might already have a keyword sub-field named TDID.keyword. If so, you could use the following aggregation:

{
  [...],
  "aggs": {
    "duplicate_aggs": {
      "terms": {
        "field": "TDID.keyword",
        "min_doc_count": 2
      }
    }
  }
}

If that doesn't work, the correct answer will depend on which Elasticsearch version you are running and what your index's mappings are.

-William

@William_Brafford Thanks for your response. The above works, and I can now see the duplicates. One clarification, if possible: I need to create new documents by grouping (merging) on the TDID, combining specific fields into a single document that is then inserted into an index.

In simple terms: there should be one document per TDID, with its eventName values, such as impression and clicks, combined as a list in that document.

Index: tdid-based
Doc:
field: TDID : 000-000-03251-0000
field: events: ['Clicks', 'impression']

That way, I could see each TDID together with its different events.
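
As a rough sketch of the grouping described here, and not a complete solution, a terms aggregation on the TDID can carry a nested terms aggregation that lists the event names seen for each ID. The aggregation names by_tdid and events are made up for illustration, the size of 100 is arbitrary, and eventName.keyword is an assumption: it only exists if eventName was dynamically mapped as text with a keyword sub-field (use eventName directly if it is already a keyword).

```
# For each TDID bucket, list the distinct event names and their counts.
# by_tdid and events are illustrative names; eventName.keyword is assumed to exist.
GET ttd-2019-04*/_search
{
  "size": 0,
  "aggs": {
    "by_tdid": {
      "terms": {
        "field": "TDID.keyword",
        "size": 100
      },
      "aggs": {
        "events": {
          "terms": {
            "field": "eventName.keyword"
          }
        }
      }
    }
  }
}
```

This only returns the grouped view in the search response. Writing one document per TDID into an index such as tdid-based would still need a separate step, for example a client-side script that reads the buckets and indexes the merged documents.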
