Searching Duplicate terms in Index based on ID, splits the Id with limiter "-"

Vinothkumar_Ganeshan · October 23, 2019, 3:20pm

Hello, I'm trying to find duplicates based on the field "TDID" in an index.
A TDID looks like this: "00000-00000-0000000-00000"

GET ttd-2019-04*/_search
{
  "size": 0,
  "_source": ["TDID"],
  "query": {
    "term": {
      "eventName": "impressions"
    }
  },
  "aggs": {
    "duplicate_aggs": {
      "terms": {
        "field": "TDID",
        "min_doc_count": 2
      }
    }
  }
}

I got the following result:
"aggregations" : {
  "duplicate_aggs" : {
    "doc_count_error_upper_bound" : 5448,
    "sum_other_doc_count" : 77814191,
    "buckets" : [
      {
        "key" : "0000",
        "doc_count" : 261766
      },
      {
        "key" : "00000000",
        "doc_count" : 261542
      },
      {
        "key" : "000000000000",
        "doc_count" : 261542
      },
      {
        "key" : "4cc0",
        "doc_count" : 3540
      }
    ]
  }
}

As you can see, the TDID is split at each "-" and the aggregation is done on the pieces instead of the whole ID. Could you please help me with this? Thanks for the support.

William_Brafford · October 23, 2019, 9:07pm

Vinothkumar,

Are you using any special mappings for these indexes? Which version of Elasticsearch are you running?

My suspicion is that your TDID field is mapped as a text datatype. A text field's contents will be split into tokens for searching. The mapping you want for an identifier like this is probably a keyword datatype, which stores exact values like email addresses or phone numbers efficiently for searching.
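To illustrate the difference, here is a minimal Python sketch. It does not call Elasticsearch: `approx_text_tokens` is only a rough stand-in for text analysis (lowercase, split on non-alphanumeric characters), not the real analyzer, and the sample IDs are made up. It shows why a terms aggregation on a text field yields buckets like "0000" or "4cc0", while a keyword field keeps each ID whole.

```python
import re
from collections import Counter


def approx_text_tokens(value):
    """Rough approximation of text analysis: lowercase, split on non-alphanumerics."""
    return [tok for tok in re.split(r"[^0-9A-Za-z]+", value.lower()) if tok]


def keyword_tokens(value):
    """A keyword field indexes the whole value as a single term."""
    return [value]


# Hypothetical IDs, shortened for readability.
tdids = ["0000-4CC0-00000000", "0000-4CC0-00000000", "0000-1a2b-00000000"]

# A terms aggregation counts documents per term, so each term counts once per document.
text_buckets = Counter(tok for v in tdids for tok in set(approx_text_tokens(v)))
keyword_buckets = Counter(tok for v in tdids for tok in keyword_tokens(v))

print(text_buckets)     # fragments such as "0000" and "4cc0"
print(keyword_buckets)  # whole IDs; the first one appears twice
```

With the keyword-style buckets, any ID whose count is 2 or more is a real duplicate, which is what `min_doc_count: 2` is meant to select.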

You might have a keyword sub-field as a default mapping (with dynamic mapping, string fields typically get one named <field>.keyword). If so, you could use the following aggregation:

{
  [...],
  "aggs": {
    "duplicate_aggs": {
      "terms": {
        "field": "TDID.keyword",
        "min_doc_count": 2
      }
    }
  }
}

If that doesn't work, the correct answer will depend on which Elasticsearch version you are running and what your index's mappings are.
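As a rough way to check which field name to aggregate on, the sketch below inspects the `properties` section of a mapping (as returned by the get-mapping API) and looks for a keyword field or keyword sub-field. The helper name `aggregatable_field` and the sample mapping are hypothetical; your actual mapping may differ.

```python
def aggregatable_field(properties, field):
    """Return a field name suitable for a terms aggregation, or None if there is none."""
    spec = properties.get(field, {})
    if spec.get("type") == "keyword":
        return field
    # Look for a keyword multi-field, e.g. "TDID.keyword".
    for sub_name, sub_spec in spec.get("fields", {}).items():
        if sub_spec.get("type") == "keyword":
            return f"{field}.{sub_name}"
    return None


# Hypothetical mapping, shaped like what dynamic mapping usually produces for strings.
sample_properties = {
    "TDID": {
        "type": "text",
        "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
    },
    "eventName": {"type": "keyword"},
    "message": {"type": "text"},
}

print(aggregatable_field(sample_properties, "TDID"))
print(aggregatable_field(sample_properties, "eventName"))
print(aggregatable_field(sample_properties, "message"))
```

If the helper returns None for your field, the index has no keyword version of it, and the mapping would need to change (followed by a reindex) before an exact-value aggregation is possible.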

-William

Vinothkumar_Ganeshan · October 24, 2019, 8:28am

@William_Brafford Thanks for your response. The above works, and I can now see the duplicates. One clarification, if possible: I need to create a new document for each TDID by grouping (merging) the documents that share it, combining specific fields, and then insert the result into a new index.

In short: there should be one document per TDID, in which its event names (impression, clicks, and so on) are combined into a list.

Index: tdid-based
Doc:
field: TDID : 000-000-03251-0000
field: events: ['Clicks', 'impression']

That way, I could see each TDID together with its different events.
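The merge described here can be sketched client-side in Python. This is only an illustration of the grouping logic on in-memory documents, assuming each source document has a `TDID` and an `eventName` field. The helper name `merge_by_tdid` and the sample documents are hypothetical, and reading from or writing to Elasticsearch is left out.

```python
from collections import defaultdict


def merge_by_tdid(docs):
    """Group documents by TDID and collect their distinct event names into a list."""
    events_by_tdid = defaultdict(set)
    for doc in docs:
        events_by_tdid[doc["TDID"]].add(doc["eventName"])
    # One merged document per TDID, sorted for a stable result.
    return [
        {"TDID": tdid, "events": sorted(events)}
        for tdid, events in sorted(events_by_tdid.items())
    ]


# Hypothetical source documents.
sample_docs = [
    {"TDID": "000-000-03251-0000", "eventName": "Clicks"},
    {"TDID": "000-000-03251-0000", "eventName": "impression"},
    {"TDID": "000-000-03251-0000", "eventName": "impression"},
    {"TDID": "000-000-09999-0000", "eventName": "impression"},
]

merged = merge_by_tdid(sample_docs)
for doc in merged:
    print(doc)
```

Each merged dictionary could then be indexed into the target index (tdid-based), using the TDID as the document ID so that re-running the merge overwrites rather than duplicates.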

system · November 21, 2019, 8:35am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
