Getting All records with doc_count >= 2

I have 2 es indices. There are some records common in both the indices, I want to fetch them. My preliminary query is :-

GET index1,index2/_search
{
  "size": 0,
  "aggs": {
    "duplicate record": {
      "terms": {
        "field": "Id"
      },
      "aggs": {
        "get duplicate": {
          "bucket_selector": {
            "buckets_path": {
              "doc_count": "_count"
            },
            "script": "params.doc_count == 2"
          }
        }
      }
    }
  }
}

This query returns the following result:-

{
  "took" : 687,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "duplicate record" : {
      "doc_count_error_upper_bound" : 2,
      "sum_other_doc_count" : 3000689,
      "buckets" : [
        {
          "key" : "1989861726",
          "doc_count" : 2
        },
        {
          "key" : "1989861734",
          "doc_count" : 2
        },
        {
          "key" : "1989861860",
          "doc_count" : 2
        },
        {
          "key" : "1989862267",
          "doc_count" : 2
        },
        {
          "key" : "1989862280",
          "doc_count" : 2
        },
        {
          "key" : "1989862642",
          "doc_count" : 2
        },
        {
          "key" : "1989862730",
          "doc_count" : 2
        },
        {
          "key" : "2004088312",
          "doc_count" : 2
        },
        {
          "key" : "2004088315",
          "doc_count" : 2
        },
        {
          "key" : "2004088321",
          "doc_count" : 2
        }
      ]
    }
  }
}

I want full records associated with these id's. Can someone help me with that?

1 Like

you do not want: params.doc_count >= 2 ?

hi there ,
i am confused with your question can you elaborate a bit ...like what are the fields containing in those indices ? are both indices consist of same fields ? or only id field is common?

The terms aggregation has a min_doc_count setting which does this but can be inaccurate if there are high numbers of ids that occur infrequently. It’s what I call the Elizabeth Taylor problem
If you have many unique rare terms you may need to look at a different strategy which we can discuss.

both the indices contain the same field. and all entries are unique in an index. However, 2 indices may have common records with the same Id.
My Question:- Give me the list of all records which are present in both the indices.

My question is how to get records from the term aggregation? Like how to filter hits based on the aggregation?

For low cardinality terms you’d use A nested top hits agg but for high cardinality you’d need to issue a second follow-up request.
Your main problem is likely to be firstly correctly identifying the >2 terms.

I did not get your solution for higher cardinality terms? Can you please elaborate?

In the nested top hits agg query, my main concern is how to get all the buckets without specifying the size.

you can get all bucket by using pagination .For that you can use bucket sort or composite aggregation

I am unable to find how to write my use case in composite aggregation. (with top_hit). Can you point to the relevant doc?

composite aggregation doesn't support ordering so use bucket sort to find top hit .I don't know from where i found this query but i think it might help you out.

GET /billdetail/_search
{
  "size": 0,
  "aggs": {
    "topData": {
      "terms": {
        "field": "billid.keyword",
        "size": 100
      },
      
      "aggs": {
        "paging": {
          "bucket_sort": {
            "sort": [
              {
                "_count": {
                  "order": "desc"
                }
              }
            ],
            "from": 0,
            "size": 10
          }
        }
      }
    }
  }
}

take it as a reference and also look about bucket sort in Elasticsearch documentation.

Put simply, if the data you want to link together is spread across multiple machines and there’s lots of individual keys to be linked on, then that’s a tough ask of any system.
To offer fast analysis you need to prepare the data better by bringing related data closer together. Custom document “routing” can be used to keep related content on the same machine while the new “transforms” api can be used to copy related data into the same document.

1 - why you have mentioned the inner size = 100 when we don't know how many buckets will be there

1 Like

oh sorry about that ...
in my project i just needed top 100 so i used that ......
It is recommended to keep it large i don't know exact value but in documentation it told to keep large ...
According to my understanding it is like a ocean which collects all the bucket and sort that bucket ..
And one more thing i didn't find any difference in result on keeping size to 100 and size to 1000 .

If you find my concept is wrong then feel free to correct it and give suggestions thank you.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.