Getting All records with doc_count >= 2

minhaj_shakeel · April 14, 2022, 9:01am

I have 2 es indices. There are some records common in both the indices, I want to fetch them. My preliminary query is :-

GET index1,index2/_search
{
  "size": 0,
  "aggs": {
    "duplicate record": {
      "terms": {
        "field": "Id"
      },
      "aggs": {
        "get duplicate": {
          "bucket_selector": {
            "buckets_path": {
              "doc_count": "_count"
            },
            "script": "params.doc_count == 2"
          }
        }
      }
    }
  }
}

This query returns the following result:-

{
  "took" : 687,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "duplicate record" : {
      "doc_count_error_upper_bound" : 2,
      "sum_other_doc_count" : 3000689,
      "buckets" : [
        {
          "key" : "1989861726",
          "doc_count" : 2
        },
        {
          "key" : "1989861734",
          "doc_count" : 2
        },
        {
          "key" : "1989861860",
          "doc_count" : 2
        },
        {
          "key" : "1989862267",
          "doc_count" : 2
        },
        {
          "key" : "1989862280",
          "doc_count" : 2
        },
        {
          "key" : "1989862642",
          "doc_count" : 2
        },
        {
          "key" : "1989862730",
          "doc_count" : 2
        },
        {
          "key" : "2004088312",
          "doc_count" : 2
        },
        {
          "key" : "2004088315",
          "doc_count" : 2
        },
        {
          "key" : "2004088321",
          "doc_count" : 2
        }
      ]
    }
  }
}

I want full records associated with these id's. Can someone help me with that?

RabBit_BR · April 14, 2022, 1:06pm

you do not want: params.doc_count >= 2 ?

anime_lover · April 15, 2022, 5:24am

hi there ,
i am confused with your question can you elaborate a bit ...like what are the fields containing in those indices ? are both indices consist of same fields ? or only id field is common?

Mark_Harwood · April 15, 2022, 12:39pm

The terms aggregation has a min_doc_count setting which does this but can be inaccurate if there are high numbers of ids that occur infrequently. It’s what I call the Elizabeth Taylor problem
If you have many unique rare terms you may need to look at a different strategy which we can discuss.

minhaj_shakeel · April 18, 2022, 6:18am

both the indices contain the same field. and all entries are unique in an index. However, 2 indices may have common records with the same Id.
My Question:- Give me the list of all records which are present in both the indices.

minhaj_shakeel · April 18, 2022, 6:19am

My question is how to get records from the term aggregation? Like how to filter hits based on the aggregation?

Mark_Harwood · April 18, 2022, 6:28am

For low cardinality terms you’d use A nested top hits agg but for high cardinality you’d need to issue a second follow-up request.
Your main problem is likely to be firstly correctly identifying the >2 terms.

minhaj_shakeel · April 18, 2022, 7:21am

I did not get your solution for higher cardinality terms? Can you please elaborate?

minhaj_shakeel · April 18, 2022, 9:50am

In the nested top hits agg query, my main concern is how to get all the buckets without specifying the size.

anime_lover · April 18, 2022, 11:32am

you can get all bucket by using pagination .For that you can use bucket sort or composite aggregation

minhaj_shakeel · April 18, 2022, 11:33am

I am unable to find how to write my use case in composite aggregation. (with top_hit). Can you point to the relevant doc?

anime_lover · April 18, 2022, 11:51am

composite aggregation doesn't support ordering so use bucket sort to find top hit .I don't know from where i found this query but i think it might help you out.

GET /billdetail/_search
{
  "size": 0,
  "aggs": {
    "topData": {
      "terms": {
        "field": "billid.keyword",
        "size": 100
      },
      
      "aggs": {
        "paging": {
          "bucket_sort": {
            "sort": [
              {
                "_count": {
                  "order": "desc"
                }
              }
            ],
            "from": 0,
            "size": 10
          }
        }
      }
    }
  }
}

take it as a reference and also look about bucket sort in Elasticsearch documentation.

Mark_Harwood · April 18, 2022, 12:00pm

Put simply, if the data you want to link together is spread across multiple machines and there’s lots of individual keys to be linked on, then that’s a tough ask of any system.
To offer fast analysis you need to prepare the data better by bringing related data closer together. Custom document “routing” can be used to keep related content on the same machine while the new “transforms” api can be used to copy related data into the same document.

minhaj_shakeel · April 18, 2022, 12:43pm

1 - why you have mentioned the inner size = 100 when we don't know how many buckets will be there

anime_lover · April 19, 2022, 5:03am

oh sorry about that ...
in my project i just needed top 100 so i used that ......
It is recommended to keep it large i don't know exact value but in documentation it told to keep large ...
According to my understanding it is like a ocean which collects all the bucket and sort that bucket ..
And one more thing i didn't find any difference in result on keeping size to 100 and size to 1000 .

If you find my concept is wrong then feel free to correct it and give suggestions thank you.

system · May 17, 2022, 5:04am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Min doc sub aggregation (find duplicates) Elasticsearch	1	483	October 7, 2017
Query for counting docs in all indices Elasticsearch	9	3621	October 18, 2018
How to get the count of the documents inside term aggregation that have doc_count as 1? Elasticsearch	1	496	January 11, 2018
Different aggregation count for the same value Elasticsearch	1	317	March 2, 2023
Elastic Query unexcepted aggregations doc_count Elasticsearch	4	126	April 29, 2024

Getting All records with doc_count >= 2

Related topics