Bucket_selector aggregation and size optimizations

I have a question about the bucket_selector aggregation.
(Environment tested: ES 6.8 and ES 7 basic on CentOS 7)

In my use case I need to drop documents that are duplicates by a selected property. The index is not big, about 2 million records.
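
For context, here is a minimal sketch of the mapping I assume throughout (simplified and hypothetical; only the field used by the queries below, assuming nestedObjects.id is a keyword):

PUT index_id1
{
  "mappings": {
    "properties": {
      "nestedObjects": {
        "type": "nested",
        "properties": {
          "id": { "type": "keyword" }
        }
      }
    }
  }
}

(ES 7 syntax; on 6.8 the properties would sit under a mapping type name.)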
The query to find those records looks like this:

GET index_id1/_search
{
  "size": 0,
  "aggs": {
    "byNested": {
      "nested": {
        "path": "nestedObjects"
      },
      "aggs": {
        "sameIds": {
          "terms": {
            "script": {
              "lang": "painless",
              "source": "return doc['nestedObjects.id'].value"
            },
            "size": 1000
          },
          "aggs": {
            "byId": {
              "reverse_nested": {}
            },
            "byId_bucket_filter": {
              "bucket_selector": {
                "buckets_path": {
                  "totalCount": "byId._count"
                },
                "script": {
                  "source": "params.totalCount > 1"
                }
              }
            }
          }
        }
      }
    }
  }
}

I get the buckets back. But to relax the query and the load, I limit it with size: 1000 and issue the next query to get more dupes, until zero come back. The problem, however, is that this returns too few dupes. I checked the result of the query by setting size: 2000000:

GET index_id1/_search
{
  "size": 0,
  "aggs": {
    "byNested": {
      "nested": {
        "path": "nestedObjects"
      },
      "aggs": {
        "sameIds": {
          "terms": {
            "script": {
              "lang": "painless",
              "source": "return doc['nestedObjects.id'].value"
            },
            "size": 2000000  <-- too big
          },
          "aggs": {
            "byId": {
              "reverse_nested": {}
            },
            "byId_bucket_filter": {
              "bucket_selector": {
                "buckets_path": {
                  "totalCount": "byId._count"
                },
                "script": {
                  "source": "params.totalCount > 1"
                }
              }
            }
          }
        }
      }
    }
  }
}

As I understand it, the first step actually creates all the buckets as stated in the query, and only then does bucket_selector filter out what I need. That is why I see this kind of behavior. In order to get all the buckets I would have to raise "search.max_buckets" to 2000000.
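
For reference, raising that limit is a dynamic cluster setting (a sketch; 2000000 just mirrors the terms size above, I have not applied it anywhere):

PUT _cluster/settings
{
  "persistent": {
    "search.max_buckets": 2000000
  }
}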

Converted to a query with a composite aggregation:

GET index_id1/_search
{
  "aggs": {
    "byNested": {
      "nested": {
        "path": "nestedObjects"
      },
      "aggs": {
        "compositeAgg": {
          "composite": {
            "after": {
              "termsAgg": "03f10a7d-0162-4409-8647-c643274d6727"
            },
            "size": 1000,
            "sources": [
              {
                "termsAgg": {
                  "terms": {
                    "script": {
					  "lang": "painless",
					  "source": "return doc['nestedObjects.id'].value"
					}
                  }
                }
              }
            ]
          },
          "aggs": {
            "byId": {
              "reverse_nested": {}
            },
            "byId_bucket_filter": {
              "bucket_selector": {
                "script": {
                  "source": "params.totalCount > 1"
                },
                "buckets_path": {
                  "totalCount": "byId._count"
                }
              }
            }
          }
        }
      }
    }
  },
  "size": 0
}

As I understand it, this does the same thing, except that I need to make about 2000 calls (size: 1000 each) to page over the whole index.
Does the composite agg cache the results, or why is this better?
Maybe there is a better approach for this case?
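
To illustrate the paging I mean: each response carries an after_key for the composite aggregation, e.g.

"compositeAgg": {
  "after_key": {
    "termsAgg": "03f10a7d-0162-4409-8647-c643274d6727"
  },
  "buckets": [ ... ]
}

and that object is copied into "after" of the next request (the first request is sent without "after"); I stop when "buckets" comes back empty.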
