Get first N values, then aggregate


#1

Good night

I have test documents like the following:

PUT /megacorp/employees/1
{
"text": "text1",
"number": 1
}

PUT /megacorp/employees/2
{
"text": "text1",
"number": 2
}

PUT /megacorp/employees/3
{
"text": "text2",
"number": 3
}

PUT /megacorp/employees/4
{
"text": "text2",
"number": 4
}

PUT /megacorp/employees/5
{
"text": "text2",
"number": 5
}

I would like to:

  1. Query for all documents matching a certain criteria (ie. "number" > 0)
  2. Sort results (ie. "number" ASC)
  3. Take the first N results (ie. N=3)
  4. Then aggregate (ie. get the maximum value "number").

I have tried using limit filter, "size" parameter, filter agg... but the result is always the same: although hits array displays 3 elements, both "total" and "max agg" values are 5. How can I limit results then aggregate only the N first results?


(David Pilato) #2

What did you do so far?


#3

Hi dadoonet, thank you for asking!

First thing: I forgot to mention that we are using ElasticSearch 2.4.5 (I know, such an old version).

My attempts so far (In this simple case I want to take only 3 first ordered hits and aggregate those so the max-number agg return 3 instead of 5 (the maximum of all documents)

(1) Original query:

GET /megacorp/_search
{
  "query": {
    "filtered": {
      "filter": {
        "range" : {
          "number": {
            "gte": 0
          }
        }
      },
      "query": {
        "match_all": {}
      }
    }
  },
  "aggs": {
    "max-number": {"max": {"field": "number"}}
  },
  "sort": [
      { "number" : {"order" : "asc"}}
    ]
}

(2) Original query - result

{
  "took": 8,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": null,
    "hits": [
      {
        "_index": "megacorp",
        "_type": "employees",
        "_id": "1",
        "_score": null,
        "_source": {
          "text": "text1",
          "number": 1
        },
        "sort": [
          1
        ]
      },
      {
        "_index": "megacorp",
        "_type": "employees",
        "_id": "2",
        "_score": null,
        "_source": {
          "text": "text1",
          "number": 2
        },
        "sort": [
          2
        ]
      },
      {
        "_index": "megacorp",
        "_type": "employees",
        "_id": "3",
        "_score": null,
        "_source": {
          "text": "text2",
          "number": 3
        },
        "sort": [
          3
        ]
      },
      {
        "_index": "megacorp",
        "_type": "employees",
        "_id": "4",
        "_score": null,
        "_source": {
          "text": "text2",
          "number": 4
        },
        "sort": [
          4
        ]
      },
      {
        "_index": "megacorp",
        "_type": "employees",
        "_id": "5",
        "_score": null,
        "_source": {
          "text": "text2",
          "number": 5
        },
        "sort": [
          5
        ]
      }
    ]
  },
  "aggregations": {
    "max-number": {
      "value": 5
    }
  }
}

(2) Limit filter in aggregation (if I change the limit filter with another condition - ie. number < 3 - I get max = 5, but I need to limit results by number of hits)

GET /megacorp/_search
{
  "query": {
    "filtered": {
      "filter": {
        "range" : {
          "number": {
            "gte": 0
          }
        }
      },
      "query": {
        "match_all": {}
      }
    }
  },
  "aggs": {
    "foo": {
      "filter": {"limit" : {"value" : 2}},
      "aggs": {
        "max-number": {"max": {"field": "number"}}
      }
    }
  },
  "sort": [
      { "number" : {"order" : "asc"}}
    ]
}

(2) Limit filter in aggregation - result:

{
  "took": 13,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": null,
    "hits": [
      {
        "_index": "megacorp",
        "_type": "employees",
        "_id": "1",
        "_score": null,
        "_source": {
          "text": "text1",
          "number": 1
        },
        "sort": [
          1
        ]
      },
      {
        "_index": "megacorp",
        "_type": "employees",
        "_id": "2",
        "_score": null,
        "_source": {
          "text": "text1",
          "number": 2
        },
        "sort": [
          2
        ]
      },
      {
        "_index": "megacorp",
        "_type": "employees",
        "_id": "3",
        "_score": null,
        "_source": {
          "text": "text2",
          "number": 3
        },
        "sort": [
          3
        ]
      },
      {
        "_index": "megacorp",
        "_type": "employees",
        "_id": "4",
        "_score": null,
        "_source": {
          "text": "text2",
          "number": 4
        },
        "sort": [
          4
        ]
      },
      {
        "_index": "megacorp",
        "_type": "employees",
        "_id": "5",
        "_score": null,
        "_source": {
          "text": "text2",
          "number": 5
        },
        "sort": [
          5
        ]
      }
    ]
  },
  "aggregations": {
    "foo": {
      "doc_count": 5,
      "max-number": {
        "value": 5
      }
    }
  }
}

#4

(3) Sampler aggregation

GET /megacorp/_search
{
  "query": {
    "filtered": {
      "filter": {
        "range" : {
          "number": {
            "gte": 0
          }
        }
      },
      "query": {
        "match_all": {}
      }
    }
  },
  "aggs": {
     "bestDocs": {
         "sampler": {
              "shard_size": 3
         },
         "aggs": {
                "max_number" : { "max" : { "field" : "number" } }
         }
      }
  },
  "sort": [
        { "number" : {"order" : "desc"}}
      ]
}

(3) Sampler aggregation - result:

{
  "took": 13,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": null,
    "hits": [
      {
        "_index": "megacorp",
        "_type": "employees",
        "_id": "5",
        "_score": null,
        "_source": {
          "text": "text2",
          "number": 5
        },
        "sort": [
          5
        ]
      },
      {
        "_index": "megacorp",
        "_type": "employees",
        "_id": "4",
        "_score": null,
        "_source": {
          "text": "text2",
          "number": 4
        },
        "sort": [
          4
        ]
      },
      {
        "_index": "megacorp",
        "_type": "employees",
        "_id": "3",
        "_score": null,
        "_source": {
          "text": "text2",
          "number": 3
        },
        "sort": [
          3
        ]
      },
      {
        "_index": "megacorp",
        "_type": "employees",
        "_id": "2",
        "_score": null,
        "_source": {
          "text": "text1",
          "number": 2
        },
        "sort": [
          2
        ]
      },
      {
        "_index": "megacorp",
        "_type": "employees",
        "_id": "1",
        "_score": null,
        "_source": {
          "text": "text1",
          "number": 1
        },
        "sort": [
          1
        ]
      }
    ]
  },
  "aggregations": {
    "bestDocs": {
      "doc_count": 5,
      "max_number": {
        "value": 5
      }
    }
  }
}

#5

(4) Terminate after + size (query params)

POST /megacorp/_search?terminate_after=3&size=3
{
  "query": {
    "match_all": {}
  },
  "sort": [
      { "number" : {"order" : "asc"}}
    ],
  "aggs": {
    "max-number": {
      "max": {"field": "number"}
    }
  }
}

(4) Terminate after + size (query params) - result:

{
  "took": 19,
  "timed_out": false,
  "terminated_early": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": null,
    "hits": [
      {
        "_index": "megacorp",
        "_type": "employees",
        "_id": "1",
        "_score": null,
        "_source": {
          "text": "text1",
          "number": 1
        },
        "sort": [
          1
        ]
      },
      {
        "_index": "megacorp",
        "_type": "employees",
        "_id": "2",
        "_score": null,
        "_source": {
          "text": "text1",
          "number": 2
        },
        "sort": [
          2
        ]
      },
      {
        "_index": "megacorp",
        "_type": "employees",
        "_id": "3",
        "_score": null,
        "_source": {
          "text": "text2",
          "number": 3
        },
        "sort": [
          3
        ]
      }
    ]
  },
  "aggregations": {
    "max-number": {
      "value": 5
    }
  }
}

(5) Size in body

POST /megacorp/_search
{
  "size": 3,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "max-number": {
      "max": {"field": "number"}
    }
  }
}

(5) Size in body - result

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 1,
    "hits": [
      {
        "_index": "megacorp",
        "_type": "employees",
        "_id": "5",
        "_score": 1,
        "_source": {
          "text": "text2",
          "number": 5
        }
      },
      {
        "_index": "megacorp",
        "_type": "employees",
        "_id": "2",
        "_score": 1,
        "_source": {
          "text": "text1",
          "number": 2
        }
      },
      {
        "_index": "megacorp",
        "_type": "employees",
        "_id": "4",
        "_score": 1,
        "_source": {
          "text": "text2",
          "number": 4
        }
      }
    ]
  },
  "aggregations": {
    "max-number": {
      "value": 5
    }
  }
}

Thank you!


(David Pilato) #6

Why does the 1st query is not correct in you case?

Do you mean that you want to compute aggregations only on top of a part of the resultset but not the whole resultset?


#7

Yes, I want to add a limit to the first query, so first I get the N first hits from the resultset and then apply a aggregation on them.


(David Pilato) #8

I don't know how to do this apart doing that on client side as this looks super easy to solve on this side.


#9

Hi dadoonet

We indeed ended doing a little workaround on client side.

Thank you!