Filter index by max field value

Ok, this seems easy from a relational perspective, but I'm becoming unable to achieve it on ES:

I have an index where, among others, there is a numeric field called year. I want to filter my index data, retrieving only those documents which year is equal to the maximum value of year whitin the whole index.

In a relational database using SQL you'd use something more or less like this:

select * from index where year = (select max(year) from index);

On ES I know that I can get my maximum year value through an aggregation:

{
  "aggs": {
    "maxYear": {
      "max": {
        "field": "year"
      }
    }
  }

However, I'm being unable to use that maxYear calculated value on the same query. I've been trying to use different pipelines and aggregation combinations to no success.

Also I thought to store my maximum value on a runtime field, but now know really how to obtain that aggregated value on a runtime field.

Any insight on the matter will be much appreciated.

Hello @PMF

What if we sort the bucket with size 1 & only get records for that bucket?

{
  "size": 0,
  "aggs": {
    "year_buckets": {
      "terms": {
        "field": "year",
        "order": { "_key": "desc" },
        "size": 1
      },
      "aggs": {
        "top_docs": {
          "top_hits": {
            "size": 100
          }
        }
      }
    }
  }
}

Thanks!!

Hi @Tortoise , thank you for your response.

Indeed, your proposal seems to retrieve the expected data. The only problem here is that top_hits force to put a limit to the number of elements retrieved.

I need to obtain all possible results. There is any way I can return all results without a top_hits size?

Hello @PMF

You are correct. I believe then we need to use 2 step approach of finding the max year & using that to fetch all the records.

I see we can increase the docs as part of top_hits from 100 to greater value but that is risky & not recommended in Production which will impact the performance/cluster.

Thanks!!

Hello @Tortoise, sorry for leting pass so much time without a response, but I've been abroad and hadn't so much time.

I'm not sure I'm getting what you refer as a 2 step approach. AFAIK, in elastic, is not possible to execute a query on the results of a previous one.

Do you mean to retrieve maximum year on a query and later, with a different query, filter data by the retrieved value? If that so, I'm not very sure this approach will work for my use case...

Would be possible to retrieve max year value with

{
  "size": 0,"aggs": {
    "year_buckets": {
      "terms": {
        "field": "year",
        "order": { 
          "_key": "desc" 
        },
        "size": 1
      }
    }
  }
}

as a runtime field?

Maybe if I store the maximum current year on a runtime field I’d be able to query it on the run when needed.

Yes, the approach described requires 2 separate queries.

You should be able to retrieve the maximum year using a max aggregation on the year field.

Runtime field are calculated on a per document basis so can not contain the maximum across a set of documents.

Hello @PMF

This is not the solution but i see there is one more way where we can fetch the records via Watcher Chain Input, this generally will be used for alerting (if that is the usecase you are looking for) where we want to pass dynamic value as input from output of previous query :

{
  "trigger": {
    "schedule": {
      "interval": "1h"
    }
  },
  "input": {
    "chain": {
      "inputs": [
        {
          "max_year": {
            "search": {
              "request": {
                "indices": ["year_index"],
                "body": {
                  "size": 0,
                  "aggs": {
                    "maxYear": {
                      "max": {
                        "field": "year"
                      }
                    }
                  }
                }
              }
            }
          }
        },
        {
          "latest_year_docs": {
            "search": {
              "request": {
                "indices": ["year_index"],
                "body": {
                  "query": {
                    "term": {
                      "year": {
                        "value": "{{ctx.payload.max_year.aggregations.maxYear.value}}"
                      }
                    }
                  }
                }
              }
            }
          }
        }
      ]
    }
  },
  "actions": {
    "log": {
      "logging": {
        "text": "Documents from max year: {{ctx.payload.latest_year_docs.hits.total}} found."
      }
    }
  }
}

Thanks!!

Thank you both for your kind responses.

I’m starting to think that there is no direct solution for my use case, so I’ll try to look for an alternative solution.

Thank you again.

1 Like