Aggregating by a given item in array

My big documents contain arrays of objects, for example:

_doc : {
   countries: {
	[
		{"country":"czechia", "name":"novak", ...}, 
		{"country":"england", "name":"jonson", ...} 
	]
   },
   ...
}

In my query, I specify a "country"+ a filter query, and need to get the count of each distinct "name".
If I'd do terms aggregation on the "name" field, I would get the count of _docs each name appears in. but I need the count of name at all docs for a given country.

I need the following query to return only names for 'england'

POST /stats/_search?size=0
{
  "query": {
    "term" : {"countries.country" : "england"}
  },
    
  "aggs" : {
      "top-messages": {"terms": {"field": "countries.name","size": 10}}
  }
}

Is that possible without fetching the whole document and doing (huge) aggregation offline?

It's possible, but you'll need to use the nested datatypes instead. Regular json objects (like your example) are "flattened" internally and lose their relationship. E.g. internally that data will be stored as:

{
  countries.country: ["czechia","england"],
  countries.name: ["novak","jonson"]
}

The nested data type internally uses a different storage mechanism to keep the relations between items in an object, so that you can run queries/aggregrations like you requested.

Thank you!
Yes, I did tried that eventually, but since I have more then 200k distinct nested terms, the nested aggregations had too many buckets.
The solution I found is to store everything flat with string prefixes, and use script regexp to query subsets of the tree :

names: ["czechia/novak","czechia/benes","england/jonson",...]

{
"aggs": {
"evt_buckets": {
"terms": {
"field": "names",
"include": "czechia.*"
}
}
}
}

I considered also

Yeah agreed, 200k nested objects would be too many :slight_smile:

I'd probably try to restructure things so that each doc is one of the distinct pairs and they are all related by some kind of identifier. Basically denormalize each relation to it's own document. ES tends to be easier to work with when you're more denormalized... the relational tools are only rough approximations of real RDBMs functionality.

Yes, my case is problematic since I have large documents (2MB each),
Each document contain a list of all events occurred on a single client per week.
I tried to store it flat (_doc = event), but i cannot make queries such as
"for a given group of clients, find significant event id's".
Further reducing the time window will break relations between events.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.