Aggregating by a given item in array

Yonatan_Omer · September 4, 2019, 7:56am

My big documents contain arrays of objects, for example:

_doc : {
   countries: {
	[
		{"country":"czechia", "name":"novak", ...}, 
		{"country":"england", "name":"jonson", ...} 
	]
   },
   ...
}

In my query, I specify a "country"+ a filter query, and need to get the count of each distinct "name".
If I'd do terms aggregation on the "name" field, I would get the count of _docs each name appears in. but I need the count of name at all docs for a given country.

I need the following query to return only names for 'england'

POST /stats/_search?size=0
{
  "query": {
    "term" : {"countries.country" : "england"}
  },
    
  "aggs" : {
      "top-messages": {"terms": {"field": "countries.name","size": 10}}
  }
}

Is that possible without fetching the whole document and doing (huge) aggregation offline?

polyfractal · September 9, 2019, 5:46pm

It's possible, but you'll need to use the nested datatypes instead. Regular json objects (like your example) are "flattened" internally and lose their relationship. E.g. internally that data will be stored as:

{
  countries.country: ["czechia","england"],
  countries.name: ["novak","jonson"]
}

The nested data type internally uses a different storage mechanism to keep the relations between items in an object, so that you can run queries/aggregrations like you requested.

Yonatan_Omer · September 10, 2019, 2:20pm

Thank you!
Yes, I did tried that eventually, but since I have more then 200k distinct nested terms, the nested aggregations had too many buckets.
The solution I found is to store everything flat with string prefixes, and use script regexp to query subsets of the tree :

names: ["czechia/novak","czechia/benes","england/jonson",...]

{
"aggs": {
"evt_buckets": {
"terms": {
"field": "names",
"include": "czechia.*"
}
}
}
}

I considered also

polyfractal · September 12, 2019, 5:17pm

Yeah agreed, 200k nested objects would be too many

I'd probably try to restructure things so that each doc is one of the distinct pairs and they are all related by some kind of identifier. Basically denormalize each relation to it's own document. ES tends to be easier to work with when you're more denormalized... the relational tools are only rough approximations of real RDBMs functionality.

Yonatan_Omer · September 18, 2019, 6:54am

Yes, my case is problematic since I have large documents (2MB each),
Each document contain a list of all events occurred on a single client per week.
I tried to store it flat (_doc = event), but i cannot make queries such as
"for a given group of clients, find significant event id's".
Further reducing the time window will break relations between events.

system · October 16, 2019, 6:54am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Group by terms and get count of nested array property? Elasticsearch	3	1834	January 11, 2020
Aggregate query on filtered array of objects across documents Elasticsearch	1	458	August 19, 2020
Count arrays as an object Elasticsearch	2	332	July 21, 2023
Aggregate on array objects Elasticsearch	1	344	July 30, 2018
Elasticsearch aggregations, counting nested objects Elasticsearch	3	2922	June 20, 2018

Aggregating by a given item in array

Related topics