Elasticsearch aggregations, counting nested objects


#1

I have an index with nested objects and am trying to derive some statistics about the data, but am having trouble doing so.

The documents look like this simplified version:

{
  "rn": 1,
  "t": "X",
  "a": [
    {
	  "r": [
	    {
		  "rt": "A",
		  "p": [
		    {
			  "sn": 1
			},
			{
			  "sn": 2
			}
		  ]
		},
		{
		  "rt": "B",
		  "p": [
		    {
			  "sn": 1
			},
			{
			  "sn": 2
			}
		  ]
		},
		{
		  "rt": "C",
		  "p": [
		    {
			  "sn": 1
			},
			{
			  "sn": 2
			}
		  ]
		}
      ]
    }
  ]
}

{
  "rn": 2,
  "t": "X",
  "a": [
    {
	  "r": [
	    {
		  "rt": "A",
		  "p": [
		    {
			  "sn": 1
			},
			{
			  "sn": 2
			}
		  ]
		},
		{
		  "rt": "B",
		  "p": [
		    {
			  "sn": 1
			},
			{
			  "sn": 2
			}
		  ]
		}
      ]
    }
  ]
}

The arrays, a, r and p, are mapped as nested objects.

I want to write an aggregation to return:

  • The maximum number of occurrences of r in a document (3 in this example)
  • The average number of occurrences of r in a document (2.5 in this example)
  • Determine the above two within the buckets for terms in t (just X in this example, but there will be a dozen or so other terms; I can just run a query a dozen times with different values for a filter on t if needed)

In a non-nested version of this, I used a painless script to count the elements in each document and ran stats over the result. But now, since r is nested, I can't just count the elements in r because r is now split into separate document. I think I need to have an agg that counts the nested documents per document, then another agg to run stats over those counts. But I don't see how to do that. Do I need to basically make a bucket for each document? There are nearly 1B documents in the index.

I am having trouble even working out which aggs I can use, and in what order. I have tried variations of value_count, max_buckets, reverse_agg and a number of others, without success and am clearly missing some understanding of how this agg should work.

Any help would be greatly appreciated.


(Rahul Desai) #2

There are 2 approaches that come to mind:

1 - Keep a count of how many nested documents each nested type has in the root of the document. Then you can use a sum aggregation to just sum the values stored in the root document.

2 - Store an identity field or just a simple static int column which has the value 1 inside each nested document and then use a nested aggregation to sum up all the 1's

Hope that helps.


#3

Thanks @RahulD. These are good ideas. However, this is an ad hoc query to gather some stats about our data and I can't add a field to the documents. But I may be able to create a new index using Logstash to derive this info, so thanks again. Counting the elements of the r array in Logstash should be straight forward, but I also need to do a similar query for the p array, to get a total count of p elements per document (so 6 and 4 for the two example documents above) - not sure how easy that is to do.

But I'm surprised we can't seem to access a count of an array after it's converted to nested, that seems like a useful function.


(system) #4

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.