Pipeline aggregation to compute histogram of average values of a field, bucketed by term?

I'm trying to figure out how to compute a histogram of the average values of a field within "terms" buckets. I think this is probably a case for a sibling pipeline aggregation, but I can't figure out the approach. For example, consider an index which stores records like this, each describing how many goals a soccer player scored in a game: {"player_name": , "goals_scored": , ... }. I can see how to compute the average of "goals_scored" for each player by nesting an "avg" aggregation inside a "terms" aggregation on "player_name", like this:

{
	"aggs" : {
		"group_by_name" : {
			"terms": { "field": "player_name.keyword" },
			"aggs": {
				"avg_goals": {
					"avg": { "field": "goals_scored" }
				}
			}
		}
	}
}

But can I then compute a histogram of those per-player bucket averages? In other words, can I see the distribution of the average goals-scored value across all players?

Ah. This post, https://github.com/elastic/elasticsearch/issues/21097, seems to confirm that there is no way to pipeline a histogram of aggregated data like this. It references two earlier discussions: https://github.com/elastic/elasticsearch/issues/17590 and https://github.com/elastic/elasticsearch/issues/22537 . My case is quite similar: an "event-based" index of soccer scores which I want to "fuse" at query time to derive higher-level analytics. The advice seems to be to instead maintain an additional "entity-based" index, where some back-end process collates and stores the relevant "fused" data. In my case, I interpret that to mean an index which maintains the "average goals scored per game" for each player. That index can then be used as the basis of a regular "histogram" aggregation in Elasticsearch.
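If I've understood the suggestion, the entity-centric index would hold one document per player, and a plain "histogram" aggregation over it would then give me the distribution I'm after. Something like this, where the index name, field names, and interval are just placeholders I've invented:

PUT player-averages/_doc/1
{
	"player_name": "Alice",
	"avg_goals_scored": 0.7
}

POST player-averages/_search?size=0
{
	"aggs": {
		"avg_goals_distribution": {
			"histogram": {
				"field": "avg_goals_scored",
				"interval": 0.25
			}
		}
	}
}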

Can someone confirm that I've interpreted all this correctly? I confess I'm a bit disappointed that Elasticsearch apparently can't do this natively, but the posts in the links above seem adamant that this functionality can't be implemented in a satisfactory way at scale. That's unfortunate, as it pushes extra complexity onto the user's back end to maintain the entity-centric index (in my case, some code to compute and update the per-player average goals scored).

We now have dataframes (being renamed to "transforms") to help you manage the creation of entity-centric indexes.

Transforms look pretty cool, and should definitely ease the burden of maintaining the "summary" index. Thanks for pointing them out. I just tried my toy "soccer" example, and it worked fine, producing an index with "name.keyword" and "goals_scored.avg" fields (one record per player).

However, I've run into a slight inconvenience when putting "saved search" visualizations based on these two indices in the same dashboard and having them filter each other on the common "name" data. The saved search for the old index uses the text field "name", whereas the new index uses "name.keyword". If I click "+" next to one of the names in the new index's saved-search visualization, it activates a "name.keyword: " filter, which does filter the data in the old index's visualization. So far, so good. The reverse is not true, though: clicking one of the names in the old index's saved-search visualization activates a "name: " filter, which doesn't match any records in the new index's display (which only knows about "name.keyword").

If I do want this two-way filtering to work, I guess I might have to hack the saved search for the old index so that it too uses "name.keyword" (export the search, edit it, and re-import it). Or is there another solution?
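For reference, the pivot transform I set up for the toy example was roughly along these lines. This is reconstructed from memory, so treat it as a sketch: the index names are placeholders, the endpoint was still _data_frame/transforms in the release I tried (newer releases use _transform), and the destination field names the wizard generates may differ slightly from what I've typed here:

PUT _data_frame/transforms/player_goal_averages
{
	"source": { "index": "soccer-games" },
	"dest": { "index": "player-averages" },
	"pivot": {
		"group_by": {
			"name": { "terms": { "field": "player_name.keyword" } }
		},
		"aggregations": {
			"goals_scored.avg": { "avg": { "field": "goals_scored" } }
		}
	}
}

POST _data_frame/transforms/player_goal_averages/_start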

Probably more of a question for the Kibana forum, but my assumption is that you'd need both indices to have the same mapping for the name field.

Right, I'll pose this question in the Kibana forum. Meanwhile, I'm at least content that I'm following best practices with the general approach of the entity-centric ancillary index. Thanks for your help!


Mark, AFAICS there is one significant drawback to the "entity-centric index" approach: it forces you to decide up front how you are going to aggregate the data from which the summary statistics are produced. To return to my soccer example, I might want to produce a histogram of the per-player average scores over various different time scales (last month, last year, etc.). You'd need to decide on those time scales ahead of time in order to produce the associated statistics in the entity-centric index. But what if the selection of those time scales was being driven by user exploration in Kibana, possibly with extra filtering parameters thrown in too? It seems to me that the entity-centric index approach is only suitable for situations with limited degrees of freedom. Would you agree?

Ultimately it's about data locality.
If you have related data there are 3 levels of physical location:

  1. Same document (entity-centric indexes)
  2. Same shard/node (single index with document routing)
  3. Remote (related data spread across nodes)

Many users log events into multiple time-based indices, which means they are in the worst scenario (option 3) for doing any form of behavioural analysis.
Some things become more manageable if you go for option 2: scripts/aggs can operate on documents grouped by a key and know that all the related data for that key is held on the machine where they run (we start to run into accuracy issues when related data is spread across multiple machines). The downside of option 2 is that this locality is lost if you use time-based indices.
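For your data, option 2 would mean routing every game document by player, so that all of a player's events land on the same shard. A minimal sketch (index name and values are just placeholders):

PUT soccer-games/_doc/1?routing=Alice
{
	"player_name": "Alice",
	"goals_scored": 2
}

Searches that only concern one player can then pass the same routing value and be served entirely by that shard.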

You might be able to hold multiple summaries of the raw data on the docs or derive them on the fly using scripts (they would at least be operating on localised data).
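As a sketch of the first idea, the entity-centric document itself could carry several pre-computed windows (the field names here are invented for illustration):

PUT player-averages/_doc/alice
{
	"player_name": "Alice",
	"avg_goals_last_month": 0.5,
	"avg_goals_last_year": 0.8,
	"avg_goals_all_time": 0.7
}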

In my (real) case, we have option 2 - not a vast data set, but one which we'd like to analyze flexibly. It just seems really unfortunate that ES/Kibana can effortlessly produce some very useful summary data (averaging a field according to user-driven aggregations), but is missing the "meta" level where I'd like to further aggregate that summary data (in a histogram). Because of this, we're forced to consider these awkward and less flexible workarounds. It appears to me that this functionality would be very appropriate and useful to at least some users, although I acknowledge your concerns about general usage.

Our working assumption is that data is always distributed - we only offer features that are designed to work at scale.
That does mean that some forms of analysis require custom scripts where we have found it is not feasible to offer a declarative way to do this reliably at scale.

You can script what you want using a scripted metric, and then use Vega to visualize the output.
You can have Vega pay attention to context/filters and the selected time range if you so desire.
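Very roughly, the scripted metric could compute the per-player averages itself and bin them into a histogram during the reduce phase. An untested sketch along these lines (Kibana Dev Tools syntax, assuming the toy mapping from earlier in the thread and a bucket width of one goal):

POST soccer-games/_search?size=0
{
	"aggs": {
		"avg_goals_histogram": {
			"scripted_metric": {
				"init_script": "state.sums = [:]; state.counts = [:]",
				"map_script": """
					String player = doc['player_name.keyword'].value;
					double goals = doc['goals_scored'].value;
					state.sums[player] = (state.sums.containsKey(player) ? state.sums[player] : 0.0) + goals;
					state.counts[player] = (state.counts.containsKey(player) ? state.counts[player] : 0L) + 1L;
				""",
				"combine_script": "return state",
				"reduce_script": """
					// Merge the per-shard partial sums/counts, then bucket each player's average.
					Map sums = [:]; Map counts = [:];
					for (s in states) {
						if (s == null) { continue; }
						for (entry in s.sums.entrySet()) {
							sums[entry.getKey()] = (sums.containsKey(entry.getKey()) ? sums[entry.getKey()] : 0.0) + entry.getValue();
						}
						for (entry in s.counts.entrySet()) {
							counts[entry.getKey()] = (counts.containsKey(entry.getKey()) ? counts[entry.getKey()] : 0L) + entry.getValue();
						}
					}
					Map histogram = new TreeMap();
					for (player in sums.keySet()) {
						double avg = sums[player] / counts[player];
						long bucket = (long) Math.floor(avg); // histogram interval of 1 goal
						histogram[bucket] = (histogram.containsKey(bucket) ? histogram[bucket] : 0L) + 1L;
					}
					return histogram;
				"""
			}
		}
	}
}

A Vega visualization can then issue that search itself and read the resulting map out of the aggregation response.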

I would really love to be able to connect the scripted-metric output directly to Discover or Visualize.
I realize that _source + index is used to "help" Discover/Visualize, but I'd be happy to supply the equivalent of _source either via a GUI or even in the output of the scripted metric.

I agree that there is one more abstraction level of visualization that would be useful.
I personally don't see why scaling justifies the claim that it is "not feasible to offer a declarative way to do this reliably at scale".

A scripted metric is just a map-reduce in disguise and map-reduces are the tool of choice for many folks analyzing large data sets. Maybe the objection is that there is no easy "declarative way".

Map-reduce in the Hadoop sense is based on streaming data and is not constrained by RAM. In the scripted-metric agg sense it is limited to what data can reasonably fit into the memory required for a single request/response exchange.
