Aggregating across indices

mc_teo · March 26, 2020, 2:20pm

Hi everyone,

I've been dabbling with Elasticsearch on and off for a year, but I've recently dusted off an old project and wanted to retrofit it with Elasticsearch.

One thing that has me very stumped is being able to aggregate across indices, using the result in your predicate, similar to using a "HAVING" clause in SQL-land.

Here is a worked example. Given the following sample data:

POST /_bulk
{ "index": { "_index": "user-index", "_type": "user", "_id": "111" }}
{ "name": "frodo" }
{ "index": { "_index": "user-index", "_type": "user", "_id": "222" }}
{ "name": "samwise" }
{ "index": { "_index": "user-index", "_type": "user", "_id": "333" }}
{ "name": "gandalf" }
{ "index": { "_index": "user-index", "_type": "user", "_id": "444" }}
{ "name": "sauron" }

POST /_bulk
{ "index": { "_index": "order-index", "_type": "order", "_id": "1" }}
{ "user_id": "111", "status": "shipped" }
{ "index": { "_index": "order-index", "_type": "order", "_id": "2" }}
{ "user_id": "222", "status": "pending" }
{ "index": { "_index": "order-index", "_type": "order", "_id": "3" }}
{ "user_id": "111", "status": "delivered" }
{ "index": { "_index": "order-index", "_type": "order", "_id": "4" }}
{ "user_id": "333", "status": "delivered" }

What we want is a paginated list of users, with status counts, with (pending + shipped + delivered) > 0, with an “f” in their name. (This simulates the abilities of the UI this powers, where end users can select a ton of filters to drill down into the data. Other use cases are expanded upon here.)

The old query that did this looks like:

SELECT u.*,
  COUNT(CASE WHEN o.status = 'pending' THEN 1 END) pending,
  COUNT(CASE WHEN o.status = 'shipped' THEN 1 END) shipped,
  COUNT(CASE WHEN o.status = 'delivered' THEN 1 END) delivered
FROM users u
JOIN orders o
  ON u.id = o.user_id
WHERE LOWER(u.name) LIKE '%f%'
GROUP BY u.id
HAVING (pending + shipped + delivered) > 0
;

(A fully worked example in SQL Fiddle can be found here.)

I've made a few stabs at it myself, and originally the only method I had was to pre-calculate the "pending", "shipped", and "delivered" fields, for each user in the user-index, then every time a status changed, decrement and increment the relevant fields. This feels messy, and adds the potential race conditions if there are multiple updates happening concurrently.

My second stab was to get the page of users, then in application code feed the list of user ids into a second query against the orders. This returns the order counts correctly, but doesn't let us filter based on the order counts (i.e. the HAVING clause). I could add an extra layer where I (1) get page x of users matching conditions, (2) aggregate relevant orders, (3) drop users that now don't match, (4) if a page isn't complete, return to (1). This feels a little messy, but I'm sure I'm missing something major in ES that's mentally blocking me.

(Stop using Elasticsearch and go use a SQL database is also a valid outcome from this discussion )

Thanks

warkolm · March 30, 2020, 12:01am

This is a join, which ES cannot do.

The recommended option would be to either change the user_id to the textual name value, or add that text name along with the id in each document.

mc_teo · March 30, 2020, 9:19am

Hi @warkolm, thanks for the reply.

The sample of data I have above has many fields removed for ease of reading; in reality there are another 10-20 fields added per item. I was trying to keep this redundancy to a minimum, but I guess disk space is cheap, and the more I think of this, the better it sounds.

I've tried nesting the orders beneath the user index, but while this gives you a slightly nicer method of aggregation, doesn't let you use the aggregated value in my original predicate.

Thanks

system · April 27, 2020, 9:19am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Filtering by a aggregation (like SQL having clause) Elasticsearch	4	6036	November 5, 2018
Aggregation across multiple indices Elasticsearch	9	8424	June 17, 2021
Aggregation across indices Elasticsearch	2	304	July 6, 2017
ElasticSearch Having-Clause? Elasticsearch	2	2889	July 6, 2017
Aggregations across multiple indices Elasticsearch	3	6126	July 6, 2017

Aggregating across indices

Related topics