Aggregating across indices

Hi everyone,

I've been dabbling with Elasticsearch on and off for a year, but I've recently dusted off an old project and wanted to retrofit it with Elasticsearch.

One thing that has me very stumped is being able to aggregate across indices, using the result in your predicate, similar to using a "HAVING" clause in SQL-land.

Here is a worked example. Given the following sample data:

POST /_bulk
{ "index": { "_index": "user-index", "_type": "user", "_id": "111" }}
{ "name": "frodo" }
{ "index": { "_index": "user-index", "_type": "user", "_id": "222" }}
{ "name": "samwise" }
{ "index": { "_index": "user-index", "_type": "user", "_id": "333" }}
{ "name": "gandalf" }
{ "index": { "_index": "user-index", "_type": "user", "_id": "444" }}
{ "name": "sauron" }
POST /_bulk
{ "index": { "_index": "order-index", "_type": "order", "_id": "1" }}
{ "user_id": "111", "status": "shipped" }
{ "index": { "_index": "order-index", "_type": "order", "_id": "2" }}
{ "user_id": "222", "status": "pending" }
{ "index": { "_index": "order-index", "_type": "order", "_id": "3" }}
{ "user_id": "111", "status": "delivered" }
{ "index": { "_index": "order-index", "_type": "order", "_id": "4" }}
{ "user_id": "333", "status": "delivered" }

What we want is a paginated list of users, with status counts, with (pending + shipped + delivered) > 0, with an “f” in their name. (This simulates the abilities of the UI this powers, where end users can select a ton of filters to drill down into the data. Other use cases are expanded upon here.)

The old query that did this looks like:

SELECT u.*,
  COUNT(CASE WHEN o.status = 'pending' THEN 1 END) pending,
  COUNT(CASE WHEN o.status = 'shipped' THEN 1 END) shipped,
  COUNT(CASE WHEN o.status = 'delivered' THEN 1 END) delivered
FROM users u
JOIN orders o
  ON u.id = o.user_id
WHERE LOWER(u.name) LIKE '%f%'
GROUP BY u.id
HAVING (pending + shipped + delivered) > 0
;

(A fully worked example in SQL Fiddle can be found here.)

I've made a few stabs at it myself, and originally the only method I had was to pre-calculate the "pending", "shipped", and "delivered" fields, for each user in the user-index, then every time a status changed, decrement and increment the relevant fields. This feels messy, and adds the potential race conditions if there are multiple updates happening concurrently.

My second stab was to get the page of users, then in application code feed the list of user ids into a second query against the orders. This returns the order counts correctly, but doesn't let us filter based on the order counts (i.e. the HAVING clause). I could add an extra layer where I (1) get page x of users matching conditions, (2) aggregate relevant orders, (3) drop users that now don't match, (4) if a page isn't complete, return to (1). This feels a little messy, but I'm sure I'm missing something major in ES that's mentally blocking me.

(Stop using Elasticsearch and go use a SQL database is also a valid outcome from this discussion :smile:)

Thanks

This is a join, which ES cannot do.

The recommended option would be to either change the user_id to the textual name value, or add that text name along with the id in each document.

Hi @warkolm, thanks for the reply.

The sample of data I have above has many fields removed for ease of reading; in reality there are another 10-20 fields added per item. I was trying to keep this redundancy to a minimum, but I guess disk space is cheap, and the more I think of this, the better it sounds.

I've tried nesting the orders beneath the user index, but while this gives you a slightly nicer method of aggregation, doesn't let you use the aggregated value in my original predicate.

Thanks

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.