Hi everyone,
I've been dabbling with Elasticsearch on and off for a year, but I've recently dusted off an old project and wanted to retrofit it with Elasticsearch.
One thing that has me very stumped is being able to aggregate across indices, using the result in your predicate, similar to using a "HAVING" clause in SQL-land.
Here is a worked example. Given the following sample data:
POST /_bulk
{ "index": { "_index": "user-index", "_type": "user", "_id": "111" }}
{ "name": "frodo" }
{ "index": { "_index": "user-index", "_type": "user", "_id": "222" }}
{ "name": "samwise" }
{ "index": { "_index": "user-index", "_type": "user", "_id": "333" }}
{ "name": "gandalf" }
{ "index": { "_index": "user-index", "_type": "user", "_id": "444" }}
{ "name": "sauron" }
POST /_bulk
{ "index": { "_index": "order-index", "_type": "order", "_id": "1" }}
{ "user_id": "111", "status": "shipped" }
{ "index": { "_index": "order-index", "_type": "order", "_id": "2" }}
{ "user_id": "222", "status": "pending" }
{ "index": { "_index": "order-index", "_type": "order", "_id": "3" }}
{ "user_id": "111", "status": "delivered" }
{ "index": { "_index": "order-index", "_type": "order", "_id": "4" }}
{ "user_id": "333", "status": "delivered" }
What we want is a paginated list of users, with status counts, with (pending + shipped + delivered) > 0, with an “f” in their name. (This simulates the abilities of the UI this powers, where end users can select a ton of filters to drill down into the data. Other use cases are expanded upon here.)
The old query that did this looks like:
SELECT u.*,
COUNT(CASE WHEN o.status = 'pending' THEN 1 END) pending,
COUNT(CASE WHEN o.status = 'shipped' THEN 1 END) shipped,
COUNT(CASE WHEN o.status = 'delivered' THEN 1 END) delivered
FROM users u
JOIN orders o
ON u.id = o.user_id
WHERE LOWER(u.name) LIKE '%f%'
GROUP BY u.id
HAVING (pending + shipped + delivered) > 0
;
(A fully worked example in SQL Fiddle can be found here.)
I've made a few stabs at it myself, and originally the only method I had was to pre-calculate the "pending", "shipped", and "delivered" fields, for each user in the user-index, then every time a status changed, decrement and increment the relevant fields. This feels messy, and adds the potential race conditions if there are multiple updates happening concurrently.
My second stab was to get the page of users, then in application code feed the list of user ids into a second query against the orders. This returns the order counts correctly, but doesn't let us filter based on the order counts (i.e. the HAVING clause). I could add an extra layer where I (1) get page x of users matching conditions, (2) aggregate relevant orders, (3) drop users that now don't match, (4) if a page isn't complete, return to (1). This feels a little messy, but I'm sure I'm missing something major in ES that's mentally blocking me.
(Stop using Elasticsearch and go use a SQL database is also a valid outcome from this discussion )
Thanks