Querying results of an aggregation


(diogo.gsa) #1

Is there a way to, in a single Elasticsearch query expression, perform both an aggregation and then execute a bool query over the resultset produced by that aggregation?

For instance, consider the following dataset and data transformations. I want elastic to execute:

1) Initial Dataset (content within the Elasticsearch index):

{ "pivot_id":  "A", "version": 1, "value": "x" }
{ "pivot_id":  "A", "version": 2, "value": "y" }
{ "pivot_id":  "A", "version": 3, "value": "z" }
{ "pivot_id":  "B", "version": 1, "value": "w" }
{ "pivot_id":  "B", "version": 2, "value": "k" }

2) Aggregate by pivot_id:

{ "pivot_id":  "A", "version": 1, "value": "x" ] // bucket: A
{ "pivot_id":  "A", "version": 2, "value": "y" }
{ "pivot_id":  "A", "version": 3, "value": "z" }

{ "pivot_id":  "B", "version": 1, "value": "w" } // bucket: B
{ "pivot_id":  "B", "version": 2, "value": "k" }

3) Pick the most recent version of each pivot_id/bucket:

{ "pivot_id":  "A", "version": 1, "value": "x" }
{ "pivot_id":  "B", "version": 1, "value": "w" }

4) And then execute a simple bool query with pagination and sorting over the dataset in 3)

That is, how can I in a single Elasticsearch query expression perform the above sequence/pipeline of data transformations 1->2->3->4?

I've tried the below approach but I had no success because the query bool is being executed to feed the aggs caluse, and not to be used over the resultset produced by the trx_id_group aggs.
That is, elastic is executing: 4->1->2->3, and not 1->2->3->4 as I wish.

Is there a way to enforce bool queries to be applied on top of aggregated results (aggs->query), and not (query->aggs)?

I'm using Elasticsearch 6.2.4.

My unsuccessful attempt:

GET /index/_search
{
	"from": 0,
	"size": 10,

	"query": {
		"bool": {
			"filter": [{
				"terms": {
					"value.keyword": ["y", "w"]
				}
			}]
		}
	},

	"aggs": {
		"trx_id_group": {
			"terms": {
				"field": "trx_id.keyword"
			},
			"aggs": {
				"most_recent_per_group": {
					"top_hits": {
						"size": 1,
						"sort": [{
							"version": {
								"order": "desc"
							}
						}]
					}
				}
			}
		}
	},
	"sort": [{
		"version": "asc"
	}]
}

For this query I was expecting no results, since the most recent versions of documents A and B doesn't have the y and w values. Yet the query is fetching and grouping events A.v2 and B.v1.

Last note: In SQL the above query would be:

SELECT *
FROM (SELECT a.PIVOT_ID, a.VERSION, a.VALUE
	      FROM ES_INDEX AS a
	      INNER JOIN (SELECT PIVOT_ID, MAX(VERSION) AS VERSION
		   		      FROM ES_INDEX
			          GROUP BY PIVOT_ID) AS b
	      ON a.PIVOT_ID = b.PIVOT_ID AND a.VERSION = b.VERSION) AS TRX_ID_MOST_RECENT_VERSION
 WHERE // ... your custom SEARCH clauses here ...

Thank you all! :slight_smile:


(Mark Harwood) #2

No, queries are designed to operate on physical docs.
You can use aggregations to perform grouping, sorting and filtering on document matches found in queries but these aggregate operations can be complicated by the fact that in distributed systems, related data is often spread across multiple machines, each of which may have returned partial results. Increasing the numbers of results returned from each machine improves accuracy but with added cost.
If you want optimise both query performance and accuracy you can bring related data closer together at index time by maintaining entity-centric indexes


(system) #3

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.