ES, aggregation and pagination


(bob) #1

Hi everyone,

I'm currently working on an ES based project (ES version in use 1.0.1) and I'm fiddling with the ES syntax to make aggregations work.
Here is what I'm trying to do : I have a set of versioned documents on which I'm trying to perform a search query with pagination, having the following two steps :
-for each document, get the highest version number (1st inner bucket)
-then for each document, group by their id and sort the resulting set with the metric Score (from the inner bucket) in order to maintain a decent pagination

Anyways, here is the search query :

{
"fields":[

],
"size":0,
"from":0,
"sort":[
"_score"
],
"query":{
"filtered":{
"query":{
"bool":{
"should":[
{
"multi_match":{
"query":"general",
"fields":[
"Title.Value.original^4",
"Title.Value.partial"
]
}
}
],
"minimum_should_match":1
}
}
}
},
"aggregations":{
"Count":{
"cardinality":{
"field":"IDDocument"
}
},
"IDDocumentPage":{
"terms":{
"field":"IDDocument",
"order":{
"IDDocument>Score":"desc"
}
},
"aggregations":{
"IDDocument":{
"terms":{
"field":"IDDocument",
"order":{
"Score":"desc"
}
},
"aggregations":{
"VersionNumber":{
"max":{
"field":"VersionNumber"
}
},
"Score":{
"max":{
"script":"_doc.score"
}
}
}
}
}
}
}
}

Which fails, yielding the following error :

AggregationExecutionException[terms aggregation [IDDocumentPage] is configured with a sub-aggregation order [IDDocument>Score] but no sub aggregation with this name is configured];

Am I missing something here, syntax-wise ?

Here is another attempt based on a filter range syntax, which runs alright but doesn't produce any result.

{
"fields":[

],
"size":0,
"from":0,
"sort":[
"_score"
],
"query":{
"filtered":{
"query":{
"bool":{
"should":[
{
"multi_match":{
"query":"general",
"fields":[
"Title.Value.original^4",
"Title.Value.partial"
]
}
}
],
"minimum_should_match":1
}
}
}
},
"aggregations":{
"Count":{
"cardinality":{
"field":"IDDocument"
}
},
"IDDocumentPage":{
"filter" : { "range" : { "IDDocument>Score" : { "gt" : 0 } } },
"aggregations":{
"IDDocument":{
"terms":{
"field":"IDDocument",
"order":{
"Score":"desc"
}
},
"aggregations":{
"VersionNumber":{
"max":{
"field":"VersionNumber"
}
},
"Score":{
"max":{
"script":"_doc.score"
}
}
}
}
}
}
}
}

Any idea on what I'm doing wrong here ? And more generally what would be the (best) way to perform a group by and pagination in ES at the same time ?


(Adrien Grand) #2

Hi Bob,

Although you reported using Elasticsearch 1.0.1, you seem to be using
features that are only available in Elasticsearch 1.1.0: the cardinality
aggregation and the ability to sort according by several levels of nested
aggregations. That might partially explain the issue that you are
encoutering?

Regarding pagination of the terms aggregation (which is the closest thing
we have to a GROUP BY), this is not supported.

On Wed, Mar 26, 2014 at 9:39 AM, bob bkoenig@groupama-ge.fr wrote:

Hi everyone,

I'm currently working on an ES based project (ES version in use 1.0.1) and
I'm fiddling with the ES syntax to make aggregations work.
Here is what I'm trying to do : I have a set of versioned documents on
which
I'm trying to perform a search query with pagination, having the following
two steps :
-for each document, get the highest version number (1st inner bucket)
-then for each document, group by their id and sort the resulting set with
the metric Score (from the inner bucket) in order to maintain a decent
pagination

Anyways, here is the search query :

{
"fields":[

],
"size":0,
"from":0,
"sort":[
"_score"
],
"query":{
"filtered":{
"query":{
"bool":{
"should":[
{
"multi_match":{
"query":"general",
"fields":[
"Title.Value.original^4",
"Title.Value.partial"
]
}
}
],
"minimum_should_match":1
}
}
}
},
"aggregations":{
"Count":{
"cardinality":{
"field":"IDDocument"
}
},
"IDDocumentPage":{
"terms":{
"field":"IDDocument",
"order":{
"IDDocument>Score":"desc"
}
},
"aggregations":{
"IDDocument":{
"terms":{
"field":"IDDocument",
"order":{
"Score":"desc"
}
},
"aggregations":{
"VersionNumber":{
"max":{
"field":"VersionNumber"
}
},
"Score":{
"max":{
"script":"_doc.score"
}
}
}
}
}
}
}
}

Which fails, yielding the following error :

AggregationExecutionException[terms aggregation [IDDocumentPage] is
configured with a sub-aggregation order [IDDocument>Score] but no sub
aggregation with this name is configured];

Am I missing something here, syntax-wise ?

Here is another attempt based on a filter range syntax, which runs alright
but doesn't produce any result.

{
"fields":[

],
"size":0,
"from":0,
"sort":[
"_score"
],
"query":{
"filtered":{
"query":{
"bool":{
"should":[
{
"multi_match":{
"query":"general",
"fields":[
"Title.Value.original^4",
"Title.Value.partial"
]
}
}
],
"minimum_should_match":1
}
}
}
},
"aggregations":{
"Count":{
"cardinality":{
"field":"IDDocument"
}
},
"IDDocumentPage":{
"filter" : { "range" : { "IDDocument>Score" : { "gt" : 0 }
} },
"aggregations":{
"IDDocument":{
"terms":{
"field":"IDDocument",
"order":{
"Score":"desc"
}
},
"aggregations":{
"VersionNumber":{
"max":{
"field":"VersionNumber"
}
},
"Score":{
"max":{
"script":"_doc.score"
}
}
}
}
}
}
}
}

Any idea on what I'm doing wrong here ? And more generally what would be
the
(best) way to perform a group by and pagination in ES at the same time ?

--
View this message in context:
http://elasticsearch-users.115913.n3.nabble.com/ES-aggregation-and-pagination-tp4052774.html
Sent from the ElasticSearch Users mailing list archive at Nabble.com.

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/1395823190787-4052774.post%40n3.nabble.com
.
For more options, visit https://groups.google.com/d/optout.

--
Adrien Grand

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAL6Z4j5MPJmp06g1P0PrUj-ECX4JAZzhAViNzSCP2XjoDva7eQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


(bob) #3

Hi Adrien,

thanks for your answer. You are right about cardinality, not available on the 1.0.1, but I'm using a plugin to cover this part.
I switched my ES version to the 1.1.0 and managed to run the following query :

{
"fields":[

],
"size":0,
"from":0,
"sort":[
"_score"
],
"query":{
"filtered":{
"query":{
"bool":{
"should":[
{
"multi_match":{
"query":"general",
"fields":[
"Titre.Valeur.original^4",
"Titre.Valeur.partial",
"Resume.Valeur.original^2",
"Resume.Valeur.partial",
"Corps.Valeur.original^2",
"Corps.Valeur.partial"
]
}
}
],
"minimum_should_match":1
}
}
}
},
"aggs" : {
"Count":{
"cardinality":{
"field":"IDDocument"
}
},
"docagg" : {
"terms" : {
"field" : "IDDocument",
"order" : { "version_max>score" : "desc" }
},
"aggs" : {
"version_max" : {
"filter" : { "range" : { "NumeroVersion" : { "gt" : 0 } }},
"aggs" : {
"vMax" : { "max" : { "field" : "NumeroVersion" }},
"score" : { "max" : { "script":"_doc.score"}}
}
}
}
}
}
}

As I understand it, pagination can't be achieved with ES, but we are trying to do so anyway in a different manner. Based on the previous query, the idea would be to be able to filter (eg : range filter or else) on the score that is brought up by the inner bucket (version_max) and set a given size on the docagg bucket.
The concept here is to set up the very same technique used in SQL with CTEs, here is the example :

declare @Document table (
IDDocument uniqueidentifier,
NumeroVersion int,
Score decimal(18,2)
);

declare @idDoc1 uniqueidentifier;
declare @idDoc2 uniqueidentifier;
declare @idDoc3 uniqueidentifier;

set @idDoc1 = newid();
set @idDoc2 = newid();
set @idDoc3 = newid();

insert into @Document values (@idDoc1, 1, 1.75);
insert into @Document values (@idDoc1, 2, 1.5);
insert into @Document values (@idDoc2, 1, 0.75);
insert into @Document values (@idDoc2, 2, 1.25);
insert into @Document values (@idDoc2, 3, 1.95);
insert into @Document values (@idDoc3, 1, 2);

select d.IDDocument, max(d.NumeroVersion) as NumeroVersion, max(d.Score) as ScoreDocument
from @Document d
group by d.IDDocument;

with VersionMax as (
select d.IDDocument, max(d.NumeroVersion) as NumeroVersion, max(d.Score) as ScoreDocument
from @Document d
group by d.IDDocument
)

select top 2 *
from VersionMax v
where v.ScoreDocument < 1.9
order by v.ScoreDocument desc

This last query is the one I want to translate to ES syntax :
-top 2 => in docagg / size
-v.ScoreDocument < 1.9 => term /range filter ?

Any idea on how to do that ?


(Adrien Grand) #4

On Thu, Mar 27, 2014 at 9:45 AM, bob bkoenig@groupama-ge.fr wrote:

select top 2 *
from VersionMax v
where v.ScoreDocument < 1.9
order by v.ScoreDocument desc

This last query is the one I want to translate to ES syntax :
-top 2 => in docagg / size
-v.ScoreDocument < 1.9 => term /range filter ?

Any idea on how to do that ?

You should be able to do that by using a filter aggregation[1] with a
script filter[2] in order to only run the aggregation on a specific score
range.

[1]
http://www.elasticsearch.org/guide/en/elasticsearch/reference/1.x/search-aggregations-bucket-filter-aggregation.html
[2]
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-script-filter.html

--
Adrien Grand

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAL6Z4j6nUMbtoTXRAHPt0cRmQyTRW_NWx0Tgvx-%3DiQmab92fdw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


(bob) #5

Alright, I finally managed to limit the resulting set of documents by adding a size keyword :

{
"fields":[

],
"size":0,
"from":0,
"sort":[
"_score"
],
"query":{
"filtered":{
"query":{
"bool":{
"should":[
{
"multi_match":{
"query":"general",
"fields":[
"Titre.Valeur.original^4",
"Titre.Valeur.partial",
"Resume.Valeur.original^2",
"Resume.Valeur.partial",
"Corps.Valeur.original^2",
"Corps.Valeur.partial"
]
}
}
],
"minimum_should_match":1
}
}
}
},

"aggs" : {

	"Count":{
		 "cardinality":{
			"field":"IDDocument"
		 }
	  },
	  
	  "doc_limit_agg" :{
	  
		"filter" : {
			"script" : {
				"script" : "_doc.score <0.05"
			}
		},
	
		"aggs" : {
			"docagg" : {

				"terms" : {
					"field" : "IDDocument",
					"order" : { "score" : "desc" }
					,"size":4
				},
				"aggs" : {
					"vMax" : { "max" : { "field" : "NumeroVersion" }},
					"score" : { "max" : { "script":"_doc.score"}}
				}
			}
		}
    }
}

}

Pagination could then be addressed based on highest score order and its filtered value (see script above).

Last question, and yet maybe something not supported by ES : is there a way to dynamically calculate a column (like a ROW_NUMBER() in SQL) and use it to sort and filter like above. Actually I'm seeking a better and more robust way to filter/sort than the score column.

Again the SQL equivalent to illustrate what I'm aiming at :

with VersionMax as (
select d.IDDocument, max(d.NumeroVersion) as NumeroVersion, max(d.Score) as ScoreDocument,
ROW_NUMBER() OVER(ORDER BY max(d.Score) DESC, d.IDDocument DESC) as RowNumber
from @Document d
group by d.IDDocument
)
select top 2 *
from VersionMax v
where v.RowNumber < 4
order by v.RowNumber desc


(system) #6