Terms aggregation with a limit


(Guillermo Arias del Río) #1

Hi!

I am using a terms aggregation to get the 10 best terms that match a query.
The problem is that, since the query returns a lot of documents, the number
of distinct terms is very big, as is the number of documents per bucket. An
example would be:

```json
{
  "aggregations" : {
    ... // filters
    "not_exact" : {
      "doc_count" : 2257428,
      "text" : {
        "buckets" : [ {
          "key" : "abb",
          "doc_count" : 135686
        }, {
          "key" : "ansprache",
          "doc_count" : 118570
        }, {
          "key" : "aus",
          "doc_count" : 106023
        }, {
          "key" : "auf",
          "doc_count" : 74338
        }, {
          "key" : "archiv",
          "doc_count" : 54315
        }, {
          "key" : "außen",
          "doc_count" : 52444
        }, {
          "key" : "am",
          "doc_count" : 52178
        }, {
          "key" : "ab",
          "doc_count" : 45723
        }, {
          "key" : "an",
          "doc_count" : 44656
        }, {
          "key" : "athen",
          "doc_count" : 32070
        } ]
      },
      ...
    }
```
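For context, a request producing a response like that would look roughly as follows. This is only a sketch: the field name `text` and the query are placeholders, and the actual request also wraps the aggregation in filters, as the elided parts of the response above suggest.

```json
{
  "size" : 0,
  "query" : { "match" : { "text" : "a" } },
  "aggregations" : {
    "not_exact" : {
      "terms" : { "field" : "text", "size" : 10 }
    }
  }
}
```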

I am not interested in the actual document counts, and I would even be
willing to sacrifice precision if it speeds up the query (which currently
takes 6 seconds). So my question is: is there a way to tell the terms
aggregation to stop counting at a certain limit? Imagine, for instance,
that I could set this limit to 50 000. The top buckets might then come back
in the wrong order, but I could live with that, and I suppose Elasticsearch
would take less time. I would even be happy with a limit of 10 000, even if
it gave me different keys, because as the user types more characters,
counts above 10 000 become less and less probable.

And if that is possible, the next question would be: is there a way to make
this limit depend on the total number of matches (in this case 2 257 428)?

For those who are interested: the background of this request is an
autocompletion feature that matches against different fields, using ngram
or edge_ngram depending on the field type, and returns the best matches. In
the example above, the user types "a" and gets those results. If you are
thinking about caching results, that is generally not possible: the query
is constrained by document types and fields; the data changes; and,
finally, each user can have a different view on it (so a user may not be
allowed to see any documents containing "athen", in which case that match
would be incorrect).

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/1278cc4c-83b6-4019-b7bf-17a1aae45e0a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Adrien Grand) #2

Unfortunately, I don't think stopping the counting after a certain limit
would improve response times: most of the time is spent not incrementing
counters but probing the hash table to figure out whether the current term
is new or has already been seen. Maybe you can specify a timeout (
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-body.html#_parameters_4)
on your requests? This returns partial results if the timeout is exceeded.
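As a sketch, the timeout can be set in the search request body alongside the aggregation; the `500ms` value here is an arbitrary example, not a recommendation, and the field name is a placeholder:

```json
{
  "timeout" : "500ms",
  "size" : 0,
  "query" : { "match" : { "text" : "a" } },
  "aggregations" : {
    "not_exact" : {
      "terms" : { "field" : "text", "size" : 10 }
    }
  }
}
```

When the timeout is hit, the response sets `"timed_out" : true` and returns whatever results were gathered up to that point.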

It is not possible to configure aggregations depending on the number of
matches because the number of matches is computed in parallel with
aggregations.

On Wed, May 28, 2014 at 11:54 AM, Guillermo Arias del Río <
ariasdelrio@gmail.com> wrote:


--
Adrien Grand

