Michael: I would expect that setting the size option on the terms agg to a
smaller value would have a positive impact on the total query time. It
feels like I'm missing something. Can you run the hot threads API while you
run the search request that you shared before? It basically gives a
cluster-wide stack dump and may give me some insight into why your search
request is slow.
Setting the size option of the terms agg to 0 will return all buckets that
can be found on the fingerprint field (which can be millions of buckets),
so I can see how this can bring down your cluster, because that simply
doesn't fit in the Java heap space.
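For anyone following along, here is a minimal Python sketch of how the hot threads call could be issued while the slow search is running. The host and port (localhost:9200) and the helper names are assumptions for illustration, not something taken from this thread; the endpoint is the nodes hot threads API (/_nodes/hot_threads) with its threads and interval parameters.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def hot_threads_url(base_url="http://localhost:9200", threads=3, interval="500ms"):
    """Build the URL for the nodes hot threads endpoint.

    base_url is an assumed address; point it at one of your own nodes.
    """
    query = urlencode({"threads": threads, "interval": interval})
    return f"{base_url.rstrip('/')}/_nodes/hot_threads?{query}"


def fetch_hot_threads(base_url="http://localhost:9200"):
    """Fetch the plain-text hot threads report (hypothetical helper).

    Call this from a second terminal while the slow search request runs.
    """
    with urlopen(hot_threads_url(base_url)) as response:
        return response.read().decode("utf-8")
```

The report comes back as plain text, so it can be pasted into a reply as-is.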
Dustin: The top_hits aggregation is always nested under a bucket
aggregator (for example the terms bucket aggregator). For each bucket that
the terms aggregator creates, the top_hits aggregator will create a
priority queue, in which it maintains the top N docs that fall under that
bucket. So the time spent by the top_hits aggregator, like any other nested
aggregator, depends on the number of buckets being maintained during the
execution of the search request. With top_hits this is more noticeable than
with, for example, a metric agg (min, max, avg etc.), because of the
per-bucket priority queue it maintains.
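The per-bucket priority queue described above can be sketched roughly as follows. This is a simplified Python illustration, not Elasticsearch's actual code; the function name and the (bucket_key, doc_id, score) input shape are assumptions made for the example.

```python
import heapq
from collections import defaultdict


def top_hits_per_bucket(docs, n):
    """Keep the n best-scoring docs for every bucket.

    docs is an iterable of (bucket_key, doc_id, score) tuples (assumed shape).
    One bounded min-heap is kept per bucket, so memory and work grow with
    the number of buckets seen, not only with the number of buckets returned.
    """
    heaps = defaultdict(list)  # bucket_key -> min-heap of (score, doc_id)
    for key, doc_id, score in docs:
        heap = heaps[key]
        if len(heap) < n:
            heapq.heappush(heap, (score, doc_id))
        elif (score, doc_id) > heap[0]:
            # The new doc beats the weakest doc currently kept for this bucket.
            heapq.heapreplace(heap, (score, doc_id))
    # Best-first order per bucket.
    return {key: sorted(heap, reverse=True) for key, heap in heaps.items()}
```

With a high-cardinality field such as fingerprint, the dictionary of heaps is what gets large: every distinct value seen gets its own queue.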
On 7 January 2015 at 20:29, Dustin Boswell dboswell@gmail.com wrote:
I'm curious what the underlying algorithm is for TopHits.
My mental model for ordinary aggregations is that there's basically a hash
table of (field_value -> count) maintained (for each field being
aggregated), and that hash table count is incremented once per document,
and then the top K elements of that hash table are returned to the user.
So there's O(1) work for each document scored, and then a final O(N*logN)
sort on that hash table to get the top K, where N is the number of unique
field_values. It makes sense to me why this implementation would be very
fast.
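A minimal Python sketch of that mental model, for concreteness. It illustrates the counting idea only and is not how Elasticsearch implements the terms aggregation; the function name is made up for the example.

```python
from collections import Counter


def terms_agg(field_values, k):
    """Count each field value once per document, then return the top k.

    field_values is an iterable with one value per matching document.
    """
    counts = Counter()
    for value in field_values:
        counts[value] += 1  # O(1) work per document
    # most_common(k) selects the k largest counts; a full sort of all N
    # unique values is not strictly needed to get the top k.
    return counts.most_common(k)
```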
My mental model for a top_hits aggregation is that there's a hash table of
(field_value -> array(pair(doc_id, score))). And for each document being
scored, that (doc_id, score) is appended to the corresponding array. Again,
there's only O(1) work for each document. At the end, you have to sort
each array, and then sort the hash table, and take the top K1 arrays, and
the top K2 elements of each array, and then for each doc_id, pull out the
relevant fields to return to the user. So definitely more work (and a lot
more memory), but I'm not sure if this would result in the 30x increase in
runtime we're seeing. (And actually, for the special case where
top_hits->size == 1, you only need the top (doc_id, score) seen, not a
whole array, so that would be a lot faster and less memory. But I
understand it needs to be able to handle more general cases.)
Is this at all close to how it works?
On Tuesday, January 6, 2015 11:20:08 PM UTC-8, Martijn v Groningen wrote:
Hi Michael,
In general, the more buckets returned by the parent aggregator that
top_hits is nested in, the more work the top_hits agg needs to do. But I
haven't come across performance issues where, with size on the terms agg
set to 50, the execution time increases 30 times when top_hits is used. To
rule this out on your side, can you play around with the size option on
the terms agg?
Also, perhaps the _source of your documents is relatively large. How does
the top_hits agg perform without the _source option on the top_hits agg?
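One way to run both experiments is to generate the query variants from a small helper and time each one. The sketch below only builds the request bodies; the field and aggregation names are copied from Michael's query quoted further down, while the helper itself and its parameters are made up for illustration.

```python
import json


def build_query(terms_size=50, source_fields=("title",)):
    """Build the top_hits query body with a given terms size.

    Pass source_fields=None to leave the _source option off the top_hits agg.
    """
    top_hits = {"size": 1}
    if source_fields is not None:
        top_hits["_source"] = {"include": list(source_fields)}
    return {
        "size": 0,
        "aggs": {
            "top-fingerprints": {
                "terms": {"field": "fingerprint", "size": terms_size},
                "aggs": {"top_tag_hits": {"top_hits": top_hits}},
            }
        },
    }


# Variants to compare: smaller and larger terms sizes, with and without _source.
variants = [build_query(size, fields) for size in (10, 50, 100) for fields in (("title",), None)]
bodies = [json.dumps(variant) for variant in variants]
```

Comparing the took value of each response should show whether the cost follows the terms size, the _source handling, or neither.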
Martijn
On 6 January 2015 at 22:29, Michael Irani irani....@gmail.com wrote:
Sure. I simplified the query to keep things focused.
This query takes about 3 seconds to run:
{
  "size": 0,
  "aggs": {
    "top-fingerprints": {
      "terms": {
        "field": "fingerprint",
        "size": 50
      },
      "aggs": {
        "top_tag_hits": {
          "top_hits": {
            "size": 1,
            "_source": {
              "include": [
                "title"
              ]
            }
          }
        }
      }
    }
  }
}
This one takes about 80 milliseconds:
{
  "size": 0,
  "aggs": {
    "fingerprints": {
      "terms": {
        "field": "fingerprint",
        "size": 100
      }
    }
  }
}
The result's a bit too big to paste here. Anything specific about it you want me to expose?
Michael.
On Tuesday, January 6, 2015 12:14:55 PM UTC-8, Itamar Syn-Hershko wrote:
Can you share the query and example results please?
--
Itamar Syn-Hershko
http://code972.com | @synhershko https://twitter.com/synhershko
Freelance Developer & Consultant
Author of RavenDB in Action http://manning.com/synhershko/
On Tue, Jan 6, 2015 at 10:11 PM, Michael Irani irani....@gmail.com
wrote:
Hello,
I'm working on a corpus of size approximately 10 million documents.
The issue I'm running into right now is that the top scoring documents that
come back from my query are essentially all the same result. I'm trying to
find a way to get back unique results.
I've looked into modeling the data differently with nested objects or
parent-child relationships, but neither layout seems to fit the bill. The
nested model won't work because some of the documents have too many closely
related objects. On the flip side there are also too many unique documents
for the parent-child relationship to fit.
I then tried the "top hits aggregation" and it's exactly what I'm
looking for, except the running time of the query is approximately 30x
slower than the query without the aggregation. Are there known performance
issues with "top hits"? Any ideas on what I should use to make these
queries? Here's the aggregation piece:
"aggs": {
  "top-fingerprints": {
    "terms": {
      "field": "fingerprint",
      "size": 50
    },
    "aggs": {
      "top_tag_hits": {
        "top_hits": {
          "size": 1,
          "_source": {
            "include": [
              "title"
            ]
          }
        }
      }
    }
  }
}
Thanks,
Michael
--
You received this message because you are subscribed to the Google
Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to elasticsearc...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/29fce15c-79b7-4756-b033-93e490204095%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
--
Kind regards,
Martijn van Groningen