Bucket query results | top hits performance

Hello,
I'm working on a corpus of approximately 10 million documents. The
issue I'm running into right now is that the top-scoring documents that
come back from my query are essentially all the same result. I'm trying to
find a way to get back unique results.

I've looked into modeling the data differently with nested objects or
parent-child relationships, but neither layout seems to fit the bill. The
nested model won't work because some of the documents have too many closely
related objects. On the flip side there are also too many unique documents
for the parent-child relationship to fit.

I then tried the "top hits aggregation" and it's exactly what I'm looking
for, except the running time of the query is approximately 30x slower than
the query without the aggregation. Are there known performance issues with
"top hits"? Any ideas on what I should use to make these queries? Here's
the aggregation piece:
"aggs": {
    "top-fingerprints": {
        "terms": {
            "field": "fingerprint",
            "size": 50
        },
        "aggs": {
            "top_tag_hits": {
                "top_hits": {
                    "size": 1,
                    "_source": {
                        "include": [
                            "title"
                        ]
                    }
                }
            }
        }
    }
}

Thanks,
Michael

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/29fce15c-79b7-4756-b033-93e490204095%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Can you share the query and example results please?

--

Itamar Syn-Hershko
http://code972.com | @synhershko https://twitter.com/synhershko
Freelance Developer & Consultant
Author of RavenDB in Action http://manning.com/synhershko/

In reply to Michael Irani (irani.michael@gmail.com), Tue, Jan 6, 2015 at 10:11 PM

Sure. I simplified the query to keep things focused.

This query takes about 3 seconds to run:

{
    "size": 0,
    "aggs": {
        "top-fingerprints": {
            "terms": {
                "field": "fingerprint",
                "size": 50
            },
            "aggs": {
                "top_tag_hits": {
                    "top_hits": {
                        "size": 1,
                        "_source": {
                            "include": [
                                "title"
                            ]
                        }
                    }
                }
            }
        }
    }
}

This one takes about 80 milliseconds:

{
    "size": 0,
    "aggs": {
        "fingerprints": {
            "terms": {
                "field": "fingerprint",
                "size": 100
            }
        }
    }
}

The result's a bit too big to paste here. Anything specific about it you want me to expose?

Michael.

In reply to Itamar Syn-Hershko, Tuesday, January 6, 2015 12:14:55 PM UTC-8

Hi Michael,

In general, the more buckets returned by the parent aggregator that
top_hits is nested in, the more work the top_hits agg needs to do. However,
I haven't come across a case where setting size on the terms agg to 50 made
execution 30 times slower once top_hits was added. To rule this out on your
side, can you play around with the size option on the terms agg?

Also, perhaps the _source of your documents is relatively large. How does
the top_hits agg perform without the _source option on the top_hits agg?

Martijn

In reply to Michael Irani (irani.michael@gmail.com), 6 January 2015 at 22:29

--
Kind regards,

Martijn van Groningen

Martijn,
Thanks for thinking about this. I tried changing the size on the terms agg
to 1, 5, 10, 25, and 50, and the timing didn't change much. Interestingly, I
also set the size to 0, which in turn took down our cluster. I tried removing
the _source option and that didn't have any noticeable effect on performance.
The payload for each of our documents is about 5k.

Michael.

In reply to Martijn v Groningen, Tuesday, January 6, 2015 11:20:08 PM UTC-8

I'm curious what the underlying algorithm is for TopHits.

My mental model for ordinary aggregations is that there's basically a hash
table of (field_value -> count) maintained (for each field being
aggregated), and that hash table count is incremented once per document,
and then the top K elements of that hash table are returned to the user.
So there's O(1) work for each document scored, and then a final O(N*logN)
sort on that hash table to get the top K, where N is the number of unique
field_values. It makes sense to me why this implementation would be very
fast.

My mental model for a top_hits aggregation is that there's a hash table of
(field_value -> array(pair(doc_id, score))). And for each document being
scored, that (doc_id, score) is appended to the corresponding array. Again,
there's only O(1) work for each document. At the end, you have to sort
each array, and then sort the hash table, and take the top K1 arrays, and
the top K2 elements of each array, and then for each doc_id, pull out the
relevant fields to return to the user. So definitely more work (and a lot
more memory), but I'm not sure if this would result in the 30x increase in
runtime we're seeing. (And actually, for the special case where
top_hits->size == 1, you only need the top (doc_id, score) seen, not a
whole array, so that would be a lot faster and less memory. But I
understand it needs to be able to handle more general cases.)

Is this at all close to how it works?
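
To make the two mental models above concrete, here is a small Python sketch. It only illustrates the model as described in this post and is not Elasticsearch's actual implementation; the `docs` sample data and the function names `terms_model` and `top_hits_model` are made up for the example.

```python
import heapq
from collections import Counter, defaultdict

# Hypothetical matched documents: (doc_id, fingerprint, score).
docs = [
    (1, "a", 0.9),
    (2, "a", 0.4),
    (3, "b", 0.7),
    (4, "c", 0.2),
    (5, "b", 0.8),
]

def terms_model(docs, k):
    """Model of a plain terms agg: one counter per field value, then the top K."""
    counts = Counter(fp for _, fp, _ in docs)  # O(1) work per document
    # Final sort over the N unique values: by count descending, then by key.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

def top_hits_model(docs, k1, k2):
    """Model of terms + top_hits: append (score, doc_id) to a per-value array,
    then keep the top K1 values and the top K2 hits of each."""
    hits = defaultdict(list)
    for doc_id, fp, score in docs:
        hits[fp].append((score, doc_id))  # O(1) append per document
    top_values = sorted(hits, key=lambda fp: (-len(hits[fp]), fp))[:k1]
    return {fp: heapq.nlargest(k2, hits[fp]) for fp in top_values}

print(terms_model(docs, 2))
print(top_hits_model(docs, 2, 1))
```

Note that `top_hits_model` keeps every (score, doc_id) pair until the end, which is where the extra memory in this model comes from.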

Michael: I would expect that setting the size option on the terms agg
to a smaller value would have a positive impact on the total query time.
It feels like I'm missing something. Can you run the hot threads API
(http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-nodes-hot-threads.html#cluster-nodes-hot-threads)
while you run the search request that you shared before? This basically
gives a cluster-wide stack dump and may give me some insight into why your
search request is slow.

Setting the size option of the terms agg to 0 will return all buckets that
can be found on the fingerprint field (which can be millions of buckets),
so I can see how this can bring down your cluster, because that simply
doesn't fit in the Java heap space.

Dustin: The top_hits aggregation is always nested under a bucket
aggregator (for example the terms bucket aggregator). For each bucket the
terms aggregator creates, the top_hits aggregator creates a priority
queue in which it maintains the top N docs that fall under that bucket. So
the time spent by the top_hits aggregator, like any other nested aggregator,
depends on the number of buckets being maintained during the execution of
the search request. With top_hits this is more noticeable than with, for
example, a metric agg (min, max, avg etc.), because of what the top_hits
aggregator does.
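
As a rough illustration of the per-bucket priority queue described above, the following Python sketch keeps a bounded min-heap of the top N (score, doc_id) pairs for every bucket it sees. It is a simplified model, not the actual Elasticsearch code; `collect_top_hits` and the sample `docs` are hypothetical names and data.

```python
import heapq
from collections import defaultdict

def collect_top_hits(docs, n):
    """Keep the n best-scoring docs per bucket, one bounded min-heap per bucket."""
    queues = defaultdict(list)  # bucket key -> min-heap of (score, doc_id)
    for doc_id, bucket, score in docs:
        heap = queues[bucket]  # a queue is created for every bucket seen
        if len(heap) < n:
            heapq.heappush(heap, (score, doc_id))
        elif (score, doc_id) > heap[0]:
            # New doc beats the weakest hit in the queue, so swap it in.
            heapq.heapreplace(heap, (score, doc_id))
    # Best hit first within each bucket.
    return {bucket: sorted(heap, reverse=True) for bucket, heap in queues.items()}

# Hypothetical matched documents: (doc_id, bucket key, score).
docs = [(1, "a", 0.9), (2, "a", 0.4), (3, "b", 0.7), (4, "a", 0.6), (5, "c", 0.1)]
result = collect_top_hits(docs, n=2)
print(result)
```

The number of heaps grows with the number of distinct bucket keys, not with the number of buckets finally returned, which is why the bucket count matters for the total cost.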

In reply to Dustin Boswell (dboswell@gmail.com), 7 January 2015 at 20:29

Michael & Dustin, what should reduce the query time a lot is setting
collect_mode to breadth_first on the top-fingerprints agg. Like this:
GET /_search?search_type=count
{
    "aggs": {
        "top-fingerprints": {
            "terms": {
                "field": "fingerprint",
                "size": 50,
                "collect_mode": "breadth_first"
            },
            "aggs": {
                "top_tag_hits": {
                    "top_hits": {
                        "size": 1,
                        "_source": {
                            "include": [
                                "title"
                            ]
                        },
                        "sort": {
                            "_doc": {}
                        }
                    }
                }
            }
        }
    }
}

By default the top_hits agg creates and maintains a priority hit queue
for every bucket that the terms agg creates, including the ones outside
the top 50, which can potentially be millions. By telling the terms agg to
run in breadth_first mode, the top_hits agg only creates and maintains a
priority hit queue for the top 50 buckets instead of all buckets. This
should make things much better performance-wise. There is one catch: the
top_hits agg can't sort by score any more (which is the default), because
the breadth_first collect mode doesn't buffer scores. That is why a sort is
defined on the top_hits agg. In this example I sort by Lucene docid, which
is kind of arbitrary because you have no control over these sort values,
but you can sort by any field in your mapping.
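
To make the difference concrete, here is a rough Python sketch of the two collect modes. It is only a toy model of the idea described above, not Elasticsearch's actual code: the function names and the sample data are made up, and both variants keep the first n doc ids per bucket (a stand-in for the `_doc` sort) so the results are comparable.

```python
from collections import Counter, defaultdict

def collect_depth_first(docs, size, n=1):
    """One pass: a hit queue is opened for every bucket that is seen."""
    counts, queues = Counter(), defaultdict(list)
    for doc_id, key in docs:
        counts[key] += 1
        if len(queues[key]) < n:      # touching queues[key] opens a queue
            queues[key].append(doc_id)
    winners = [k for k, _ in counts.most_common(size)]
    return {k: queues[k] for k in winners}, len(queues)

def collect_breadth_first(docs, size, n=1):
    """Two passes: count first, then open queues only for the winning buckets."""
    counts, buffered = Counter(), []
    for doc_id, key in docs:
        counts[key] += 1
        buffered.append((doc_id, key))   # doc ids are buffered, scores are not
    winners = [k for k, _ in counts.most_common(size)]
    winner_set = set(winners)
    queues = defaultdict(list)
    for doc_id, key in buffered:         # replay in doc-id order
        if key in winner_set and len(queues[key]) < n:
            queues[key].append(doc_id)
    return {k: queues[k] for k in winners}, len(queues)

# Hypothetical (doc_id, fingerprint) pairs, already in doc-id order.
docs = [(0, "a"), (1, "b"), (2, "a"), (3, "c"), (4, "a"), (5, "b"), (6, "d")]
df_hits, df_queues = collect_depth_first(docs, size=2)
bf_hits, bf_queues = collect_breadth_first(docs, size=2)
print(df_hits, df_queues)
print(bf_hits, bf_queues)
```

Both variants return the same hits for the top buckets, but the depth-first version opens one queue per distinct fingerprint while the breadth-first version opens at most `size` of them.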

More information about collect mode:
1) http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_collect_mode
2) http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_preventing_combinatorial_explosions.html#_depth_first_versus_breadth_first

On 8 January 2015 at 10:56, Martijn v Groningen <martijn.v.groningen@gmail.com> wrote:

Michael: I would expect that setting the size option on the terms agg
to a smaller value would have a positive impact on the total query time.
It feels like I'm missing something. Can you run the hot threads API (
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-nodes-hot-threads.html#cluster-nodes-hot-threads)
while you run the search request that you shared before? This basically
gives a cluster-wide stack dump and may give me some insight into why your
search request is slow.

Setting the size option of the terms agg to 0 will return all buckets
that can be found on the fingerprint field (which can be millions of
buckets), so I can see how this can bring down your cluster, because that
simply doesn't fit in the Java heap space.

Dustin: The top_hits aggregation is always nested under a bucket
aggregator (for example the terms bucket aggregator). For each bucket that
the terms aggregator creates, the top_hits aggregator creates a priority
queue in which it maintains the top N docs that fall under that bucket. So
the time spent by the top_hits aggregator, like any other nested
aggregator, depends on the number of buckets being maintained during the
execution of the search request. With top_hits this is more noticeable than
with, for example, a metric agg (min, max, avg etc.), because the top_hits
aggregator has to maintain a whole queue per bucket.
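
As an illustration of the "one priority queue per bucket" idea, here is a minimal Python sketch. It is not how Elasticsearch is implemented internally; the function name, the tuple layout and the sample scores are assumptions made for the example.

```python
import heapq
from collections import defaultdict

def top_hits_per_bucket(docs, n=1):
    """Keep the n best-scoring docs for every bucket key.

    docs is an iterable of (doc_id, bucket_key, score) tuples.
    """
    queues = defaultdict(list)  # bucket_key -> min-heap of (score, doc_id)
    for doc_id, key, score in docs:
        heap = queues[key]
        if len(heap) < n:
            heapq.heappush(heap, (score, doc_id))
        elif (score, doc_id) > heap[0]:
            # Better than the weakest hit kept so far: swap it in.
            heapq.heapreplace(heap, (score, doc_id))
    # Best hit first within each bucket.
    return {key: sorted(heap, reverse=True) for key, heap in queues.items()}

# Hypothetical (doc_id, fingerprint, score) tuples.
docs = [(1, "a", 0.3), (2, "a", 0.9), (3, "b", 0.5), (4, "a", 0.1)]
result = top_hits_per_bucket(docs, n=1)
print(result)
```

Every distinct bucket key gets its own heap, so the memory and bookkeeping cost grows with the number of buckets rather than with the number of buckets finally returned.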

On 7 January 2015 at 20:29, Dustin Boswell dboswell@gmail.com wrote:

I'm curious what the underlying algorithm is for TopHits.

My mental model for ordinary aggregations is that there's basically a
hash table of (field_value -> count) maintained (for each field being
aggregated), and that hash table count is incremented once per document,
and then the top K elements of that hash table are returned to the user.
So there's O(1) work for each document scored, and then a final O(N*logN)
sort on that hash table to get the top K, where N is the number of unique
field_values. It makes sense to me why this implementation would be very
fast.
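
The mental model in the paragraph above could be sketched in Python roughly like this. It is only an illustration of that model, not of what the terms aggregation really does; the function name and sample values are made up. It uses heapq.nlargest for the final step, which selects the top K without fully sorting all N entries.

```python
import heapq
from collections import Counter

def top_terms(field_values, k):
    """Count each field value, then return the k most frequent (value, count) pairs."""
    counts = Counter()
    for value in field_values:   # constant work per document
        counts[value] += 1
    return heapq.nlargest(k, counts.items(), key=lambda item: item[1])

# Hypothetical field values, one per document.
values = ["x", "y", "x", "z", "x", "y"]
top = top_terms(values, 2)
print(top)
```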

My mental model for a top_hits aggregation is that there's a hash table
of (field_value -> array(pair(doc_id, score))). And for each document
being scored, that (doc_id, score) is appended to the corresponding array.
Again, there's only O(1) work for each document. At the end, you have to
sort each array, and then sort the hash table, and take the top K1 arrays,
and the top K2 elements of each array, and then for each doc_id, pull out
the relevant fields to return to the user. So definitely more work (and a
lot more memory), but I'm not sure if this would result in the 30x increase
in runtime we're seeing. (And actually, for the special case where
top_hits->size == 1, you only need the top (doc_id, score) seen, not a
whole array, so that would be a lot faster and less memory. But I
understand it needs to be able to handle more general cases.)
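
For what it's worth, the two variants described above (append everything and sort at the end, versus keeping only the best hit when size is 1) might look like this in Python. This only sketches the mental model from this message, with made-up function names and data, and says nothing about the real implementation.

```python
from collections import defaultdict

def append_then_sort(docs, k2):
    """General model: append every (doc_id, score) per bucket, sort at the end."""
    arrays = defaultdict(list)
    for doc_id, key, score in docs:
        arrays[key].append((doc_id, score))
    return {
        key: sorted(hits, key=lambda hit: hit[1], reverse=True)[:k2]
        for key, hits in arrays.items()
    }

def best_only(docs):
    """Special case for size == 1: remember just the best (doc_id, score) per bucket."""
    best = {}
    for doc_id, key, score in docs:
        if key not in best or score > best[key][1]:
            best[key] = (doc_id, score)
    return best

# Hypothetical (doc_id, fingerprint, score) tuples.
docs = [(1, "a", 0.3), (2, "a", 0.9), (3, "b", 0.5), (4, "a", 0.1)]
general = append_then_sort(docs, 1)
special = best_only(docs)
print(general)
print(special)
```

The special case stores a single pair per bucket instead of a growing array, which is where the memory saving would come from.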

Is this at all close to how it works?

On Tuesday, January 6, 2015 11:20:08 PM UTC-8, Martijn v Groningen wrote:

Hi Michael,

In general, the more buckets returned by the parent aggregator that
top_hits is nested in, the more work the top_hits agg needs to do. However,
I haven't come across a case where, with size on the terms agg set to 50,
the execution time increased 30 times when top_hits was used. To rule this
out on your side, can you play around with the size option on the terms
agg?

Also, perhaps the _source of your documents is relatively large. How
does the top_hits agg perform without the _source option?

Martijn

On 6 January 2015 at 22:29, Michael Irani irani....@gmail.com wrote:

Sure. I simplified the query to keep things focused.

This query takes about 3 seconds to run:

{
    "size": 0,
    "aggs": {
        "top-fingerprints": {
            "terms": {
                "field": "fingerprint",
                "size": 50
            },
            "aggs": {
                "top_tag_hits": {
                    "top_hits": {
                        "size": 1,
                        "_source": {
                            "include": [
                                "title"
                            ]
                        }
                    }
                }
            }
        }
    }
}

This one takes about 80 milliseconds:

{
    "size": 0,
    "aggs": {
        "fingerprints": {
            "terms": {
                "field": "fingerprint",
                "size": 100
            }
        }
    }
}

The result's a bit too big to paste here. Anything specific about it you want me to expose?

Michael.

--
Kind regards,

Martijn van Groningen
