Distinct count for field and High Cardinality Facets

Good morning people,

I have two issues with Elasticsearch queries that I haven't found a way to
solve yet.

Let's say that what I'm indexing is "sharings" of documents, and each
sharing has a document_id field.

  1. I want a count of how many distinct documents appear across all the
    sharings matched by the query. Is there a way to do that? The count
    that is returned is the sharings count, not the document count...
    basically I want a count of distinct document_id values.

  2. I'm also running a facet on id_user over those sharings to know how
    many sharings each user has in the query response... this one works,
    but it slows the request down by a factor of 5: the mean query time
    goes from 200ms to 1sec. id_user is a high-cardinality field, and I
    know that ES has some issues with this kind of faceting...

For point #2, I am currently increasing the size of the query and counting
manually... but that only covers as many documents as the query size
returns. I can live with that, but let's say I would like the top 10 users
with the most documents shared... I could do that with facets (I think...),
or how else could I do it?

So, is that clear enough and, more importantly, is there a way to
do/improve this?

--

OK, first, factor into my response that I'm running on only 4 hours of
sleep and only 3 espressos so far!

On 8 November 2012 02:40, Jérôme Gagnon jerome.gagnon.1@gmail.com wrote:

....

  1. I want a count of how many distinct documents appear across all the
    sharings matched by the query. Is there a way to do that? The count
    that is returned is the sharings count, not the document count...
    basically I want a count of distinct document_id values.

One way to do this is to set up a Parent/Child relationship inside
Elasticsearch, with the Parent being the document. Each share of the
document becomes a Child of the Document parent. Then you could run a
has_child query against the Document (the parent), using a query filter
based on your current share search criteria. The search should return all
the matching parents, so the total number of hits is your number of unique
shared documents. Since you only need the count, you could change your
search to a Count search query, using the has_child as a filter.

You'll need to factor in the memory considerations of this (see the bottom of
http://www.elasticsearch.org/guide/reference/query-dsl/has-child-filter.html).
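As a rough illustration of the parent/child idea, here is a minimal sketch in Python that only builds the request body for a Count request against the parent type. The child type name "sharing" and the example term query are assumptions made for illustration, the child mapping is assumed to declare the document type as its _parent, and whether the body has to be wrapped in a top-level "query" key depends on the Elasticsearch version in use.

```python
import json


def build_distinct_document_count_body(share_query, child_type="sharing"):
    """Build a Count body that matches parents having at least one matching child.

    share_query: the query DSL dict currently used to search sharings.
    child_type:  hypothetical name of the child mapping type (an assumption).
    """
    return {
        "constant_score": {
            "filter": {
                "has_child": {
                    "type": child_type,
                    "query": share_query,
                }
            }
        }
    }


# Example: count distinct documents shared by one (hypothetical) user id.
body = build_distinct_document_count_body({"term": {"id_user": 42}})
print(json.dumps(body, indent=2))
```

The count returned for this body would be the number of parent documents with at least one matching sharing, which is the distinct document count asked for.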

  2. I'm also running a facet on id_user over those sharings to know how
    many sharings each user has in the query response... this one works,
    but it slows the request down by a factor of 5: the mean query time
    goes from 200ms to 1sec. id_user is a high-cardinality field, and I
    know that ES has some issues with this kind of faceting...

For point #2, I am currently increasing the size of the query and counting
manually... but that only covers as many documents as the query size
returns. I can live with that, but let's say I would like the top 10 users
with the most documents shared... I could do that with facets (I think...),
or how else could I do it?

Are you doing a term facet on id_user? It's not clear, but I presume so. A
term statistical facet on id_user ordered by count, with size set to 10,
should show the top 10 users for the shares being searched?
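For reference, a request of the kind described here could be assembled as in the sketch below. It uses a plain terms facet rather than terms_stats; the facet name "top_users" and the match_all query are placeholders chosen for illustration, not something taken from the original query.

```python
import json


def build_top_users_request(share_query, top_n=10):
    """Build a search body with a terms facet on id_user, ordered by count."""
    return {
        "query": share_query,
        "facets": {
            # "top_users" is a hypothetical facet name.
            "top_users": {
                "terms": {
                    "field": "id_user",
                    "size": top_n,      # number of facet entries returned
                    "order": "count",   # most frequent users first
                }
            }
        },
    }


request = build_top_users_request({"match_all": {}})
print(json.dumps(request, indent=2))
```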

Hopefully someone with more sleep/coffee and/or smarter can help here.

cheers,

Paul

--

Yes, I thought it might be feasible this way... but I can't really index
all the documents plus the sharings, since sadly that would just make too
many documents...

That's exactly what I am doing... but do you think that a terms_stats facet
would perform better than a standard terms facet?

--

After a bit more sleep, the only real way I can see a distinct count being
done in a statistical/facet-like collection mode during your search would
require custom ES plugins to be developed. Keeping track of distinct values
as the Facet collection code walks the field values is tricky to do in a
resource-efficient way (particularly for memory).

In the case where the distinct values being tracked are numeric, one
obvious solution is to keep a bitset data structure around and flip the Nth
bit when you see value N. For large numeric values, though, the bitset gets
pretty large. Another option would be to use Trie-style structures (which
Lucene uses a fair bit) to keep a more memory-efficient structure; the
number of distinct values is then just the count of nodes that terminate a
value, and can be tracked simply as new values are added. Tries could even
be used for distinct string value tracking (this is how dictionaries are
often done), though probably taking more memory than a purely numeric Trie.
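To make the two ideas concrete, here is a small standalone Python sketch of a bitset counter and a trie counter. It is only an illustration of the data structures, not an Elasticsearch facet collector; the class names are made up, and the bitset version assumes non-negative integers no larger than a known maximum.

```python
class BitsetDistinctCounter:
    """Counts distinct integers in [0, max_value] by flipping bit N for value N."""

    def __init__(self, max_value):
        # One bit per possible value; memory grows with max_value, not with
        # the number of values actually seen.
        self._bits = bytearray(max_value // 8 + 1)
        self.distinct = 0

    def add(self, value):
        byte_index, mask = value >> 3, 1 << (value & 7)
        if not self._bits[byte_index] & mask:
            self._bits[byte_index] |= mask
            self.distinct += 1


class TrieDistinctCounter:
    """Counts distinct strings; a value is new if its end marker is missing."""

    _END = object()  # sentinel key marking "a value terminates at this node"

    def __init__(self):
        self._root = {}
        self.distinct = 0

    def add(self, value):
        node = self._root
        for character in value:
            node = node.setdefault(character, {})
        if self._END not in node:
            node[self._END] = True
            self.distinct += 1


document_ids = [7, 3, 7, 12, 3, 3]
bitset_counter = BitsetDistinctCounter(max_value=max(document_ids))
for document_id in document_ids:
    bitset_counter.add(document_id)

trie_counter = TrieDistinctCounter()
for name in ["alice", "bob", "alice", "al"]:
    trie_counter.add(name)

print(bitset_counter.distinct, trie_counter.distinct)
```

The bitset costs one bit per possible value, so a sparse set of very large ids wastes most of it, whereas the trie only pays for prefixes that actually occur.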

It's not built into ES, though; someone would need to write a Facet
collector that could do this. It would be a pretty darn good feature, I
would think. Anyone else got any bright ideas?

Right now, pre-Lucene 4 (ES 0.21, I believe), faceting on high-cardinality
fields is probably going to take a hit performance-wise. I'm hopeful this
will improve over time, but for now I think we just need to make sure we
provide enough resources to satisfy it. I can't think of any other way to
do this right now.

Paul

--

Hi Jerome,

If I understand correctly, you want to count how many documents were shared
for a given query and how many users have shared a document that matched the
query. If so, both boil down to knowing how many distinct values these
fields have within the query result set.

Andrew Clegg gave a talk at the last London meetup about counting distinct
values in very large datasets. If you can live with the potential
inaccuracies, you might find it useful: ElasticSearch approx talk (Google
Slides),
https://docs.google.com/presentation/d/1ESNiqd7HuIfuwXSSK81PAAu6AmEPEE0u_vyk4FU5x9o/present
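To give a feel for what approximate distinct counting looks like, here is a minimal sketch of linear counting in Python. This is one well-known technique chosen for illustration and is not necessarily the method presented in the talk; the function name and the default bitmap size are assumptions.

```python
import hashlib
import math


def approximate_distinct_count(values, num_bits=8192):
    """Estimate the number of distinct values using a fixed-size bitmap.

    Each value is hashed to one of num_bits positions (num_bits should be a
    multiple of 8). The estimate is derived from the fraction of bits that
    remain unset, so memory is bounded no matter how many values are seen.
    """
    bitmap = bytearray(num_bits // 8)
    for value in values:
        digest = hashlib.md5(str(value).encode("utf-8")).digest()
        position = int.from_bytes(digest[:8], "big") % num_bits
        bitmap[position >> 3] |= 1 << (position & 7)

    set_bits = sum(bin(byte).count("1") for byte in bitmap)
    zero_bits = num_bits - set_bits
    if zero_bits == 0:
        raise ValueError("bitmap is saturated; use a larger num_bits")
    return -num_bits * math.log(zero_bits / num_bits)


# 5000 sharings spread over 1000 distinct (hypothetical) document ids.
sample_document_ids = [i % 1000 for i in range(5000)]
estimate = approximate_distinct_count(sample_document_ids)
print(round(estimate))
```

The trade-off is the one mentioned above: the result is an estimate, and its accuracy degrades as the number of distinct values approaches the size of the bitmap.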

If you want the top 10 most shared documents or the top 10 most active
sharing users, facets are indeed your friend. Are these ids numbers or
strings? If they are strings, you might run into memory issues when
faceting on them.

Cheers,
Boaz

--

Wow, thank you, the presentation is awesome!! Now, I would be happy to use
this if it existed for something other than the date histogram... like a
terms facet without any time constraints. :)

Both of my facet fields are of type long... so I haven't run into any
memory issues yet, but faceting on both of these fields is still
problematic because it slows the query down too much.

As for potential inaccuracies, I'm already implementing some code that
uses this feature... so if the askedCount is smaller than the returned
count, I count the distinct documents AND the sharings myself... But the
use case that still bugs me is getting a facet of, let's say, N users from
a query... with an offset and a size limit.
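The client-side counting described here could look roughly like the following Python sketch. It assumes the sharings have already been fetched and reduced to dicts with document_id and id_user keys; the function name and the sample data are made up for illustration, and the result only reflects the hits that were actually returned.

```python
from collections import Counter


def summarize_hits(hits, offset=0, limit=10):
    """Count distinct documents and rank users by number of sharings.

    hits: list of dicts, each with "document_id" and "id_user" keys.
    Returns (distinct_document_count, page_of_(id_user, sharing_count)).
    """
    distinct_documents = {hit["document_id"] for hit in hits}
    sharings_per_user = Counter(hit["id_user"] for hit in hits)
    # Most sharings first; ties broken by user id for a stable order.
    ranked = sorted(sharings_per_user.items(), key=lambda item: (-item[1], item[0]))
    return len(distinct_documents), ranked[offset:offset + limit]


sample_hits = [
    {"document_id": 1, "id_user": 10},
    {"document_id": 1, "id_user": 11},
    {"document_id": 2, "id_user": 10},
    {"document_id": 3, "id_user": 12},
    {"document_id": 3, "id_user": 10},
]
document_count, user_page = summarize_hits(sample_hits, offset=1, limit=1)
print(document_count, user_page)
```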

Jerome

--

Hi Jerome

I'm not sure I follow what you want to achieve. Can you give an example of
what you mean by: "But the use case that still bugs me is to have a facet
size of let's say N users from a query... with offsetting and size limit."

Boaz

--

Let's say I run a query on sharings and I want a fixed number of users...
size:10, from:10... I cannot set the size of the query large enough to be
sure that I get enough users (if that is even possible). I believe it could
be done with a terms facet by setting its size... but that slows the request
down from 200ms to > 1 sec... Even if I ask for more documents, I can never
be sure to achieve this...

The query would look something like this...
https://gist.github.com/bb6ee6d66e25ac805bea

--

The size parameter indeed controls how many records you get back from your
search. In your case these will be sharings. It's important to know that
the bigger the size, the slower the response, as it needs to fetch more
documents from the store. ES (and Lucene) are optimized to get the top X
results, not to go too deep into the result set. That said, getting a
couple of hundred shouldn't be a problem, I think.

As for facets: facets analyze the entire result set regardless of the size
parameter (both that of the search and that of the facet itself). The
facet's size parameter only controls what it returns. It does have some
performance impact, but it is not big.
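Since the facet is computed over the whole result set anyway, one way to get "N users with an offset" is to ask the terms facet for offset + limit entries and slice the list on the client. The sketch below assumes that, as far as I know, the terms facet has a size but no offset parameter; the facet name "users" and the hand-written sample response are assumptions for illustration.

```python
def build_paged_facet_request(share_query, offset, limit):
    """Request no hits at all, only enough facet entries to cover the page."""
    return {
        "query": share_query,
        "size": 0,  # skip fetching sharings; only the facet is needed
        "facets": {
            "users": {"terms": {"field": "id_user", "size": offset + limit}}
        },
    }


def page_of_users(response, offset, limit):
    """Slice the requested page out of a terms facet response."""
    entries = response["facets"]["users"]["terms"]
    return [(entry["term"], entry["count"]) for entry in entries[offset:offset + limit]]


paged_request = build_paged_facet_request({"match_all": {}}, offset=1, limit=2)

# Hand-written response in the shape of a terms facet result.
sample_response = {
    "facets": {
        "users": {
            "_type": "terms",
            "terms": [
                {"term": 10, "count": 7},
                {"term": 12, "count": 5},
                {"term": 11, "count": 2},
                {"term": 15, "count": 1},
            ],
        }
    }
}
print(page_of_users(sample_response, offset=1, limit=2))
```

This keeps the cost of the facet itself, which is the slow part being discussed, but avoids pulling large numbers of sharings just to count users.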

It sounds to me like a terms facet is what you need. If that is too slow,
you might want to consider changing your data model so that you can sort on
top users directly.

Boaz
