Terms Facet other property behavior vs documentation question

The documentation at:
http://www.elasticsearch.org/guide/reference/api/search/facets/

States:
"the number of facet values not included in the returned facets (other), "

In my testing, the number of facet entries returned by my query is
limited to 10 and "other" reports something on the order of 200,000. If I
set the limit very high, only 34 facets are returned, with "other" = 0.
So the behavior of "other" does not seem to match the documentation. Is
the "other" field actually the number of documents that match the
facet but aren't in one of the 10 returned facets?

Is there a way to get the count of all the unique facet rows that match
without getting the values so I can safely give the user the option to see
them all only if there aren't too many?

Thanks,
Kevin

--

Your assumption is correct: the "other" field is not how many other facets
are remaining, but the number of documents not covered by the returned facets.

There is no way to get all the unique facets. I believe there might be an
issue on GitHub. Jorg wrote a termlist plugin, which will get all the
unique terms for a field. If your facet is a terms facet, then this plugin
would be equivalent.

Getting the entire term list can be slow, depending on your data. If you
have only 34 facets and you know this number will not change much, you can
try increasing the number of facets returned. It all depends on the size of
your data.

Cheers,

Ivan

--

Bummer. Well, what I'm trying to do is display the top 10 facets in the
term. If there are more than 10, display a "more" button which pops up a
dialog that reissues the query, requesting and displaying all of the facets.
This works very well except when the facet count for the term is huge. If it
is that large, it's probably not going to be useful to the user anyway, so I
want to just ignore the whole term in that case. If I could get the top 10
hits plus the number of remaining ones, the UI work would be simple. Plus it
would be low overhead over the network. I'm thinking, since the sort by count
on the term happens already, the whole list of facets must be loaded into
memory and is a complete list. I should be able to just add an "extra" field
or something that simply returns the number of facets in that list minus the
ones returned. I think the code would be a simple patch to
src/main/java/org/elasticsearch/search/facet/terms/strings/InternalStringTermsFacet.java.
Is this a reasonable thing to do?
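
The display logic described above can be sketched roughly as follows. This is only an illustration in Python: `remaining_terms` is the hypothetical count of terms not returned (which the facet response does not currently provide), and `max_expand` is an arbitrary threshold picked for the example.

```python
def facet_display_mode(shown_terms, remaining_terms, max_expand=500):
    """Decide how the UI should present one terms facet.

    remaining_terms is hypothetical: the number of distinct terms that were
    not returned. max_expand is an arbitrary cutoff for this sketch.
    """
    if remaining_terms == 0:
        return "list"            # everything fits, no "more" button needed
    if shown_terms + remaining_terms > max_expand:
        return "hide"            # too many terms to be useful to the user
    return "list_with_more"      # show the top terms plus a "more" button


print(facet_display_mode(10, 24))
print(facet_display_mode(10, 250000))
```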

Thanks,
Kevin

--

BTW, my previous response was incorrect in that you can ask for all the
terms back. The setting is "all_terms" : true. Since my use case is similar
to yours, where getting all terms is inefficient, I have never done so.
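
For reference, a sketch of what such a request body might look like, built in Python. The field name `tag` and the facet name `f1` are placeholders, and the layout follows the terms facet syntax of the facets API as I understand it, so treat it as an assumption rather than a verified request.

```python
import json


def terms_facet_body(field, size=10, all_terms=False, facet_name="f1"):
    """Build a search body containing a single terms facet.

    field and facet_name are placeholders chosen for this sketch.
    """
    terms = {"field": field, "size": size}
    if all_terms:
        # The "all_terms" setting mentioned above.
        terms["all_terms"] = True
    return {"query": {"match_all": {}}, "facets": {facet_name: {"terms": terms}}}


print(json.dumps(terms_facet_body("tag", all_terms=True), indent=2))
```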

--

So, the all_terms field says it returns all of the facets with counts = 0
even if there is nothing that matches that facet. That means, if there are
a lot of them, it sends them all across the wire. What I'm trying to do is
get the count of all facets minus the elements returned, so I can determine
if I want to requery with something like all_terms set to true or if it
would be too expensive to do so. Any idea where in the code I could add
that?

Thanks,
Kevin

--

The terms facet works off the term list found in the Lucene index. There
can never be a terms facet with a zero count. Facets never return the
elements, just the term and the count.

If you just want a list of terms, take a look at Jorg's plugin:
https://github.com/jprante/elasticsearch-index-termlist

The Lucene code is found here:
https://github.com/jprante/elasticsearch-index-termlist/blob/master/src/main/java/org/elasticsearch/action/termlist/TransportTermlistAction.java#L126

The plugin aggregates the termlists from each shard.

--
Ivan

--

Hey Kevin,

I probably did not understand your use case. Why doesn't the standard terms
facet meet your need?
There is an "other" field which is part of the response:

"facets" : {
  "f1" : {
    "_type" : "terms",
    "missing" : 0,
    "total" : 3,
    "other" : 0,
    "terms" : [ {
      "term" : "b",
      "count" : 2
    }, {
      "term" : "c",
      "count" : 1
    } ]
  }
}

This field means that there are other terms than the top 10
(https://github.com/elasticsearch/elasticsearch/issues/1029).
Isn't that what you are after?

David.

--
David Pilato
http://www.scrutmydocs.org/
http://dev.david.pilato.fr/
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

--

Hi David,

No, unfortunately, I'm after something slightly different. My terminology
hasn't been so good, so let me try to state it again.

For a given terms facet request, you limit the number of results with
"size" (default: 10 terms).

The response contains:
{
  "terms":   <- up to "size" "term"/"count" pairs
  "total":   <- the number of documents matched by the facet
  "missing": <- the number of documents not having the property the facet wants
  "other":   <- "total" minus the number of documents listed in "terms"
}
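
To make the arithmetic concrete, here is a small Python check against the sample response David posted, assuming the field meanings listed above are right.

```python
# Sample facet taken from the response quoted earlier in this thread.
facet = {
    "_type": "terms",
    "missing": 0,
    "total": 3,
    "other": 0,
    "terms": [{"term": "b", "count": 2}, {"term": "c", "count": 1}],
}


def expected_other(facet):
    """Return "total" minus the counts of the terms that were returned."""
    return facet["total"] - sum(t["count"] for t in facet["terms"])


# Compare the computed value with the "other" field in the sample.
print(expected_other(facet) == facet["other"])
```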

I'm interested in terms, not documents, for my particular problem. In
particular, you can get the number of terms returned (terms.length), but you
can't get the number of terms not returned. That is the value I'm
interested in: "the number of other terms not returned". The only way
currently is to set size = HUGENUMBER, request them all, and then count them.
I'm not interested in their names and/or counts, or in sending them all over
the network, simply in how many more of them there are during the initial
query.
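
A sketch of that workaround in Python, operating on an already-parsed response so that no client library is assumed. The facet name `f1` and the helper name are made up for illustration.

```python
def count_hidden_terms(full_response, shown, facet_name="f1"):
    """Count the terms beyond the first `shown` in a response that was
    requested with a very large "size".

    full_response is the parsed JSON of the search response; facet_name is a
    placeholder for this sketch.
    """
    terms = full_response["facets"][facet_name]["terms"]
    return max(len(terms) - shown, 0)


# Hand-built response standing in for a real one with 34 terms.
response = {
    "facets": {
        "f1": {
            "_type": "terms",
            "terms": [{"term": "t%d" % i, "count": 1} for i in range(34)],
        }
    }
}
print(count_hidden_terms(response, shown=10))
```

Note that every term still crosses the network here, which is the overhead the question is trying to avoid.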

Does that make sense?

I'm looking through the code trying to figure out where to add that
feature. I'm most of the way there I think but haven't found quite the
right place where all the terms are assembled yet.

Thanks,
Kevin

--

Ok, here is a patch that adds an "extra" field that returns the number of
additional terms in the facet when the terms are strings. It's only been
lightly tested and doesn't work on the other data types yet, but it is a
start.

What do you think?

Thanks,
Kevin

--

Replying back to the group since there are others that know the internals
far better than I do.

The reduce() function is where the aggregation is happening. Depending on
the distribution of your terms, your method could return the correct number
on occasion. Hopefully someone else will respond.

--
Ivan

On Mon, Jan 21, 2013 at 8:34 AM, Kevin Fox kfox1111@gmail.com wrote:

I'm not sure. The code is a bit hard to follow. When I instrumented it, I
got one InternalStringTermsFacet instance per shard in
TermsStringOrdinalsFacetCollector, one of which had the right number. The
others were way smaller. I was thinking things probably got aggregated into
that one, so the max just picks that entry. I could be totally wrong
though. Do you know if/where such an aggregation might happen?

Thanks,
Kevin

On Jan 18, 2013 9:39 AM, "Ivan Brusic" ivan@brusic.com wrote:

I haven't stepped through the code, but isn't the handling of extra
during the reduce phase a bit optimistic?

extra = Math.max(extra, mFacet.extraCount());

If each shard returns n extra facets and none of them overlap across shards,
then the extra count would be n*number_of_shards, not n. Without access to
the actual facet terms, there is no way to detect duplication (or the lack
of it). That is why the aggregated facets are held in a hash map. Perhaps
I'm wrong.
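
A toy illustration of the concern in Python, with made-up per-shard term sets. It is not the Elasticsearch reduce code, just a model of why neither the maximum nor the sum of the per-shard counts is generally the true number of distinct extra terms.

```python
# Hypothetical terms left out of each shard's top-N (made-up data).
shard_extras = [
    {"apple", "pear", "plum"},
    {"pear", "fig", "kiwi"},
    {"plum", "lime", "date"},
]

per_shard_counts = [len(s) for s in shard_extras]

max_estimate = max(per_shard_counts)          # what a Math.max-style merge yields
sum_estimate = sum(per_shard_counts)          # only right if nothing overlaps
true_count = len(set().union(*shard_extras))  # needs the actual terms

print(max_estimate, sum_estimate, true_count)  # 3 9 7
```

The true count always lies between the two estimates, and only merging the actual terms pins it down.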

--
Ivan

--