Facet with multiple counts

Hi there.

I am trying to use ES to produce a table of several summed fields, grouped
by some other field.

Imagine I have a table of users. Each user has a country field and two
boolean fields, 'likejazz' and 'likerock' (for ease I store these as 1 or 0
to make the following query faster).

Say I want to produce a table with one row per country and columns for the
number of users, the number of users who likejazz and the number who
likerock, ordered by the first column descending. In SQL I would write this
as:

SELECT country, COUNT(*) AS nusers, SUM(likejazz) AS likejazz,
SUM(likerock) AS likerock
FROM users
GROUP BY country
ORDER BY nusers DESC

Is this possible in ES?

As far as I can see I can only calculate one column at a time (total,
likejazz or likerock) and would have to merge them in code. However, this
is a problem if the grouping is performed across a dimension with lots of
terms because, in order to merge accurately, I need results for ALL terms.

As soon as I introduce a limit to any of the column facet calculations, the
merged table may be wrong. Imagine I retrieve the top 10 in each sub facet
search: the terms ranked in the top 10 for the first column may not be the
terms returned for the other columns.
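
To make the mismatch concrete, here is a minimal Python sketch. The sample
data and the top_n helper are invented for illustration; it only mimics
size-limited facets in memory and does not talk to Elasticsearch.

```python
from collections import Counter

# Hypothetical sample data: one (country, likejazz, likerock) tuple per user.
users = (
    [("UK", 1, 1)] + [("UK", 0, 1)] * 3 + [("UK", 0, 0)]
    + [("US", 1, 1)] + [("US", 0, 0)] * 3
    + [("FR", 1, 0)] * 3
    + [("DE", 1, 1)] * 2
)

nusers, likejazz, likerock = Counter(), Counter(), Counter()
for country, jazz, rock in users:
    nusers[country] += 1
    likejazz[country] += jazz
    likerock[country] += rock


def top_n(counter, n):
    """Keep only the n largest entries, like a facet with a size limit."""
    return dict(counter.most_common(n))


# Each column is limited to its own top 2, independently of the others.
top_users = top_n(nusers, 2)
top_jazz = top_n(likejazz, 2)
top_rock = top_n(likerock, 2)

# Merge on the countries of the first column; None marks a missing cell.
merged = {c: (top_users[c], top_jazz.get(c), top_rock.get(c)) for c in top_users}

# What the SQL query would have returned for the same two countries.
exact = {c: (nusers[c], likejazz[c], likerock[c]) for c in top_users}

print(merged)
print(exact)
```

The two countries with the most users are not in the top 2 for likejazz, so
their likejazz cells cannot be filled in from the limited results.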

Obviously in this case I could get all the countries and merge; however,
for collections with 1000s of terms this is inefficient.

Any pointers?

One approach may be to restrict the 2nd and 3rd facet searches to just the
terms returned from the first, but I'm not sure that's possible?

Cheers

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hi James,

Using the terms stats facet per field (likejazz, likerock) will get you
what you want for those fields:
http://www.elasticsearch.org/guide/reference/api/search/facets/terms-stats-facet/ .
The user count is just a terms facet on countries.

You'd need to request the facets to return the information for all
countries but I assume that shouldn't be a problem.
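
As a rough sketch of what this could look like, the request below is
written as a Python dict. It assumes the facets API of that era (a terms
facet plus one terms_stats facet per summed field, using key_field and
value_field, with size 0 meaning "all terms" for terms_stats). The facet
names, the size of 300 for the terms facet and the merge_facets helper are
made up for illustration, and the response shape the helper expects (a
"terms" list whose entries carry term, count and total) should be checked
against the terms stats facet documentation before relying on it.

```python
def terms_stats_facet(key_field, value_field, size=0):
    """Build one terms_stats facet body (size=0 is assumed to mean all terms)."""
    return {
        "terms_stats": {
            "key_field": key_field,
            "value_field": value_field,
            "size": size,
        }
    }


# Hypothetical request body: one facet per column of the desired table.
request_body = {
    "query": {"match_all": {}},
    "size": 0,  # no hits needed, only the facets
    "facets": {
        # 300 is an assumed upper bound on the number of distinct countries.
        "nusers": {"terms": {"field": "country", "size": 300}},
        "likejazz": terms_stats_facet("country", "likejazz"),
        "likerock": terms_stats_facet("country", "likerock"),
    },
}


def merge_facets(response):
    """Join the three facets into rows sorted by user count, descending."""
    facets = response["facets"]
    table = {
        entry["term"]: {"nusers": entry["count"], "likejazz": 0, "likerock": 0}
        for entry in facets["nusers"]["terms"]
    }
    for column in ("likejazz", "likerock"):
        for entry in facets[column]["terms"]:
            if entry["term"] in table:
                table[entry["term"]][column] = int(entry["total"])
    return sorted(table.items(), key=lambda kv: kv[1]["nusers"], reverse=True)
```

The merge is only exact because every facet is asked for all countries; with
a smaller size on any facet, some cells would be missing.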

Does this help?

Boaz

(In reply to Alastair James, Sunday, May 12, 2013 5:22:05 PM UTC+2.)

Hi there.

Thanks for the reply.

That would be fine for the above example; however, the real problem is if I
am grouping over a high-cardinality dimension, e.g. a facet with 1000s of
terms. In this case it's impractical to return all the terms and merge
manually.

Any other ideas?

Regards

Al

(In reply to Boaz Leskes, Monday, 13 May 2013 11:23:53 UTC+1.)

For high-cardinality fields your only "nice" option right now is to write a
plugin with a custom facet. Other than that, here are the less clean options
I see:

  1. Ask for the top 500 terms, aggregate yourself, use the top 10 (or other
    number of your choosing) and live with the potential inaccuracies.
    Elasticsearch does something similar with the terms facet anyway when
    aggregating results from multiple nodes.
  2. Make two round trips: one to get the top terms (countries in your
    example), then one to get the counts (limited to the terms you want).
  3. Based on some assumptions about the total number of users per term, use
    some arithmetic to join both the likejazz and likerock fields into a
    single value and sum that up. For example, if you have fewer than 100
    users per term you can sum 100*likejazz+likerock and then extract both
    results from the sum. In this approach the number of users is returned
    as the total count per term.
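
The arithmetic in option 3 can be sketched in a few lines of Python. This is
only an illustration of the packing idea: BASE, pack and unpack are
hypothetical names, and the sum is computed locally rather than by a facet.

```python
BASE = 100  # assumption: every term has fewer than BASE users


def pack(likejazz, likerock, base=BASE):
    """Combine both 0/1 flags into one number that can be summed per term."""
    return base * likejazz + likerock


def unpack(total, base=BASE):
    """Split a summed value back into (likejazz_sum, likerock_sum)."""
    return divmod(int(total), base)


# Hypothetical users of one country as (likejazz, likerock) pairs.
users = [(1, 0), (1, 1), (0, 1), (0, 0), (1, 1)]
total = sum(pack(jazz, rock) for jazz, rock in users)
jazz_sum, rock_sum = unpack(total)

# The decoding breaks once a term reaches BASE users who all like rock:
# the likerock sum spills over into the likejazz part.
overflow = unpack(sum(pack(0, 1) for _ in range(BASE)))
```

Here total is 303 and decodes to 3 and 3, while the overflow case decodes to
(1, 0) instead of (0, 100). Each extra field multiplies the required range
by BASE again, and if the summed value comes back as a double (an assumption
here) it is only exact up to 2**53.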

Does this make more sense?

Boaz

(In reply to Alastair James, Mon, May 13, 2013 at 12:36 PM.)

Hi. Thanks again.

Yes, it does make more sense.

If I create a custom facet can I return multiple values? Is this likely to
be fast? Are there any tutorials / pointers on this?

> 1. Ask for the top 500 terms, aggregate yourself, use the top 10 (or other
>    number of your choosing) and live with the potential inaccuracies.
>    Elasticsearch does something similar with the terms facet anyway when
>    aggregating results from multiple nodes.

We can't really live with inaccuracies. So ES does this anyway? Are there any
details of this limitation and the potential inaccuracy? Do you know exactly
how many terms it merges from each node? I really feel this should be
written in big flashing bold letters on the facet documentation pages!

> 2. Make two round trips: one to get the top terms (countries in your
>    example), then one to get the counts (limited to the terms you want).

How could I limit to the terms I want? I know there is an 'exclude' field
for terms. Is there an 'include' one?

> 3. Based on some assumptions about the total number of users per term, use
>    some arithmetic to join both the likejazz and likerock fields into a
>    single value and sum that up. For example, if you have fewer than 100
>    users per term you can sum 100*likejazz+likerock and then extract both
>    results from the sum. In this approach the number of users is returned
>    as the total count per term.

Interesting.... Will think this through. Not sure there is enough numeric
range on the return value to cope with the fields (there may be more than
two).

Cheers again.

(In reply to Boaz Leskes, 13 May 2013 11:52.)

--
Dr Alastair James
CTO Ometria.com
Skype: al.james

Hi,

See my answers below.

Boaz

On Mon, May 13, 2013 at 1:28 PM, Alastair James al.james@gmail.com wrote:

> Hi. Thanks again.
>
> Yes, it does make more sense.
>
> If I create a custom facet can I return multiple values? Is this likely to
> be fast? Are there any tutorials / pointers on this?

It will take quite a bit of coding but it's doable. If you write your own
facet you basically get to decide what to analyze and what JSON to return.
If done right it will be fast.

>> 1. Ask for the top 500 terms, aggregate yourself, use the top 10 (or other
>>    number of your choosing) and live with the potential inaccuracies.
>>    Elasticsearch does something similar with the terms facet anyway when
>>    aggregating results from multiple nodes.
>
> We can't really live with inaccuracies. So ES does this anyway? Are there
> any details of this limitation and the potential inaccuracy? Do you know
> exactly how many terms it merges from each node? I really feel this
> should be written in big flashing bold letters on the
> facet documentation pages!

These discrepancies are inherent to the distributed nature of the problem.
It's the same one you ran into yourself. Think about the following: if you
have hundreds of gigabytes distributed over tens of machines, bringing all
the data together will be really inefficient. Just moving a complete analysis
over the network will take a long time. Elasticsearch asks every node to do
the analysis locally and return its top terms to the caller node. The
caller node then aggregates all the reports, adds things up and returns the
result.

>> 2. Make two round trips: one to get the top terms (countries in your
>>    example), then one to get the counts (limited to the terms you want).
>
> How could I limit to the terms I want? I know there is an 'exclude' field
> for terms. Is there an 'include' one?

The terms stats facet doesn't support an exclude/include setup. I meant
changing the query so that it filters to just the countries (in the example)
you need. This of course comes with a price.
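
A minimal sketch of the two round trips, with the request bodies written as
Python dicts. It assumes the query DSL of that era (a filtered query with a
terms filter) and a country field that is indexed as a single, not analyzed
value; all function and facet names are hypothetical, and sending the
requests is left out.

```python
def top_terms_request(field, size):
    """Round trip 1: a terms facet returning the top `size` terms by count."""
    return {
        "query": {"match_all": {}},
        "size": 0,  # only the facet is needed, not the hits
        "facets": {"top": {"terms": {"field": field, "size": size}}},
    }


def terms_from_response(response):
    """Pull the term values out of the (assumed) terms facet response."""
    return [entry["term"] for entry in response["facets"]["top"]["terms"]]


def stats_request(field, terms, value_fields):
    """Round trip 2: filter the query to `terms`, then sum each value field."""
    terms = list(terms)
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {"terms": {field: terms}},
            }
        },
        "size": 0,
        "facets": {
            name: {
                "terms_stats": {
                    "key_field": field,
                    "value_field": name,
                    "size": len(terms),
                }
            }
            for name in value_fields
        },
    }
```

Because the second query only matches documents from the chosen countries,
each stats facet sees at most len(terms) distinct terms. The price is the
extra request, and the first round trip is still a top-N facet with the
accuracy caveats discussed above.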

Hi there. Thanks again.

Are there any examples for custom facets?

I quite understand why only the top X terms are sent from each node
and combined; it would just be really nice to know what X is. At the moment
it's basically saying "at some point, your facets are going to
become inaccurate, but you simply don't know when or if this ever occurs".
It would be nice to be able to get a feel for where this point is.

Cheers

(In reply to Boaz Leskes, 13 May 2013 13:49.)

Hi,

There are a couple of plugins around that add a facet and, of course, there
is the code of Elasticsearch itself. A quick search brings up the
following GitHub repositories: imotov/elasticsearch-facet-script (fully
scriptable facets), crate/elasticsearch-timefacets-plugin (time-based
facets) and bleskes/elasticfacets (a set of facets and related tools). Not
all of them are up to date with the latest version of Elasticsearch (mine,
for example, is still at 0.20.x), which also shows you the downside of this
approach. The faceting API is still under heavy development and changes
quite a bit (which is why the classes all have "internal" in their name :) ).

About the number of entries generated per shard: I've looked at the code
and it is currently the size you actually asked for. So if you set your size
to 100 in the request, it will retrieve 100 entries per shard and merge them
into a 100-entry output. I think it makes sense to make this configurable in
the future, but at least for now that's not possible.
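
The effect of "N entries per shard, merged into N entries" can be simulated
in plain Python. This is only a toy model of the behaviour described above:
the shard contents are invented and facet_top is a hypothetical stand-in for
what the cluster does, not Elasticsearch code.

```python
from collections import Counter


def facet_top(shards, size):
    """Take the top `size` terms from every shard, add them up, return the top `size`."""
    merged = Counter()
    for shard in shards:
        for term, count in Counter(shard).most_common(size):
            merged[term] += count
    return merged.most_common(size)


# Hypothetical per-shard counts. FR is the overall winner (8 + 8 = 16)
# but is not the top term on either shard.
shards = [
    {"UK": 11, "US": 9, "FR": 8},
    {"DE": 10, "IT": 9, "FR": 8},
]

# Asking for exactly one term misses FR entirely.
naive = facet_top(shards, size=1)

# Asking for more terms than needed and truncating on the client
# lets FR surface from both shards before the cut.
overfetched = facet_top(shards, size=3)[:1]

print(naive)
print(overfetched)
```

With size=1 the result is UK with 11, while over-requesting and truncating
gives FR with 16. How large the requested size has to be depends on how
unevenly the terms are spread across shards, so this reduces the error
rather than guaranteeing exact results.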

Cheers,
Boaz

(In reply to Alastair James, Mon, May 13, 2013 at 3:35 PM.)

Ok, thanks.

Good to know the number. I will see if requesting 10x the actual number
required helps.

Cheers
Al

(In reply to Boaz Leskes, 14 May 2013 08:18.)