Statistical facet on multiple fields

Hi

I know some work has begun on this but thought I would post my
findings here. Our goal is the same as Zohar's - we want to generate
statistics on a numeric field, but have those statistics broken down
by the value of another field(s) in the document.

We have come up with a (suboptimal) solution for this by making two
requests to Elasticsearch.

The first has a single terms facet on the field we want the statistics
broken down by. This gives us all possible values for the field. The
second request is then built from the first: for each term we got back,
we add a statistical facet with a filter that matches only that term.

So if the terms facet returned 1000 terms, we'd then make a second
request with 1000 statistical facets, each with a different filter.

This works, but as I'm sure you can imagine it suffers horribly when
it comes to performance!
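
A minimal sketch of how the two request bodies described above could be built. It assumes the facet request format of that era (a `terms` facet, and a `statistical` facet narrowed with `facet_filter`); the field names `category` and `price` and the positional facet names are hypothetical, chosen only for illustration.

```python
import json


def build_terms_request(group_field, max_terms=1000):
    """First request: list the distinct values of the grouping field."""
    return {
        "size": 0,  # only the facet is needed, not the hits
        "query": {"match_all": {}},
        "facets": {
            "groups": {"terms": {"field": group_field, "size": max_terms}}
        },
    }


def build_stats_request(terms, group_field, value_field):
    """Second request: one filtered statistical facet per term."""
    facets = {}
    for i, term in enumerate(terms):
        # Positional facet names avoid problems with unusual term values.
        facets[f"stats_{i}"] = {
            "statistical": {"field": value_field},
            "facet_filter": {"term": {group_field: term}},
        }
    return {"size": 0, "query": {"match_all": {}}, "facets": facets}


if __name__ == "__main__":
    # Terms as they might come back from the first request (made-up values).
    body = build_stats_request(["books", "music"], "category", "price")
    print(json.dumps(body, indent=2))
```

The second body grows linearly with the number of terms, which is where the performance cost mentioned above comes from.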

Regards
Neil

On Dec 28 2010, 6:35 pm, Shay Banon shay.ba...@elasticsearch.com
wrote:

It makes sense, what you are after. The main challenge with facets is the
fact that they can get really interesting once you start to combine them (as
is the case in this thread with terms and stats). The problem is that those
facet implementations are highly optimized, for the simple reason that they
might end up running over 100s of millions of docs. Implementing all the
combinations in a generic fashion is certainly possible, but will incur a
performance overhead (in computation, and even more in serialization
over the network).

One of the things lined up for 0.15 is to do some refactoring in facets and
make them more pluggable. Once that's out of the way, people can write
their own facet implementations.

Of course, there should be a good out-of-the-box set of facets that comes
with ES. My current line of thought is that there will simply be a lot of
facet types, all heavily optimized. There will be a terms_stats, a
date_histogram, and others. I don't mind implementing all of those and having
them as part of ES. Hopefully the community will help with it (or at the
very least, help with coming up with good names for them :) ), so you will
get a really rich and heavily optimized set of facets.
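
To make the idea of a combined terms-plus-stats facet concrete, here is a plain Python sketch of the computation such a facet would perform: a single pass over the documents that keeps running statistics per term. This is only an illustration of the grouping logic, not how ES implements it; the function name and the `tag` and `price` fields are hypothetical.

```python
from collections import defaultdict


def terms_stats(docs, key_field, value_field):
    """Group docs by key_field and accumulate stats of value_field in one pass."""
    acc = defaultdict(
        lambda: {"count": 0, "total": 0.0, "min": float("inf"), "max": float("-inf")}
    )
    for doc in docs:
        # Skip documents that lack either field.
        if key_field not in doc or value_field not in doc:
            continue
        entry = acc[doc[key_field]]
        value = doc[value_field]
        entry["count"] += 1
        entry["total"] += value
        entry["min"] = min(entry["min"], value)
        entry["max"] = max(entry["max"], value)
    # Derive the mean once all values have been seen.
    for entry in acc.values():
        entry["mean"] = entry["total"] / entry["count"]
    return dict(acc)


if __name__ == "__main__":
    docs = [
        {"tag": "a", "price": 10},
        {"tag": "a", "price": 20},
        {"tag": "b", "price": 5},
    ]
    for term, stats in terms_stats(docs, "tag", "price").items():
        print(term, stats)
```

Compared with one filtered statistical facet per term, this visits each document once and needs only one small accumulator per distinct term.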

-shay.banon

On Tue, Dec 28, 2010 at 2:05 PM, harelba hare...@gmail.com wrote:

Hi,

I've been looking for a way to perform aggregations similar to the
ones talked about in this thread, grouping the data according to an
arbitrary set of fields (or better yet - an expression).

The ScriptHistogramFacet seemed like a good choice, allowing the key
to actually be a "key_script", and skipping the "bucketing" stage. I
thought that this would allow me to achieve this kind of aggregation,
but then I saw that ScriptHistogramFacetCollector.doCollect() relies
on the value returned from key_script being of type Number, even if
interval==0. I know that currently you're using LongLong maps, but if
it accepted other types as well (at least strings), that would be
really great.

Am I getting it wrong? Is there a good way to do that? Your help would
be much appreciated.

Thanks,
RL

btw, it would have been totally cool if the data collected by the
StatisticalFacet would be integrated into the HistogramFacet (and its
scripted brother). The StatisticalFacet is great, but often-times the
statistical data is required per some kind of "group", and not only on
some kind of filter over the whole data.

Agreed, it's high on my TODO list :)
On Wednesday, January 19, 2011 at 8:27 PM, Neil Mosafi wrote:

[quoted text trimmed: repeats Neil's message and the earlier replies shown above]

Hi,

I have faced the same problem with facets on multiple fields: I had to find the count of posts by each author in each domain. In relational database terms, what I wanted was a count() with a group by on author name and domain.

Luckily for me, both the author name field and the domain field were mapped with the keyword analyzer. I used a crude workaround, but it worked really well for me.

I created a third field, author-domain, indexed into it the data from both of the earlier fields joined by a predefined separator, and faceted on this new field. This new field is also mapped with the keyword analyzer. Of course I then had to parse each facet value back apart using the separator, but that overhead was nothing compared to the complexity Neil mentioned.

I am really looking forward to a facet on a combination of fields.
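
A minimal sketch of the combined-field workaround, assuming a `|` separator that never occurs in either value. The helper names are hypothetical; only the `author-domain` field name comes from the description above.

```python
SEPARATOR = "|"  # assumption: never appears in an author name or a domain


def combined_key(author, domain):
    """Join the two values into the single term that gets indexed and faceted."""
    if SEPARATOR in author or SEPARATOR in domain:
        raise ValueError("separator must not occur in the field values")
    return f"{author}{SEPARATOR}{domain}"


def add_combined_field(doc):
    """Return a copy of the document with the extra author-domain field."""
    enriched = dict(doc)
    enriched["author-domain"] = combined_key(doc["author"], doc["domain"])
    return enriched


def split_key(term):
    """Parse a facet term back into (author, domain)."""
    author, domain = term.split(SEPARATOR, 1)
    return author, domain


if __name__ == "__main__":
    doc = add_combined_field({"author": "sujoy", "domain": "example.com"})
    print(doc["author-domain"])
    print(split_key(doc["author-domain"]))
```

A terms facet on `author-domain` then returns one count per (author, domain) pair, and `split_key` recovers the two parts on the client.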

Thanks,
Sujoy.

Hi,

Is there a way to achieve this with the current aggregations in 1.0.0? We are trying to calculate a simple average across multiple fields in the document. We tried the workaround of giving the aggregations the same name, and while that worked on the command line through curl, it does not work in the JavaScript client, since it combines the 3 properties into one (it uses the last one).

Is there a way to give the avg aggregation an array of fields? Or maybe a way to achieve this through scripts?

Any direction would be great,

Thank you,
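
One possible direction, sketched under assumptions: give each field its own uniquely named `stats` aggregation (so no client has to deal with duplicate keys) and pool the returned `sum` and `count` values on the client. The field names and the sample response are made up, and the result is the mean of all values across the listed fields, which may or may not be the average wanted here.

```python
def build_stats_aggs(fields):
    """One uniquely named stats aggregation per field."""
    return {
        "size": 0,  # only the aggregations are needed
        "aggs": {
            f"stats_{i}": {"stats": {"field": field}}
            for i, field in enumerate(fields)
        },
    }


def combined_average(aggregations):
    """Pool sum and count over all stats aggregations in a response."""
    total = sum((agg.get("sum") or 0) for agg in aggregations.values())
    count = sum((agg.get("count") or 0) for agg in aggregations.values())
    return total / count if count else None


if __name__ == "__main__":
    print(build_stats_aggs(["score_a", "score_b", "score_c"]))
    # Hypothetical "aggregations" section of a search response.
    sample = {
        "stats_0": {"count": 2, "sum": 10.0},
        "stats_1": {"count": 3, "sum": 20.0},
    }
    print(combined_average(sample))
```

Using `stats` rather than `avg` matters because averaging the per-field averages would weight each field equally regardless of how many documents carry it.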