We have a project with two axes: BI (some statistics about our aggregated data, visualized as graphs with filters and interactivity between the graphs) and search. At the beginning we chose Elasticsearch just for the search. The easy way to do the BI would be an OLAP cube, but we have some scalability needs, and that led us to consider Elasticsearch as a pseudo OLAP cube using facets and filters.
Question 1: Is it a good idea? Or at least not such a bad idea?
The fact that filters can be cached independently inside a bool filter is an advantage for our problem. We know that facets can be pretty processing-intensive, which could lead to performance issues, but we expect that scaling out will minimize them.
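A sketch of what I mean, on the hypothetical filter1/filter2 fields used in the examples below (as I understand it, each term filter is cached as an independent bitset, and the _cache flag makes that explicit):

```json
{
    "filtered" : {
        "query" : { "match_all" : {} },
        "filter" : {
            "bool" : {
                "must" : [
                    { "term" : { "filter1" : "A", "_cache" : true } },
                    { "term" : { "filter2" : "B", "_cache" : true } }
                ]
            }
        }
    }
}
```

Since every graph on the screen shares most of these filters, the cached bitsets should be reused from one query to the next.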
We have about 100,000,000 instances of data. The data will be aggregated.
There will be a screen with almost 10 graphs on it; each graph needs its own data, but the graphs are close in terms of filters. We are not sure what the best practice is: independent data modeling with one independent query per graph, or one big query on a unique type for all graphs. From a development point of view it could be better to keep the graphs and their data as independent as possible, so we could benefit from asynchronous responses and show each graph as soon as it is ready; but from a performance point of view, one big query on a big type could be better.
Question 2: What is the best practice: one big query on a big type? One query per graph on a big type? One query and one type per graph?
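If we went with one query per graph, I imagine the multi-search API could batch them into a single round trip, with the shared cached filters reused across the bodies. A rough sketch of the request body sent to _msearch (the index name stats and facet names are hypothetical; one header/body pair per graph):

```json
{"index" : "stats"}
{"query" : {"filtered" : {"filter" : {"term" : {"filter1" : "A"}}}}, "facets" : {"graph1" : {"terms_stats" : {"key_field" : "value", "value_field" : "count"}}}}
{"index" : "stats"}
{"query" : {"filtered" : {"filter" : {"term" : {"filter2" : "B"}}}}, "facets" : {"graph2" : {"terms_stats" : {"key_field" : "value", "value_field" : "count"}}}}
```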
To do the BI I want to minimize the amount of data by aggregating as much as possible. In this context I have a solution to minimize the number of documents per index/type by using arrays. Imagine that I have a type with this data in it:
{
    "filter1" : "A",
    "filter2" : "B",
    "value" : "value1",
    "count" : 1
}
{
    "filter1" : "A",
    "filter2" : "B",
    "value" : "value2",
    "count" : 4
}
Question 3: Could it be more interesting to have the following data instead?
{
    "filter1" : "A",
    "filter2" : "B",
    "data" : [
        { "value" : "value1", "count" : 1 },
        { "value" : "value2", "count" : 4 }
    ]
}
For the moment, in the first case, I use a terms_stats facet to get the aggregated values, like the following:
{
    "query" : {
        "filtered" : {
            "filter" : {
                "bool" : {
                    "must" : [
                        { "term" : { "filter1" : "A" } },
                        { "term" : { "filter2" : "B" } }
                    ]
                }
            }
        }
    },
    "facets" : {
        "value" : {
            "terms_stats" : {
                "key_field" : "value",
                "value_field" : "count"
            }
        }
    }
}
The terms_stats facet gives back many things that I don't want, like mean, min, max, etc.
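For reference, each entry in the terms_stats response looks roughly like this (the numbers are illustrative), while I only ever read "term" and "total":

```json
{
    "term" : "value1",
    "count" : 12,
    "total_count" : 12,
    "min" : 1.0,
    "max" : 5.0,
    "total" : 36.0,
    "mean" : 3.0
}
```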
Question 4: Is there a lighter way to obtain my aggregated values, e.g. a terms facet with key_field and value_field?
Elasticsearch is indeed a good fit for analytics applications, and many users use it for that purpose.
There is no best practice; just do whatever looks easiest to you.
When indexing data, Elasticsearch flattens the JSON structure, so it won't be able to know whether count: 4 is associated with value1 or value2 unless you use nested objects [1] (but if you are fine with two separate documents, that will be easier to set up!).
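For example, a nested mapping for your array form could look roughly like this (type and field names taken from your example; the "not_analyzed" settings are an assumption, since you filter on exact values):

```json
{
    "stat" : {
        "properties" : {
            "filter1" : { "type" : "string", "index" : "not_analyzed" },
            "filter2" : { "type" : "string", "index" : "not_analyzed" },
            "data" : {
                "type" : "nested",
                "properties" : {
                    "value" : { "type" : "string", "index" : "not_analyzed" },
                    "count" : { "type" : "integer" }
                }
            }
        }
    }
}
```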
If you don't need the statistics of the terms_stats facet, you can just use the terms facet [2].
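For example, a plain terms facet on the same field would look like this (note that it only counts matching documents per term; it has no value_field):

```json
{
    "facets" : {
        "value" : {
            "terms" : { "field" : "value" }
        }
    }
}
```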
OK, my choice is good then, even with many facets in use.
OK.
As I understand it, nested documents could decrease performance.
My problem is that you can't add a value field to the terms facet as you can with a terms_stats facet. And my problem with the terms_stats facet is that it returns too much. The only fields I need from the terms_stats facet are "term" and "total"; the other fields lead to resource use that is useless for me.
Julien
On Friday, August 30, 2013 3:19:31 PM UTC+2, Julien Naour wrote:
As I understand it, nested documents could decrease performance.
Indeed, they would add some overhead.
My problem is that you can't add a value field to the terms facet as you can with a terms_stats facet. And my problem with the terms_stats facet is that it returns too much. The only fields I need from the terms_stats facet are "term" and "total"; the other fields lead to resource use that is useless for me.
I wouldn't worry about it. Computing these numbers is very fast, and you probably wouldn't see any performance improvement by disabling them.