Elasticsearch as an OLAP cube


(Julien Naour) #1

Hello everybody,

I have several questions.

We have a project with two axes: BI (statistics on our aggregated data,
visualized as graphs with filters and interactions between the graphs) and
search. At first we chose Elasticsearch just for the search. The
easy way to do the BI part is an OLAP cube, but we have some scalability
needs, and that led us to consider Elasticsearch as a pseudo OLAP cube using
facets and filters.
Question 1: Is it a good idea? Or at least not too bad an idea?
The fact that filters can be cached independently when using a bool filter is
an advantage for our problem. We know that facets can be pretty
processing-intensive and that this could lead to performance issues. We
expect that scaling out would mitigate them.

We have about 100,000,000 data records. The data will be aggregated.
There will be a screen with about 10 graphs on it; each graph needs its own
data, but the graphs are close in terms of filters. We are not sure what the
best practice is: independent data models with one independent query per
graph, or one big query on a single type for all graphs. From a development
point of view, it could be better to keep the graphs and their data as
independent as possible, and we could benefit from asynchronous responses to
show graphs as they become ready; but from a performance point of view, a big
query on a big type could be better.
Question 2: What is the best practice: one big query on a big type? One
query per graph on a big type? One query and one type per graph?
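
The first two options in Question 2 can be sketched as follows. This is only an illustration of how the request bodies would differ, assuming the same filter and facet syntax as the query further down in this post; the helper functions and the graph names `by_value` and `by_filter2` are hypothetical.

```python
import json

def shared_filter(filter1, filter2):
    # The bool filter that every graph on the screen has in common.
    return {"bool": {"must": [{"term": {"filter1": filter1}},
                              {"term": {"filter2": filter2}}]}}

def terms_stats_facet(key_field, value_field="count"):
    # One terms stats facet, keyed on key_field and summing value_field.
    return {"terms_stats": {"key_field": key_field, "value_field": value_field}}

def build_request(filter_clause, facets):
    # A filtered query plus any number of named facets.
    return {"query": {"filtered": {"filter": filter_clause}}, "facets": facets}

# Hypothetical graphs: facet name -> field used as the key.
graphs = {"by_value": "value", "by_filter2": "filter2"}
flt = shared_filter("A", "B")

# Option A: one request per graph, which can be sent and rendered independently.
per_graph = {name: build_request(flt, {name: terms_stats_facet(field)})
             for name, field in graphs.items()}

# Option B: one request carrying every facet at once.
combined = build_request(flt, {name: terms_stats_facet(field)
                               for name, field in graphs.items()})

print(json.dumps(combined, indent=2))
```

In both options the filter part is identical, so the choice is mostly about how many round trips are made and whether graphs may be drawn as their responses arrive.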

For the BI part I want to minimize the number of documents by aggregating as
much as possible. In this context I have a possible way to reduce the number
of documents per index/type using arrays. Imagine that I have a type with this
data in it:
{
  "filter1" : "A",
  "filter2" : "B",
  "value" : "value1",
  "count" : 1
}
{
  "filter1" : "A",
  "filter2" : "B",
  "value" : "value2",
  "count" : 4
}
Question 3: Would it be better to store the following data instead?
{
  "filter1" : "A",
  "filter2" : "B",
  "data" : [{
      "value" : "value1",
      "count" : 1
    }, {
      "value" : "value2",
      "count" : 4
    }
  ]
}

For the moment, in the first case, I use a terms stats facet to get the
aggregated values, like the following:
{
  "query" : {
    "filtered" : {
      "filter" : {
        "bool" : {
          "must" : [{
              "term" : {
                "filter1" : "A"
              }
            }, {
              "term" : {
                "filter2" : "B"
              }
            }
          ]
        }
      }
    }
  },
  "facets" : {
    "value" : {
      "terms_stats" : {
        "key_field" : "value",
        "value_field" : "count"
      }
    }
  }
}
The terms stats facet returns many things that I don't need, like mean, min,
max, etc.
Question 4: Is there a lighter way to obtain my aggregated value, for example
a terms facet with key_field and value_field?

Thank you in advance

Julien Naour

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Adrien Grand) #2

Hi Julien,

  1. Elasticsearch is indeed a good fit for analytics applications and many
    users use elasticsearch for that purpose.
  2. There is no best practice, just do whatever looks easier to you.
  3. When indexing data, Elasticsearch completely forgets about the JSON
    structure, so it won't be able to know whether count:4 is associated with
    value1 or value2, unless you use nested objects [1] (but if you are fine
    with 2 separate documents, this will be easier to set up!)
  4. If you don't need the statistics of the terms stats facet, you can just
    use the terms facet [2].

[1] http://www.elasticsearch.org/guide/reference/mapping/nested-type/
[2] http://www.elasticsearch.org/guide/reference/api/search/facets/terms-facet/
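
To make point 3 more concrete, here is a rough Python sketch of what happens to an array of plain (non-nested) objects at index time. It is a simplified model, not Elasticsearch code, and `flatten` is a hypothetical helper: each leaf value ends up under its dotted field name, and the information about which object it came from is gone.

```python
from collections import defaultdict

def flatten(doc):
    """Collect every leaf value under its dotted field path, ignoring array positions."""
    fields = defaultdict(list)

    def walk(node, path):
        if isinstance(node, dict):
            for key, child in node.items():
                walk(child, f"{path}.{key}" if path else key)
        elif isinstance(node, list):
            # Array items all share the same path, so their grouping is lost.
            for item in node:
                walk(item, path)
        else:
            fields[path].append(node)

    walk(doc, "")
    return dict(fields)

doc = {
    "filter1": "A",
    "filter2": "B",
    "data": [
        {"value": "value1", "count": 1},
        {"value": "value2", "count": 4},
    ],
}

flat = flatten(doc)
print(flat)
```

In the flattened form, `data.value` and `data.count` are two independent multi-valued fields, so nothing says that 4 belongs to value2 rather than value1. Nested objects, or one document per value, keep that pairing.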

--
Adrien Grand


(Julien Naour) #3

Thanks for the reply Adrien,

  1. OK, so my choice is good, even with heavy use of facets.
  2. OK.
  3. As I understand it, nested documents could decrease performance.
  4. My problem is that you can't add a value field to the terms facet as you
    can in a terms stats facet. And my problem with the terms stats facet is
    that it returns too many fields. The only fields that I need from it are
    "term" and "total". Computing the other fields uses resources for
    nothing.

Julien


(Adrien Grand) #4

Hi,

On Mon, Sep 2, 2013 at 1:04 PM, Julien Naour julnaour@gmail.com wrote:

  3. As I understand it, nested documents could decrease performance.

Indeed, they would add some overhead.

  4. My problem is that you can't add a value field to the terms facet as you
    can in a terms stats facet. And my problem with the terms stats facet is
    that it returns too many fields. The only fields that I need from it are
    "term" and "total". Computing the other fields uses resources for
    nothing.

I wouldn't worry about it. Computing these numbers is very fast and you
probably wouldn't see any performance improvement by disabling them.
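
As a rough illustration of why the extra numbers are cheap, here is a small Python sketch that computes per-term statistics in a single pass over the two example documents from the first post. It is not how Elasticsearch implements the facet; `terms_stats` is a hypothetical stand-in that only shows min, max and mean riding along with the total at the cost of a few extra operations per document.

```python
def terms_stats(docs, key_field, value_field):
    """Group docs by key_field and accumulate stats on value_field in one pass."""
    stats = {}
    for doc in docs:
        key, value = doc[key_field], doc[value_field]
        entry = stats.setdefault(
            key, {"count": 0, "total": 0, "min": value, "max": value}
        )
        entry["count"] += 1
        entry["total"] += value
        entry["min"] = min(entry["min"], value)
        entry["max"] = max(entry["max"], value)
    # The mean needs no second pass over the documents.
    for entry in stats.values():
        entry["mean"] = entry["total"] / entry["count"]
    return stats

docs = [
    {"filter1": "A", "filter2": "B", "value": "value1", "count": 1},
    {"filter1": "A", "filter2": "B", "value": "value2", "count": 4},
]

result = terms_stats(docs, "value", "count")
# Keep only what the client actually needs: term -> total.
totals = {term: entry["total"] for term, entry in result.items()}
print(totals)
```

The unwanted fields can simply be dropped on the client side, as the last step does.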

--
Adrien Grand


(Julien Naour) #5

Thanks again Adrien,

Julien

