We have a project with two axes: BI (some statistics about our aggregated data, visualized as graphs with filters and interactivity between the graphs) and search. At the beginning we chose Elasticsearch just for the search. The easy way to do the BI would be an OLAP cube, but we have some scalability needs, and that led us to consider Elasticsearch as a pseudo OLAP cube using facets and filters.
Question 1: Is it a good idea? Or at least not such a bad idea?
The fact that filters can be cached independently inside a bool filter is an advantage for our problem. We know that facets can be pretty processing-intensive, which could lead to performance issues, but we expect that scaling out will minimize them.
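A sketch of what I mean, on the hypothetical filter1/filter2 fields used in the examples below (as I understand it, each term filter is cached as an independent bitset, and the _cache flag makes that explicit):

```json
{
    "filtered" : {
        "query" : { "match_all" : {} },
        "filter" : {
            "bool" : {
                "must" : [
                    { "term" : { "filter1" : "A", "_cache" : true } },
                    { "term" : { "filter2" : "B", "_cache" : true } }
                ]
            }
        }
    }
}
```

Since every graph on the screen shares most of these filters, the cached bitsets should be reused from one query to the next.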
We have about 100,000,000 instances of data. The data will be aggregated.
There will be a screen with almost 10 graphs on it; each graph needs its own data, but the graphs are close in terms of filters. We are not sure what the best practice is: independent data modeling with one independent query per graph, or one big query on a unique type for all graphs. From a development point of view it could be better to keep the graphs and their data as independent as possible, so we could benefit from asynchronous responses and show each graph as soon as it is ready; but from a performance point of view, one big query on a big type could be better.
Question 2: What is the best practice: one big query on a big type? One query per graph on a big type? One query and one type per graph?
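If we went with one query per graph, I imagine the multi-search API could batch them into a single round trip, with the shared cached filters reused across the bodies. A rough sketch of the request body sent to _msearch (the index name stats and facet names are hypothetical; one header/body pair per graph):

```json
{"index" : "stats"}
{"query" : {"filtered" : {"filter" : {"term" : {"filter1" : "A"}}}}, "facets" : {"graph1" : {"terms_stats" : {"key_field" : "value", "value_field" : "count"}}}}
{"index" : "stats"}
{"query" : {"filtered" : {"filter" : {"term" : {"filter2" : "B"}}}}, "facets" : {"graph2" : {"terms_stats" : {"key_field" : "value", "value_field" : "count"}}}}
```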
To do the BI I want to minimize the amount of data by aggregating as much as possible. In this context I have a solution to minimize the number of documents per index/type by using arrays. Imagine that I have a type with this data in it:
{
    "filter1" : "A",
    "filter2" : "B",
    "value" : "value1",
    "count" : 1
}
{
    "filter1" : "A",
    "filter2" : "B",
    "value" : "value2",
    "count" : 4
}
Question 3: Could it be more interesting to have the following data instead?
{
    "filter1" : "A",
    "filter2" : "B",
    "data" : [
        { "value" : "value1", "count" : 1 },
        { "value" : "value2", "count" : 4 }
    ]
}
For the moment, in the first case, I use a terms_stats facet to get the aggregated values, like the following:
{
    "query" : {
        "filtered" : {
            "filter" : {
                "bool" : {
                    "must" : [
                        { "term" : { "filter1" : "A" } },
                        { "term" : { "filter2" : "B" } }
                    ]
                }
            }
        }
    },
    "facets" : {
        "value" : {
            "terms_stats" : {
                "key_field" : "value",
                "value_field" : "count"
            }
        }
    }
}
The terms_stats facet gives back many things that I don't want, like mean, min, max, etc.
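For reference, each entry in the terms_stats response looks roughly like this (the numbers are illustrative), while I only ever read "term" and "total":

```json
{
    "term" : "value1",
    "count" : 12,
    "total_count" : 12,
    "min" : 1.0,
    "max" : 5.0,
    "total" : 36.0,
    "mean" : 3.0
}
```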
Question 4: Is there a lighter way to obtain my aggregated values, e.g. a terms facet with key_field and value_field?
Elasticsearch is indeed a good fit for analytics applications, and many users use it for that purpose.
There is no best practice; just do whatever looks easiest to you.
When indexing data, Elasticsearch flattens the JSON structure, so it won't be able to know whether count: 4 is associated with value1 or value2 unless you use nested objects [1] (but if you are fine with two separate documents, that will be easier to set up!).
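For example, a nested mapping for your array form could look roughly like this (type and field names taken from your example; the "not_analyzed" settings are an assumption, since you filter on exact values):

```json
{
    "stat" : {
        "properties" : {
            "filter1" : { "type" : "string", "index" : "not_analyzed" },
            "filter2" : { "type" : "string", "index" : "not_analyzed" },
            "data" : {
                "type" : "nested",
                "properties" : {
                    "value" : { "type" : "string", "index" : "not_analyzed" },
                    "count" : { "type" : "integer" }
                }
            }
        }
    }
}
```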
If you don't need the statistics of the terms_stats facet, you can just use the terms facet [2].
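For example, a plain terms facet on the same field would look like this (note that it only counts matching documents per term; it has no value_field):

```json
{
    "facets" : {
        "value" : {
            "terms" : { "field" : "value" }
        }
    }
}
```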
OK, my choice is good then, even with many facets in use.
OK.
As I understand it, nested documents could decrease performance.
My problem is that you can't add a value field to the terms facet as you can with a terms_stats facet. And my problem with the terms_stats facet is that it returns too much. The only fields I need from the terms_stats facet are "term" and "total"; the other fields lead to resource use that is useless for me.
Julien
On Friday, August 30, 2013 3:19:31 PM UTC+2, Julien Naour wrote:
As I understand it, nested documents could decrease performance.
Indeed, they would add some overhead.
My problem is that you can't add a value field to the terms facet as you can with a terms_stats facet. And my problem with the terms_stats facet is that it returns too much. The only fields I need from the terms_stats facet are "term" and "total"; the other fields lead to resource use that is useless for me.
I wouldn't worry about it. Computing these numbers is very fast, and you probably wouldn't see any performance improvement by disabling them.