How to improve performance of facet queries?

Hi,

I'm trying to improve the performance of my facet queries. My typical facet
query looks a bit more complicated (two facets with regex and a more
complex query), but I have reduced it to a very simple example:

{
  "query": {
    "match": {
      "_tokens._all._text.ngram": "kat"
    }
  },
  "facets": {
    "tokens": {
      "terms": {
        "field": "_tokens._all._facet"
      }
    }
  },
  "size": 0
}

The response is:

{
  "facets": {
    "tokens": {
      "_type": "terms",
      "missing": 0,
      "other": 7321391,
      "terms": [ ... ],
      "total": 7663578
    }
  },
  ...
}

As you can see, there are a lot of documents. My index is 23 GiB in size.

This query takes ~200ms, but it takes < 5ms without the facet. The
question is: how can I improve its performance? It should be around 10ms...

(1) I am wondering whether rewriting my query could improve it.
"_tokens._all._facet" is a non-analyzed string field, whereas
"_tokens._all._text.ngram" is ICU tokenized, ICU folded and ngram analyzed.
There are several values for each document. Is there anything wrong there
that I should consider?
(2) I don't know whether it is possible to somehow "index" or "cache" the
values of "_tokens._all._facet". It is the field that is used by the facets
all the time, so it gets constantly accessed.
(3) If I use a cluster, could a high number of nodes and shards improve my
performance? Would Elasticsearch perform the faceting in parallel on the
data nodes?
(4) Finally, can you give me advice on how to test whether I am having
problems with data access (like I/O blocks)?

Thanks in advance! :slight_smile:

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

I am curious, is this 200ms the first time the query runs or 200ms
every time? The first run should always be slower, and subsequent ones
should be fast once the index is warmed. If you're trying to optimize the
warming case, I recommend using index warmers.
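
For anyone who wants to try the warmer suggestion, here is a minimal sketch of what registering one might look like. It only builds the request and does not send it. The `PUT /{index}/_warmer/{name}` path is written from memory of the warmer API of that era, so treat it as an assumption and check it against the documentation for your version. The index name `my_index` and the warmer name `facet_warmer` are made up for illustration.

```python
import json


def build_warmer_request(index, warmer_name, facet_field):
    """Build (method, path, body) for a warmer that touches the facet field.

    The endpoint layout is an assumption based on the 0.90-era warmer API;
    nothing is sent over the network here.
    """
    path = "/{}/_warmer/{}".format(index, warmer_name)
    body = {
        # match_all, so the warmer does not depend on any particular user query
        "query": {"match_all": {}},
        # same terms facet as in the question, so its field data gets loaded
        "facets": {"tokens": {"terms": {"field": facet_field}}},
    }
    return "PUT", path, json.dumps(body)


# "my_index" and "facet_warmer" are hypothetical names.
method, path, body = build_warmer_request(
    "my_index", "facet_warmer", "_tokens._all._facet"
)
print(method, path)
print(body)
```

The printed method, path and body can then be sent with curl or any HTTP client. Because the query part is `match_all`, this would only help with loading the facet field after new segments appear, not with the cost of the varying queries themselves.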

Facets currently pull all the values for the field into memory (field
cache): http://www.elasticsearch.org/guide/reference/index-modules/fielddata/

I would make sure you have enough RAM allocated to elasticsearch to hold
all these in memory.

To run through your items:

  1. I don't think rewriting the query will help on the facet side.
  2. Yeah, this is how things work. Just ensure you are using the resident
    cache type (see link above) so that your facet values stay in memory.
  3. This could speed things up, since you would be running across multiple
    nodes/shards, each with a smaller set of values in memory. Just watch out
    for incorrect facet counts ("terms facet gives wrong count with
    n_shards > 1", https://github.com/elasticsearch/elasticsearch/issues/1305).
  4. I would ensure that your machine isn't swapping and especially isn't
    swapping Elasticsearch (check out details on mlockall here:
    http://www.elasticsearch.org/guide/reference/setup/installation/). If
    you're good there, you can check out iostat to ensure disk performance is
    as expected.
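
To make the warning in item 3 concrete, here is a toy sketch of how counts can drift once an index is split into shards. It assumes, as a simplification and not as a description of Elasticsearch's actual code, that each shard reports only its own top N terms and that the coordinating node adds up whatever it receives. The shard contents and the function names are invented for the example.

```python
from collections import Counter


def shard_top_n(shard_docs, n):
    """Top-n (term, count) pairs as seen by a single shard."""
    counts = Counter()
    for terms in shard_docs:
        counts.update(terms)
    return counts.most_common(n)


def merged_top_n(shards, n):
    """Merge the per-shard top-n lists; terms a shard cut off are lost."""
    merged = Counter()
    for shard in shards:
        for term, count in shard_top_n(shard, n):
            merged[term] += count
    return merged.most_common(n)


# Made-up data: each inner list holds the facet values of one document.
shard_a = [["a"]] * 5 + [["b"]] * 4 + [["c"]] * 3
shard_b = [["c"]] * 5 + [["d"]] * 4 + [["a"]] * 3

approx = dict(merged_top_n([shard_a, shard_b], 2))
exact = dict(shard_top_n(shard_a + shard_b, 2))
print("merged from shards:", approx)
print("single shard:      ", exact)
```

In this data the term "a" is third on the second shard, so its 3 occurrences there never reach the merge step and the merged count is lower than the true one. Asking each shard for more terms than are finally needed narrows the gap in this toy model.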

Best Regards,
Paul

Hi, Paul,

Thanks for your response :slight_smile:

The time is an average of 5 requests after the first one, so it would be
the warmed case, but I don't know whether warmers could help me here: in
the normal scenario, the queries vary a lot, though it is true that I am
always faceting on the same field. I also thought that my problem had to do
with the RAM allocated... but then I ran bigdesk and saw that memory usage
never goes above about 60% of the amount I've reserved (currently 6 GiB).
I will try to tweak the field cache.

Thank you very much for the tips about (3) and (4). Maybe there's something
going wrong there!

Best Regards,
Guillermo

So, I checked a few things, but I am still lost...

Now I only have one node. I can see with bigdesk that the heap memory never
goes over 5 of the available 6 GiB.
"_nodes/stats/indices/fielddata/_tokens._all._facet" tells me that I have a
fielddata cache of about 500 MiB for the field I'm using in my facet.
Still, when I run the tests, I can see Elasticsearch performing I/O
operations (I used iotop for this). So, what could I be doing wrong? By the
way, just to be sure, I took the facets out of the query and it takes a
few milliseconds, so it is really the faceting that's taking so long.
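
One thing that could be ruled out at this point is swapping, which was raised earlier in the thread. Below is a minimal sketch, assuming a Linux host where `/proc/meminfo` exposes `SwapTotal` and `SwapFree` in kB. The sample text is made up so that the snippet runs anywhere; on a real machine you would pass in the contents of `/proc/meminfo` instead.

```python
def swap_used_kib(meminfo_text):
    """Return the amount of swap in use, from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            # the first token after the colon is the numeric value
            fields[key.strip()] = int(parts[0])
    return fields["SwapTotal"] - fields["SwapFree"]


# Made-up sample; on Linux use: open("/proc/meminfo").read()
sample = """MemTotal:       16384000 kB
SwapTotal:       2097148 kB
SwapFree:        2097148 kB
"""
print("swap in use (kB):", swap_used_kib(sample))
```

A value that stays at zero while the test queries run suggests the I/O seen in iotop comes from somewhere else, for example from reading index files that are not in the operating system's page cache.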

Can someone help me? Paul? :slight_smile:

Hey,
I played around with faceting and was able to get very similar results to
yours. Regex vs. normal terms facet didn't seem to make a big difference. I
didn't do any formal testing, but after running a few different types of
facet queries I came to the conclusion that facet run time is basically:
num result docs * num distinct terms for field

So, if you have a specific query or a low-cardinality field, things should
be fast. Outside of that, I think you're looking at distributing this
across more shards/CPUs to speed it up further.

Best Regards,
Paul

Hi, Paul,

I was doing the same kind of tests, but I found something very interesting:
Even when I set my query to be limited to 0 ( "query": { "filtered": {
"filter": { "limit": { "value": 0 } }, "query": {...} } } ), it still takes
over 200ms! But if I take out the facets, it takes just 2ms. I am going to
test in a cluster soon. I hope I'll be able to find out what's happening.

Thanks again.
Guillermo

Hi,

I've started a second node on the same server and the performance goes
up... and I think I've discovered why. Elasticsearch is aware that my
server has 8 CPUs, so it reserves a search thread pool of 8 threads. When
I launch the search, only one of these threads is working. But if I start
two nodes, two threads are working. That results in better usage of the
server's resources. So, having more CPUs helps with handling more requests,
but it doesn't increase the performance of a single query. Am I right?
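
The observation above can be modelled with a small sketch. It assumes, as a simplification of what is described in this post and not as a statement about Elasticsearch internals, that a query is fanned out as one task per shard and that each task runs on a single thread of the pool. The documents, the shard split and the function names are invented.

```python
import threading
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def facet_shard(shard_docs):
    """Count facet terms for one shard; runs entirely on one worker thread."""
    counts = Counter()
    for terms in shard_docs:
        counts.update(terms)
    return threading.current_thread().name, counts


def search(shards, pool_size=8):
    """Fan one query out to every shard, then merge the per-shard counts."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        results = list(pool.map(facet_shard, shards))
    threads_used = {name for name, _ in results}
    merged = Counter()
    for _, counts in results:
        merged.update(counts)
    return merged, threads_used


# Made-up documents; each inner list holds the facet values of one document.
docs = [["kat", "katze"], ["kat"], ["kater", "kat"], ["katze"]]
one_shard = [docs]
two_shards = [docs[:2], docs[2:]]

merged_one, threads_one = search(one_shard)
merged_two, threads_two = search(two_shards)
print("threads used with 1 shard:", len(threads_one))
print("threads used with 2 shards:", len(threads_two))
print("same counts:", merged_one == merged_two)
```

In this model the pool size of 8 never matters for a single query: the number of busy threads is bounded by the number of shards the query touches. The sketch shows only the structure of the fan-out, since Python threads do not speed up CPU-bound counting.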
