Performance killed when faceting on high cardinality fields

Hi,

We're doing some ES performance testing with a relatively small index. All
is peachy until we want to facet on a field that has relatively high
cardinality - in this case it's a "tags" field that, as you can imagine,
has a high number of distinct values across all documents in the index.
So when we include faceting on tags in our queries performance sinks from
over 400 QPS to 20-30 QPS. The average latency jumps from 40 ms to 500 ms.

Is there anything in ES that one can use to improve performance in such
cases?

In Solr land there are 2 faceting methods, one of which is designed for "situations
where the number of indexed values for the field is high, but the number of
values per document is low":

Field Cache: If facet.method=fc then a field-cache approach will be used.
This is currently implemented using either the Lucene FieldCache
(http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/all/org/apache/lucene/search/FieldCache.html) or
(starting in Solr 1.4) an UnInvertedField if the field either is
multi-valued or is tokenized (according to FieldType:
http://wiki.apache.org/solr/FieldType.isTokened()).
Each document is looked up in the cache to see what terms/values it
contains, and a tally is incremented for each value. This is excellent for
situations where the number of indexed values for the field is high, but
the number of values per document is low. For multi-valued fields, a hybrid
approach is used that uses term filters from the filterCache for terms that
match many documents.

Source: http://wiki.apache.org/solr/SolrFacetingOverview
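
(For illustration, a Solr request using that method would look roughly like this - the core path, field name, and limit are just placeholders:)

curl 'http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=tags&facet.method=fc&facet.limit=15'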

I didn't see anything like this in ES docs and I'm wondering if there is
room for improvement in ES faceting or....?

Thanks,
Otis

Search Analytics - http://sematext.com/search-analytics/index.html
Scalable Performance Monitoring - http://sematext.com/spm/index.html

Otis,

I think this is similar to/the same as the issue I raised here:

Question/comment about multi-value field data construction · Issue #1683 · elastic/elasticsearch · GitHub

Interesting that the effect you saw was increased latency; in 0.18 the
construction would always be fast but would run out of memory (unless you
used "soft" caching, in which case it would just "swap" data in/out a lot,
which would obviously increase the latency; but caches are by default
"hard").

Workarounds: nesting helps a lot, provided you don't need to do custom
sorts. Where you do need to sort, creating separate indexes for the "worst"
offenders (eg highest 1% of array sizes) worked very well.
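
(For the nesting workaround, a minimal sketch of the kind of mapping I mean - the index, type and field names here are hypothetical:)

curl -XPUT 'localhost:9200/myindex/doc/_mapping' -d '{
  "doc" : {
    "properties" : {
      "tags" : {
        "type" : "nested",
        "properties" : {
          "name" : { "type" : "string", "index" : "not_analyzed" }
        }
      }
    }
  }
}'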

I think Shay mentioned during the discussion that improving the field cache
to handle large field cardinalities was on his todo list.

Alex
Ikanow: agile intelligence through open analytics http://bit.ly/ikanow-oss

On Tuesday, May 22, 2012 2:18:23 PM UTC-4, Otis Gospodnetic wrote:

Hi,

We're doing some ES performance testing with a relatively small index.
All is peachy until we want to facet on a field that has relatively high
cardinality - in this case it's a "tags" field that, as you can imagine,
has a high number of distinct values across all documents in the index.
So when we include faceting on tags in our queries performance sinks from
over 400 QPS to 20-30 QPS. The average latency jumps from 40 ms to 500 ms.

Is there anything in ES that one can use to improve performance in such
cases?

In Solr land there are 2 faceting methods, one of which is designed for "situations
where the number of indexed values for the field is high, but the number of
values per document is low":

Field Cache: If facet.method=fc then a field-cache approach will be
used. This is currently implemented using either the Lucene FieldCache
(http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/all/org/apache/lucene/search/FieldCache.html) or
(starting in Solr 1.4) an UnInvertedField if the field either is
multi-valued or is tokenized (according to FieldType:
http://wiki.apache.org/solr/FieldType.isTokened()).
Each document is looked up in the cache to see what terms/values it
contains, and a tally is incremented for each value. This is excellent for
situations where the number of indexed values for the field is high, but
the number of values per document is low. For multi-valued fields, a hybrid
approach is used that uses term filters from the filterCache for terms that
match many documents.

Source: SolrFacetingOverview - Solr - Apache Software Foundation

I didn't see anything like this in ES docs and I'm wondering if there is
room for improvement in ES faceting or....?

Thanks,
Otis

Search Analytics - Cloud Monitoring Tools & Services | Sematext
Scalable Performance Monitoring - Sematext Monitoring | Infrastructure Monitoring Service

Hi,

On Tuesday, May 22, 2012 5:56:21 PM UTC-4, Alex at Ikanow wrote:

Otis,

I think this is similar/the same to the issue I raised here:

Question/comment about multi-value field data construction · Issue #1683 · elastic/elasticsearch · GitHub

Interesting that the effect you saw was increased latency; in 0.18 the
construction would always be fast but would run out of memory (unless you
used "soft" caching, in which case it would just "swap" data in/out a lot,
which would obviously increase the latency; but caches are by default
"hard").

Right. We're about to do a round of performance tests and use SPM for ES
to look at all ES cache stats.

Workarounds: nesting helps a lot, provided you don't need to do custom
sorts. Where you do need to sort, creating separate indexes for the "worst"
offenders (eg highest 1% of array sizes) worked very well.

I think Shay mentioned during the discussion that improving the field
cache to handle large field cardinalities was on his todo list

Uh, that would be great!
Shay, is there an issue we should watch, and do you know if this will be in
0.20?

Thanks!
Otis

Search Analytics - Cloud Monitoring Tools & Services | Sematext
Scalable Performance Monitoring - Sematext Monitoring | Infrastructure Monitoring Service

Alex
Ikanow: agile intelligence through open analytics http://bit.ly/ikanow-oss

On Tuesday, May 22, 2012 2:18:23 PM UTC-4, Otis Gospodnetic wrote:

Hi,

We're doing some ES performance testing with a relatively small index.
All is peachy until we want to facet on a field that has relatively high
cardinality - in this case it's a "tags" field that, as you can imagine,
has a high number of distinct values across all documents in the index.
So when we include faceting on tags in our queries performance sinks from
over 400 QPS to 20-30 QPS. The average latency jumps from 40 ms to 500 ms.

Is there anything in ES that one can use to improve performance in such
cases?

In Solr land there are 2 faceting methods, one of which is designed for "situations
where the number of indexed values for the field is high, but the number of
values per document is low":

Field Cache: If facet.method=fc then a field-cache approach will be
used. This is currently implemented using either the Lucene FieldCache
(http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/all/org/apache/lucene/search/FieldCache.html) or
(starting in Solr 1.4) an UnInvertedField if the field either is
multi-valued or is tokenized (according to FieldType:
http://wiki.apache.org/solr/FieldType.isTokened()).
Each document is looked up in the cache to see what terms/values it
contains, and a tally is incremented for each value. This is excellent for
situations where the number of indexed values for the field is high, but
the number of values per document is low. For multi-valued fields, a hybrid
approach is used that uses term filters from the filterCache for terms that
match many documents.

Source: SolrFacetingOverview - Solr - Apache Software Foundation

I didn't see anything like this in ES docs and I'm wondering if there is
room for improvement in ES faceting or....?

Thanks,
Otis

Search Analytics - Cloud Monitoring Tools & Services | Sematext
Scalable Performance Monitoring - Sematext Monitoring | Infrastructure Monitoring Service

Hello,

On Wednesday, May 23, 2012 6:02:00 PM UTC-4, Otis Gospodnetic wrote:

Hi,

On Tuesday, May 22, 2012 5:56:21 PM UTC-4, Alex at Ikanow wrote:

Otis,

I think this is similar/the same to the issue I raised here:

Question/comment about multi-value field data construction · Issue #1683 · elastic/elasticsearch · GitHub

Interesting that the effect you saw was increased latency; in 0.18 the
construction would always be fast but would run out of memory (unless you
used "soft" caching, in which case it would just "swap" data in/out a lot,
which would obviously increase the latency; but caches are by default
"hard").

Right. We're about to do a round of performance tests and use SPM for ES
to look at all ES cache stats.

Workarounds: nesting helps a lot, provided you don't need to do custom
sorts. Where you do need to sort, creating separate indexes for the "worst"
offenders (eg highest 1% of array sizes) worked very well.

I think Shay mentioned during the discussion that improving the field
cache to handle large field cardinalities was on his todo list

Uh, that would be great!
Shay, is there an issue we should watch, and do you know if this will be
in 0.20?

Shay, for what it's worth, I did some thread dumping while running a
performance test with faceting on high cardinality field and identified
what looks like a hotspot:

"elasticsearch[search]-pool-6-thread-22" daemon prio=10
tid=0x00002ab2e8183800 nid=0x3681 runnable [0x0000000049f03000]
java.lang.Thread.State: RUNNABLE
at
org.apache.lucene.util.PriorityQueue.downHeap(PriorityQueue.java:239)
at
org.apache.lucene.util.PriorityQueue.updateTop(PriorityQueue.java:202)
at
org.elasticsearch.search.facet.terms.strings.TermsStringOrdinalsFacetCollector.facet(TermsStringOrdinalsFacetCollector.java:168)
at
org.elasticsearch.search.facet.FacetPhase.execute(FacetPhase.java:138)
at
org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:203)

I have not looked at that code yet, but you probably know what's on line 168
by heart. Is there any chance that something could be optimized there?
And should I open an issue with the above, or is this a known thing that
already has an issue open?

Also, I see "strings" in the package name.
Do you think performance would be any better if we somehow replaced string
tokens with, say, int tokens?

Thanks,
Otis

Search Analytics - Cloud Monitoring Tools & Services | Sematext
Scalable Performance Monitoring - Sematext Monitoring | Infrastructure Monitoring Service

Alex

Ikanow: agile intelligence through open analytics http://bit.ly/ikanow-oss

On Tuesday, May 22, 2012 2:18:23 PM UTC-4, Otis Gospodnetic wrote:

Hi,

We're doing some ES performance testing with a relatively small index.
All is peachy until we want to facet on a field that has relatively high
cardinality - in this case it's a "tags" field that, as you can imagine,
has a high number of distinct values across all documents in the index.
So when we include faceting on tags in our queries performance sinks
from over 400 QPS to 20-30 QPS. The average latency jumps from 40 ms to
500 ms.

Is there anything in ES that one can use to improve performance in such
cases?

In Solr land there are 2 faceting methods, one of which is designed for "situations
where the number of indexed values for the field is high, but the number of
values per document is low":

Field Cache: If facet.method=fc then a field-cache approach will be
used. This is currently implemented using either the Lucene FieldCache
(http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/all/org/apache/lucene/search/FieldCache.html) or
(starting in Solr 1.4) an UnInvertedField if the field either is
multi-valued or is tokenized (according to FieldType:
http://wiki.apache.org/solr/FieldType.isTokened()).
Each document is looked up in the cache to see what terms/values it
contains, and a tally is incremented for each value. This is excellent for
situations where the number of indexed values for the field is high, but
the number of values per document is low. For multi-valued fields, a hybrid
approach is used that uses term filters from the filterCache for terms that
match many documents.

Source: SolrFacetingOverview - Solr - Apache Software Foundation

I didn't see anything like this in ES docs and I'm wondering if there is
room for improvement in ES faceting or....?

Thanks,
Otis

Search Analytics - Cloud Monitoring Tools & Services | Sematext
Scalable Performance Monitoring - Sematext Monitoring | Infrastructure Monitoring Service

I switched from string tags to short tags and increased the amount of data
I could load in memory before hitting OOM.
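
(A minimal sketch of the mapping side of that change, on a fresh index with the elements/element type - the tag strings get translated to numeric IDs before indexing, and the field is mapped as a numeric type:)

curl -XPUT 'localhost:9200/elements/element/_mapping' -d '{
  "element" : {
    "properties" : {
      "tags" : { "type" : "short" }
    }
  }
}'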

See https://groups.google.com/d/msg/elasticsearch/xsMmFDuSVCM/gPzXVBzLlBkJ
for all the things I've done so far.

This week I finally got my new machines with more memory, and that
obviously has helped the most :)

Thanks,
Andy

Hi Andy,

On Thursday, May 24, 2012 8:15:16 AM UTC-4, Andy Wick wrote:

I switched from string tags to short tags and increased the amount of data
I could load in memory before hitting OOM.

See https://groups.google.com/d/msg/elasticsearch/xsMmFDuSVCM/gPzXVBzLlBkJ for all the things I've done so far.

OK, so these are the relevant points for us from that thread:

In our case the problem is not OOM -- we have servers with > 90 GB RAM.
Our problem is speed - query latency.

So, would you happen to know if either of the above changes had a positive
effect on query speed?

Thanks,
Otis

Search Analytics - Cloud Monitoring Tools & Services | Sematext
Scalable Performance Monitoring - Sematext Monitoring | Infrastructure Monitoring Service

This week I finally got my new machines with more memory, and that

obviously has helped the most :)

Thanks,
Andy

Hi,

On Thursday, May 24, 2012 8:15:16 AM UTC-4, Andy Wick wrote:

I switched from string tags to short tags and increased the amount of data
I could load in memory before hitting OOM.

See https://groups.google.com/d/msg/elasticsearch/xsMmFDuSVCM/gPzXVBzLlBkJ for all the things I've done so far.

Re that _version trick. Is the following what you did?

Create 2 indices:

  1. the main index
  2. the tags-to-sequence-number-generator-via-version trick index

Index 2) is used to send each tag of the doc being indexed and get a
distinct number for each new tag via _version.
This converts tags to numbers and lets you index tags as numbers in the
main index.

Something like this:
doc 1:
tags: a b c
doc 2:
tags: a foo bar

At index time this happens for doc 1:

  • send tag a to index 2) and get some int back, say 1
  • send tag b to index 2) and get back 2
  • send tag c to index 2) and get back 3

index this doc in main index

Then for doc 2:

  • send tag a to index 2) and get back 1 (again!)
  • send tag foo to index 2) and get back 4
  • send tag bar to index 2) and get back 5

index this doc in main index.

Then, at search time, you facet on the tags field that is now multi-valued
and numeric (and not multi-valued and string).

So ES could return facet (count) as follows:

1 (2)
2 (1)
3 (1)
4 (1)
5 (1)

And then you use index 2) to look up 1, 2, 3, 4, and 5 and get back the
original string values of those tags, thus allowing you to show this to the
end user:

a (2)
b (1)
c (1)
foo (1)
bar (1)
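
Concretely, I imagine the sequence index being driven something like this (just a sketch with hypothetical index/type names - for a tag you haven't seen before, you reindex the same counter document, its _version gets bumped, and that version number becomes the tag's numeric ID):

curl -XPOST 'localhost:9200/tagseq/seq/counter' -d '{}'
# response includes something like ... "_version" : 4 ...
# -> 4 is the numeric ID assigned to the new tag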

Something like that?

If so, doesn't this considerably slow down your indexing?
And doesn't it actually add search latency?

In your case speed may not matter as much as memory footprint. In our case
we need to index a few thousand documents a second and handle > 100 QPS
with 90th percentile latency < 100 ms.

Thanks,
Otis

Search Analytics - Cloud Monitoring Tools & Services | Sematext
Scalable Performance Monitoring - Sematext Monitoring | Infrastructure Monitoring Service

This week I finally got my new machines with more memory, and that

obviously has helped the most :)

Thanks,
Andy

As far as speed increase from switching from strings to shorts - I didn't
measure it. Subjectively it felt faster, but that could also have been
from reduced loading from disk. My deployment is billions of documents,
6k new docs a second, but EXTREMELY low QPS (basically 0) with no real
response time requirements.

Currently I actually have 3 types in 2 indexes (although you could do it
differently)

  1. elements/element
  2. tags/tag
  3. tags/sequence (only has 1 document currently, and in theory could be in
    your tag type, but I kept it separate)

I have about 15k tags right now. I bulk index everything, and my indexer
stays running, so on start it loads all the tags and caches them. It
only hits ES if it doesn't have the tag in cache, in which case it does a
GET to make sure the tag wasn't already added by another indexer; if it's
not there, it gets a new sequence number and then does a POST with
op_type=create to handle the possible race condition of multiple indexers
creating the same tag. I do the same caching on the viewer side, where I map
the numbers back to strings before returning to the user.
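
(Roughly, the per-new-tag sequence of calls looks like this - a sketch with a placeholder tag name and document body, not my exact code:)

# check whether another indexer already created the tag
curl -XGET 'localhost:9200/tags/tag/mytag'
# if not found: bump the sequence document; the _version in the response is the new number
curl -XPOST 'localhost:9200/tags/sequence/1' -d '{}'
# create the tag with that number; op_type=create fails if another indexer won the race
curl -XPUT 'localhost:9200/tags/tag/mytag?op_type=create' -d '{ "n" : 42 }'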

If you don't have long running indexers and viewers, or an un-cachable
number of tags, then I agree the extra look ups will hurt performance.

Another suggestion that might be easier: if your tags can be split up into
categories and you don't need to facet on all the categories each time,
then splitting them into multiple fields should help (see the sketch below).
The max number of tags per document really seems to affect memory (and, I'm
assuming, performance). Of course, if you need to facet on everything, then
I doubt it will help, and it would probably hurt performance.
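
(A sketch of what I mean, with hypothetical field names - map the categories as separate fields, e.g. tags_color and tags_topic, and only facet on the one you need:)

curl -XPOST 'localhost:9200/myindex/_search' -d '{
  "query" : { "match_all" : {} },
  "facets" : {
    "colors" : { "terms" : { "field" : "tags_color", "size" : 15 } }
  }
}'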

Thanks,
Andy

Hey,

Otis, are you using any special configuration on the field cache? Are you
also indexing data while searching?

The profiling section that you saw is known; it's basically the time
it takes to sort through the per-segment aggregations. As was mentioned, I
have some ideas on how to improve that, and I'm working hard on making it into 0.20.

-shay.banon

On Thu, May 24, 2012 at 10:28 PM, Andy Wick andywick@gmail.com wrote:

As far as speed increase from switching from strings to shorts - I didn't
measure it. Subjectively it felt faster, but that could also have been
from reduced loading from disk. My deployment is billions of documents,
6k new docs a second, but EXTREMELY low QPS (basically 0) with no real
response time requirements.

Currently I actually have 3 types in 2 indexes (although you could do it
differently)

  1. elements/element
  2. tags/tag
  3. tags/sequence (only has 1 document currently, and in theory could be
    in your tag type, but i kept it separate)

I have about 15k tags right now. I bulk index everything, and my indexer
stays runnings. So on start it loads all the tags and caches them. It
only hits ES if it doesn't have the tag in cache. In which case it does a
GET, to make sure it wasn't already added by another indexer, if not there
it then gets a new sequence number, and then a POST with op_type=create to
handle the possible race condition of multiple indexers creating. I do the
same caching on the viewer side where I map the numbers back to strings
before returning to the user.

If you don't have long running indexers and viewers, or an un-cachable
number of tags, then I agree the extra look ups will hurt performance.

Another suggestion that might be easier is if your tags can be split up in
categories and you don't need to facet all the categories each time,
then splitting into multiple fields should help. The max number of tags
per document really seems to effect memory (and I'm assuming performance.)
Of course if you need to facet everything then I doubt it will help, and
probably hurt performance.

Thanks,
Andy

Hey Shay,

Great to hear that's a known hotspot and looking forward to any
improvements there! So when is 0.20 coming? Just kidding.

Maybe that hotspot was being hit because of new segments, because, yes, we
had indexing going on while searching (we have to do this - documents are
streaming in all the time, and we can't stop indexing). Maybe we could
increase our index refresh interval and see if that improves the performance
of queries with facets...
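
(If we go that route, it would just be a live settings update along these lines - a sketch, with a placeholder index name and an arbitrary interval:)

curl -XPUT 'localhost:9200/myindex/_settings' -d '{
  "index" : { "refresh_interval" : "30s" }
}'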

Otis

Search Analytics - Cloud Monitoring Tools & Services | Sematext
Scalable Performance Monitoring - Sematext Monitoring | Infrastructure Monitoring Service

On Tuesday, May 29, 2012 2:19:24 PM UTC-4, kimchy wrote:

Hey,

Otis, are you using any special configuration on the field cache? Are
you also indexing data while searching?

The profiling section that you saw is known, and its basically the time
it takes to sort through the per segment aggregations. As was mentioned, I
have some ideas on how to improve that, working hard on making it into 0.20.

-shay.banon

On Thu, May 24, 2012 at 10:28 PM, Andy Wick andywick@gmail.com wrote:

As far as speed increase from switching from strings to shorts - I didn't
measure it. Subjectively it felt faster, but that could also have been
from reduced loading from disk. My deployment is billions of documents,
6k new docs a second, but EXTREMELY low QPS (basically 0) with no real
response time requirements.

Currently I actually have 3 types in 2 indexes (although you could do it
differently)

  1. elements/element
  2. tags/tag
  3. tags/sequence (only has 1 document currently, and in theory could be
    in your tag type, but i kept it separate)

I have about 15k tags right now. I bulk index everything, and my indexer
stays runnings. So on start it loads all the tags and caches them. It
only hits ES if it doesn't have the tag in cache. In which case it does a
GET, to make sure it wasn't already added by another indexer, if not there
it then gets a new sequence number, and then a POST with op_type=create to
handle the possible race condition of multiple indexers creating. I do the
same caching on the viewer side where I map the numbers back to strings
before returning to the user.

If you don't have long running indexers and viewers, or an un-cachable
number of tags, then I agree the extra look ups will hurt performance.

Another suggestion that might be easier is if your tags can be split up
in categories and you don't need to facet all the categories each time,
then splitting into multiple fields should help. The max number of tags
per document really seems to effect memory (and I'm assuming performance.)
Of course if you need to facet everything then I doubt it will help, and
probably hurt performance.

Thanks,
Andy

I think the warmup option is the best one you have. The current state of 0.20
is that it's basically 0.19 + warmups, and it's being used in production by
several users (who needed the warmup option).
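
(For reference, registering a warmer that exercises the facet would look roughly like this in 0.20 - the index, warmer and field names are placeholders:)

curl -XPUT 'localhost:9200/myindex/_warmer/tags_facet_warmer' -d '{
  "query" : { "match_all" : {} },
  "facets" : {
    "tags" : { "terms" : { "field" : "tags", "size" : 10 } }
  }
}'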

On Tue, May 29, 2012 at 9:39 PM, Otis Gospodnetic <
otis.gospodnetic@gmail.com> wrote:

Hey Shay,

Great to hear that's a known hotspot and looking forward to any
improvements there! So when is 0.20 coming? Just kidding.

Maybe that hotspot was being hit because of those new segments because,
yes, we had indexing going while searching (we have to do this - documents
are streaming in all the time). We can't stop indexing. Maybe we could
increase our index refresh interval and see if that improves performance of
queries with facets...

Otis

Search Analytics - Cloud Monitoring Tools & Services | Sematext
Scalable Performance Monitoring - Sematext Monitoring | Infrastructure Monitoring Service

On Tuesday, May 29, 2012 2:19:24 PM UTC-4, kimchy wrote:

Hey,

Otis, are you using any special configuration on the field cache? Are
you also indexing data while searching?

The profiling section that you saw is known, and its basically the
time it takes to sort through the per segment aggregations. As was
mentioned, I have some ideas on how to improve that, working hard on making
it into 0.20.

-shay.banon

On Thu, May 24, 2012 at 10:28 PM, Andy Wick andywick@gmail.com wrote:

As far as speed increase from switching from strings to shorts - I
didn't measure it. Subjectively it felt faster, but that could also have
been from reduced loading from disk. My deployment is billions of
documents, 6k new docs a second, but EXTREMELY low QPS (basically 0) with
no real response time requirements.

Currently I actually have 3 types in 2 indexes (although you could do it
differently)

  1. elements/element
  2. tags/tag
  3. tags/sequence (only has 1 document currently, and in theory could be
    in your tag type, but i kept it separate)

I have about 15k tags right now. I bulk index everything, and my
indexer stays runnings. So on start it loads all the tags and caches them.
It only hits ES if it doesn't have the tag in cache. In which case it
does a GET, to make sure it wasn't already added by another indexer, if not
there it then gets a new sequence number, and then a POST
with op_type=create to handle the possible race condition of multiple
indexers creating. I do the same caching on the viewer side where I map
the numbers back to strings before returning to the user.

If you don't have long running indexers and viewers, or an un-cachable
number of tags, then I agree the extra look ups will hurt performance.

Another suggestion that might be easier is if your tags can be split up
in categories and you don't need to facet all the categories each time,
then splitting into multiple fields should help. The max number of tags
per document really seems to effect memory (and I'm assuming performance.)
Of course if you need to facet everything then I doubt it will help, and
probably hurt performance.

Thanks,
Andy

Hey there,
Any updates on this problem? I have the exact same thing happening.
Comparing with Solr (comparable schema and same data) using
facet.method=enum, I get about the same response time for Solr and
Elasticsearch, but when using facet.method=fc, I get results around 4x
faster for Solr.
Is there any work going on that addresses this? Otis, did you manage to
find a workaround for it?

Thanks,

Leo

On Wednesday, May 30, 2012 12:29:38 AM UTC+2, kimchy wrote:

I think the warmup option is the best one you have. Current state of 0.20
is that its basically 0.19 + warmups, and its being used in production by
several users (that needed the warmup option).

On Tue, May 29, 2012 at 9:39 PM, Otis Gospodnetic <otis.gos...@gmail.com> wrote:

Hey Shay,

Great to hear that's a known hotspot and looking forward to any
improvements there! So when is 0.20 coming? Just kidding.

Maybe that hotspot was being hit because of those new segments because,
yes, we had indexing going while searching (we have to do this - documents
are streaming in all the time). We can't stop indexing. Maybe we could
increase our index refresh interval and see if that improves performance of
queries with facets...

Otis

Search Analytics - Cloud Monitoring Tools & Services | Sematext
Scalable Performance Monitoring - Sematext Monitoring | Infrastructure Monitoring Service

On Tuesday, May 29, 2012 2:19:24 PM UTC-4, kimchy wrote:

Hey,

Otis, are you using any special configuration on the field cache? Are
you also indexing data while searching?

The profiling section that you saw is known, and its basically the
time it takes to sort through the per segment aggregations. As was
mentioned, I have some ideas on how to improve that, working hard on making
it into 0.20.

-shay.banon

On Thu, May 24, 2012 at 10:28 PM, Andy Wick <andy...@gmail.com> wrote:

As far as speed increase from switching from strings to shorts - I
didn't measure it. Subjectively it felt faster, but that could also have
been from reduced loading from disk. My deployment is billions of
documents, 6k new docs a second, but EXTREMELY low QPS (basically 0) with
no real response time requirements.

Currently I actually have 3 types in 2 indexes (although you could do
it differently)

  1. elements/element
  2. tags/tag
  3. tags/sequence (only has 1 document currently, and in theory could
    be in your tag type, but i kept it separate)

I have about 15k tags right now. I bulk index everything, and my
indexer stays runnings. So on start it loads all the tags and caches them.
It only hits ES if it doesn't have the tag in cache. In which case it
does a GET, to make sure it wasn't already added by another indexer, if not
there it then gets a new sequence number, and then a POST
with op_type=create to handle the possible race condition of multiple
indexers creating. I do the same caching on the viewer side where I map
the numbers back to strings before returning to the user.

If you don't have long running indexers and viewers, or an un-cachable
number of tags, then I agree the extra look ups will hurt performance.

Another suggestion that might be easier is if your tags can be split up
in categories and you don't need to facet all the categories each time,
then splitting into multiple fields should help. The max number of tags
per document really seems to effect memory (and I'm assuming performance.)
Of course if you need to facet everything then I doubt it will help, and
probably hurt performance.

Thanks,
Andy

--

Looking at the latest commits, there seem to have been numerous changes to
the field data and the facets that use them. Hopefully they address the
issue and something will be released soon.

Lucene 4.1 was just released. The next version of Elasticsearch is supposed
to support Lucene 4, so things might be sidetracked in order to catch up.
Many new features under the hood, perhaps Elasticsearch will make use of
them.

Cheers,

Ivan

On Wed, Jan 23, 2013 at 9:16 AM, Leonardo Menezes <
leonardo.menezess@gmail.com> wrote:

Hey there,
any updates on this problem? I have the exact same thing happening.
Comparing with SolR(comparable schema and same data) using
facet.method=enum, I get about the same response time for SolR and
Elasticsearch, but when using facet.method=fc, I get results around 4x
faster for SolR.
Is there any work going on the addresses that? Otis, did you manage to
find a work around for that?

Thanks,

Leo

On Wednesday, May 30, 2012 12:29:38 AM UTC+2, kimchy wrote:

I think the warmup option is the best one you have. Current state of 0.20
is that its basically 0.19 + warmups, and its being used in production by
several users (that needed the warmup option).

On Tue, May 29, 2012 at 9:39 PM, Otis Gospodnetic <otis.gos...@gmail.com> wrote:

Hey Shay,

Great to hear that's a known hotspot and looking forward to any
improvements there! So when is 0.20 coming? Just kidding.

Maybe that hotspot was being hit because of those new segments because,
yes, we had indexing going while searching (we have to do this - documents
are streaming in all the time). We can't stop indexing. Maybe we could
increase our index refresh interval and see if that improves performance of
queries with facets...

Otis

Search Analytics - http://sematext.com/search-analytics/index.html
Scalable Performance Monitoring - http://sematext.com/spm/index.html

On Tuesday, May 29, 2012 2:19:24 PM UTC-4, kimchy wrote:

Hey,

Otis, are you using any special configuration on the field cache? Are
you also indexing data while searching?

The profiling section that you saw is known, and its basically the
time it takes to sort through the per segment aggregations. As was
mentioned, I have some ideas on how to improve that, working hard on making
it into 0.20.

-shay.banon

On Thu, May 24, 2012 at 10:28 PM, Andy Wick andy...@gmail.com wrote:

As far as speed increase from switching from strings to shorts - I
didn't measure it. Subjectively it felt faster, but that could also have
been from reduced loading from disk. My deployment is billions of
documents, 6k new docs a second, but EXTREMELY low QPS (basically 0) with
no real response time requirements.

Currently I actually have 3 types in 2 indexes (although you could do
it differently)

  1. elements/element
  2. tags/tag
  3. tags/sequence (only has 1 document currently, and in theory could
    be in your tag type, but i kept it separate)

I have about 15k tags right now. I bulk index everything, and my
indexer stays runnings. So on start it loads all the tags and caches them.
It only hits ES if it doesn't have the tag in cache. In which case it
does a GET, to make sure it wasn't already added by another indexer, if not
there it then gets a new sequence number, and then a POST
with op_type=create to handle the possible race condition of multiple
indexers creating. I do the same caching on the viewer side where I map
the numbers back to strings before returning to the user.

If you don't have long running indexers and viewers, or an un-cachable
number of tags, then I agree the extra look ups will hurt performance.

Another suggestion that might be easier is if your tags can be split
up in categories and you don't need to facet all the categories each time,
then splitting into multiple fields should help. The max number of tags
per document really seems to effect memory (and I'm assuming performance.)
Of course if you need to facet everything then I doubt it will help, and
probably hurt performance.

Thanks,
Andy

--

--

Hi Leo,

I have not looked into this further and have not noticed any changes that
would improve this particular issue in ES.
In Lucene, devs are going crazy improving faceting performance, but ES has
its own faceting impl, and Solr its own as well.

See:
http://search-lucene.com/m/hQZT12C5C8x1/facet&subj=Re+Solr+faceting+vs+Lucene+faceting

Otis

ELASTICSEARCH Performance Monitoring - Sematext Monitoring | Infrastructure Monitoring Service

On Wednesday, January 23, 2013 12:16:31 PM UTC-5, Leonardo Menezes wrote:

Hey there,
any updates on this problem? I have the exact same thing happening.
Comparing with SolR(comparable schema and same data) using
facet.method=enum, I get about the same response time for SolR and
Elasticsearch, but when using facet.method=fc, I get results around 4x
faster for SolR.
Is there any work going on the addresses that? Otis, did you manage to
find a work around for that?

Thanks,

Leo

On Wednesday, May 30, 2012 12:29:38 AM UTC+2, kimchy wrote:

I think the warmup option is the best one you have. Current state of 0.20
is that its basically 0.19 + warmups, and its being used in production by
several users (that needed the warmup option).

On Tue, May 29, 2012 at 9:39 PM, Otis Gospodnetic <otis.gos...@gmail.com> wrote:

Hey Shay,

Great to hear that's a known hotspot and looking forward to any
improvements there! So when is 0.20 coming? Just kidding.

Maybe that hotspot was being hit because of those new segments because,
yes, we had indexing going while searching (we have to do this - documents
are streaming in all the time). We can't stop indexing. Maybe we could
increase our index refresh interval and see if that improves performance of
queries with facets...

Otis

Search Analytics - Cloud Monitoring Tools & Services | Sematext
Scalable Performance Monitoring - Sematext Monitoring | Infrastructure Monitoring Service

On Tuesday, May 29, 2012 2:19:24 PM UTC-4, kimchy wrote:

Hey,

Otis, are you using any special configuration on the field cache? Are
you also indexing data while searching?

The profiling section that you saw is known, and its basically the
time it takes to sort through the per segment aggregations. As was
mentioned, I have some ideas on how to improve that, working hard on making
it into 0.20.

-shay.banon

On Thu, May 24, 2012 at 10:28 PM, Andy Wick andy...@gmail.com wrote:

As far as speed increase from switching from strings to shorts - I
didn't measure it. Subjectively it felt faster, but that could also have
been from reduced loading from disk. My deployment is billions of
documents, 6k new docs a second, but EXTREMELY low QPS (basically 0) with
no real response time requirements.

Currently I actually have 3 types in 2 indexes (although you could do
it differently)

  1. elements/element
  2. tags/tag
  3. tags/sequence (only has 1 document currently, and in theory could
    be in your tag type, but i kept it separate)

I have about 15k tags right now. I bulk index everything, and my
indexer stays runnings. So on start it loads all the tags and caches them.
It only hits ES if it doesn't have the tag in cache. In which case it
does a GET, to make sure it wasn't already added by another indexer, if not
there it then gets a new sequence number, and then a POST
with op_type=create to handle the possible race condition of multiple
indexers creating. I do the same caching on the viewer side where I map
the numbers back to strings before returning to the user.

If you don't have long running indexers and viewers, or an un-cachable
number of tags, then I agree the extra look ups will hurt performance.

Another suggestion that might be easier is if your tags can be split
up in categories and you don't need to facet all the categories each time,
then splitting into multiple fields should help. The max number of tags
per document really seems to effect memory (and I'm assuming performance.)
Of course if you need to facet everything then I doubt it will help, and
probably hurt performance.

Thanks,
Andy

--

Have you seen some of the latest commits?

Added sparse multi ordinals implementation for field data. · elastic/elasticsearch@346422b · GitHub

There are no issues attached to these commits, so there is no telling what
version they belong to.

--
Ivan

On Wed, Jan 23, 2013 at 9:09 PM, Otis Gospodnetic <
otis.gospodnetic@gmail.com> wrote:

I have not looked into this further and have not notices any changes that
would improve this particular issue in ES.

--

Ivan Brusic wrote:

Have you seen some of the latest commits?

Added sparse multi ordinals implementation for field data. · elastic/elasticsearch@346422b · GitHub

There are no issues attached to these commits, so there is no
telling what version they belong to.

The goal is for the fielddata refactoring and Lucene 4.1 integration
to appear in 0.21.0. Much of the work is already in master.

-Drew

--

So... just to give an update on this. Reading the source code last night,
we found a parameter that doesn't seem to be documented anywhere and that
is related to choosing which faceting method should be used for a certain
field. The parameter is called execution_hint and should be used like this:

"facets" : {
  "company" : {
    "terms" : {
      "field" : "current_company",
      "size" : 15,
      "execution_hint" : "map"
    }
  }
}
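
(For completeness, a full request with the hint would look something like the following - the index name and the match_all query are just for illustration:)

curl -XPOST 'localhost:9200/myindex/_search' -d '{
  "query" : { "match_all" : {} },
  "facets" : {
    "company" : {
      "terms" : {
        "field" : "current_company",
        "size" : 15,
        "execution_hint" : "map"
      }
    }
  }
}'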

The process of choosing the faceting method occurs at TermsFacetProcessor
and is a bit different for strings than it is for other types. Anyway,
after running some tests with this setting, our response time improved a
LOT. So, some numbers:

Index: 12MM documents
Field: string, multi-valued, with about 400k unique values
Document: has between 1 and 10 values for this field

Query #1 (matches 5000k documents)

  • using "execution_hint":"map" - roughly 50ms avg.
  • not using it - roughly 600ms avg.

Query #2 (match all, so 12MM documents)

  • using "execution_hint":"map" - roughly 1.9s avg.
  • not using it - roughly 800ms avg.

So, since our query pattern is really close to query #1, that really made a
big difference in our results. Hope that might be of some help for someone
else.

Leonardo Menezes
(+34) 688907766
http://lmenezes.com

http://twitter.com/leonardomenezes

On Thu, Jan 24, 2013 at 4:39 PM, Drew Raines aaraines@gmail.com wrote:

Ivan Brusic wrote:

Have you seen some of the latest commits?

Added sparse multi ordinals implementation for field data. · elastic/elasticsearch@346422b · GitHub

There are no issues attached to these commits, so there is no
telling what version they belong to.

The goal is for the fielddata refactoring and Lucene 4.1 integration
to appear in 0.21.0. Much of the work is already in master.

-Drew

--

Is this against master or a previous version? And what about memory usage?

On Fri, Jan 25, 2013 at 11:16 AM, Leonardo Menezes <
leonardo.menezess@gmail.com> wrote:

So... just to give an update on this. Reading the source code last night,
We found a parameter that doesn't seem to be documented anywhere and that
is related to choosing which faceting method should be used for a certain
field. The parameter is called execution_hint and should be used like

"facets" : {
"company" : {
"terms" : {
"field" : "current_company",
"size" : 15,
"execution_hint":"map"
}
}
}

The process of choosing the faceting method occurs at TermsFacetProcessor
and is a bit different for strings than it is for other types. Anyway,
after running some tests with this setting, our response time improved a
LOT. So, some numbers:

Index: 12MM documents
Field: string, multi valued. has about 400k unique value
Document: has between 1 to 10 values for this field

Query #1(matches 5000k documents)

  • using "execution_hint":"map" - roughly 50ms avg.
  • not using it - roughly 600ms avg.

Query #2(match all, so, 12MM documents)

  • using "execution_hint":"map" - roughly 1.9s avg.
  • not using it - roughly 800ms avg.

so, since our query pattern is really close to query #1, that really made
a big difference in our results. hope that might be of some help for
someone else.

Leonardo Menezes
(+34) 688907766
http://lmenezes.com

http://twitter.com/leonardomenezes

On Thu, Jan 24, 2013 at 4:39 PM, Drew Raines aaraines@gmail.com wrote:

Ivan Brusic wrote:

Have you seen some of the latest commits?

Added sparse multi ordinals implementation for field data. · elastic/elasticsearch@346422b · GitHub

There are no issues attached to these commits, so there is no
telling what version they belong to.

The goal is for the fielddata refactoring and Lucene 4.1 integration
to appear in 0.21.0. Much of the work is already in master.

-Drew

--

We are running 0.20.1. Memory usage actually dropped, as well as CPU usage.
Not really sure why that could be... As mentioned before, depending on the
query pattern you have, this setting might actually be counterproductive.

Also, without this option, we were not really able to keep the cluster
running for very long; at some point things would slow down too much
and the cluster would just become unstable. We have only been running our
cluster with live traffic since last night (before that, it wasn't able to
handle it). So if anything odd comes up I will update this, but at the
moment everything looks good.

Leonardo Menezes
(+34) 688907766
http://lmenezes.com

http://es.linkedin.com/in/leonardomenezess
http://twitter.com/leonardomenezes

On Fri, Jan 25, 2013 at 10:39 AM, Itamar Syn-Hershko <itamar@code972.com> wrote:

Is this against master or a previous version? And what about memory usage?

On Fri, Jan 25, 2013 at 11:16 AM, Leonardo Menezes <
leonardo.menezess@gmail.com> wrote:

So... just to give an update on this. Reading the source code last night,
We found a parameter that doesn't seem to be documented anywhere and that
is related to choosing which faceting method should be used for a certain
field. The parameter is called execution_hint and should be used like

"facets" : {
"company" : {
"terms" : {
"field" : "current_company",
"size" : 15,
"execution_hint":"map"
}
}
}

The process of choosing the faceting method occurs at TermsFacetProcessor
and is a bit different for strings than it is for other types. Anyway,
after running some tests with this setting, our response time improved a
LOT. So, some numbers:

Index: 12MM documents
Field: string, multi valued. has about 400k unique value
Document: has between 1 to 10 values for this field

Query #1(matches 5000k documents)

  • using "execution_hint":"map" - roughly 50ms avg.
  • not using it - roughly 600ms avg.

Query #2(match all, so, 12MM documents)

  • using "execution_hint":"map" - roughly 1.9s avg.
  • not using it - roughly 800ms avg.

so, since our query pattern is really close to query #1, that really made
a big difference in our results. hope that might be of some help for
someone else.

Leonardo Menezes
(+34) 688907766
http://lmenezes.com

http://twitter.com/leonardomenezes

On Thu, Jan 24, 2013 at 4:39 PM, Drew Raines aaraines@gmail.com wrote:

Ivan Brusic wrote:

Have you seen some of the latest commits?

Added sparse multi ordinals implementation for field data. · elastic/elasticsearch@346422b · GitHub

There are no issues attached to these commits, so there is no
telling what version they belong to.

The goal is for the fielddata refactoring and Lucene 4.1 integration
to appear in 0.21.0. Much of the work is already in master.

-Drew

--

--

Hi,

interesting... it looks like your system can fit the 5000k documents
into the cache with "execution_hint: map" without being hit seriously by
GC. Without execution_hint:map, do you use soft refs by any chance? That
would explain the 600ms - it could be extra time because your cache elements
are being invalidated.
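
(By soft refs I mean the field cache type setting - as far as I know this is the relevant knob, e.g. in elasticsearch.yml, with the default being a resident/"hard" cache:)

# elasticsearch.yml - switch the field cache to soft references
index.cache.field.type: soft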

Jörg

On 25.01.13 at 10:16, Leonardo Menezes wrote:

So... just to give an update on this. Reading the source code last
night, We found a parameter that doesn't seem to be documented
anywhere and that is related to choosing which faceting method should
be used for a certain field. The parameter is called execution_hint
and should be used like

"facets" : {
"company" : {
"terms" : {
"field" : "current_company",
"size" : 15,
"execution_hint":"map"
}
}
}

The process of choosing the faceting method occurs at
TermsFacetProcessor and is a bit different for strings than it is for
other types. Anyway, after running some tests with this setting, our
response time improved a LOT. So, some numbers:

Index: 12MM documents
Field: string, multi valued. has about 400k unique value
Document: has between 1 to 10 values for this field

Query #1(matches 5000k documents)

  • using "execution_hint":"map" - roughly 50ms avg.
  • not using it - roughly 600ms avg.

Query #2(match all, so, 12MM documents)

  • using "execution_hint":"map" - roughly 1.9s avg.
  • not using it - roughly 800ms avg.

so, since our query pattern is really close to query #1, that really
made a big difference in our results. hope that might be of some help
for someone else.

Leonardo Menezes
(+34) 688907766
http://lmenezes.com http://lmenezes.com/

http://twitter.com/leonardomenezes

On Thu, Jan 24, 2013 at 4:39 PM, Drew Raines <aaraines@gmail.com> wrote:

Ivan Brusic wrote:

> Have you seen some of the latest commits?
>
>
https://github.com/elasticsearch/elasticsearch/commit/346422b74751f498f037daff34ea136a131fca89
>
> There are no issues attached to these commits, so there is no
> telling what version they belong to.

The goal is for the fielddata refactoring and Lucene 4.1 integration
to appear in 0.21.0.  Much of the work is already in master.

-Drew

--