Performance killed when faceting on high cardinality fields

We tried using soft caches, but it made no difference because the entries were
never actually being invalidated. That was only for testing purposes, and we are
not using them anymore. We also don't have much of a memory problem: we are
running the JVM with a 30 GB heap, which is plenty for our needs at the moment.

Leonardo Menezes
(+34) 688907766
http://lmenezes.com

http://twitter.com/leonardomenezes

On Fri, Jan 25, 2013 at 11:06 AM, Jörg Prante joergprante@gmail.com wrote:

Hi,

interesting... it looks like your system can fit the 5000k documents into
the cache with "execution_hint: map" without being hit seriously by GC.
Without execution_hint:map, do you use soft refs by any chance? That would
explain the 600ms, could be extra time because your cache elements are
being invalidated.

Jörg

On 25.01.13 at 10:16, Leonardo Menezes wrote:

So... just to give an update on this. Reading the source code last night,
we found a parameter that doesn't seem to be documented anywhere and that
controls which faceting method is used for a given field. The parameter is
called execution_hint and is used like this:

"facets" : {
    "company" : {
        "terms" : {
            "field" : "current_company",
            "size" : 15,
            "execution_hint" : "map"
        }
    }
}
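
For anyone who wants to try this from a script, here is a minimal sketch in Python that builds the same request body with and without the hint, so the two variants can be timed side by side. The facet and field names come from the example above. The match_all query and the helper name build_terms_facet_request are assumptions for illustration only, and nothing is sent to a cluster here.

```python
import json


def build_terms_facet_request(field, size, execution_hint=None, query=None):
    """Build a search body with one terms facet (hypothetical helper).

    execution_hint is only added when given, so the same function can
    produce both variants for a side-by-side timing comparison.
    """
    terms = {"field": field, "size": size}
    if execution_hint is not None:
        terms["execution_hint"] = execution_hint
    return {
        # Assumption: match_all is a placeholder; use your real query here.
        "query": query if query is not None else {"match_all": {}},
        "facets": {"company": {"terms": terms}},
    }


with_hint = build_terms_facet_request("current_company", 15, execution_hint="map")
without_hint = build_terms_facet_request("current_company", 15)

print(json.dumps(with_hint, indent=2))
```

Sending both bodies against the same index and comparing the reported response times is the simplest way to check whether the hint helps for a given query pattern.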

The choice of faceting method is made in TermsFacetProcessor and works a bit
differently for strings than for other types. Anyway, after running some
tests with this setting, our response time improved a LOT. So, some numbers:

Index: 12MM documents
Field: string, multi-valued, with about 400k unique values
Document: has between 1 and 10 values for this field

Query #1 (matches 5000k documents)

  • using "execution_hint":"map" - roughly 50ms avg.
  • not using it - roughly 600ms avg.

Query #2 (match all, so 12MM documents)

  • using "execution_hint":"map" - roughly 1.9s avg.
  • not using it - roughly 800ms avg.

So, since our query pattern is really close to query #1, that made a big
difference in our results. Hope that is of some help to someone else.
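
To put the numbers above in relative terms, here is a small Python sketch that turns the reported averages into ratios. The timings are the ones quoted in this message. The dictionary layout and the speedup helper are just illustrative.

```python
# Average response times reported above, in milliseconds.
timings_ms = {
    "query 1": {"map": 50, "default": 600},
    "query 2 (match all)": {"map": 1900, "default": 800},
}


def speedup(default_ms, map_ms):
    """How many times faster the map hint is; values below 1 mean slower."""
    return default_ms / map_ms


ratios = {name: speedup(t["default"], t["map"]) for name, t in timings_ms.items()}

for name, ratio in ratios.items():
    verdict = "faster" if ratio > 1 else "slower"
    factor = ratio if ratio > 1 else 1 / ratio
    print(f"{name}: map is {factor:.1f}x {verdict}")
```

So the hint is a large win for the selective query and a clear loss for match all, which is why it only makes sense when the typical query looks like query #1.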

Leonardo Menezes
(+34) 688907766
http://lmenezes.com

http://twitter.com/leonardomenezes

On Thu, Jan 24, 2013 at 4:39 PM, Drew Raines <aaraines@gmail.com> wrote:

Ivan Brusic wrote:

> Have you seen some of the latest commits?
>
> https://github.com/elasticsearch/elasticsearch/commit/346422b74751f498f037daff34ea136a131fca89
>
> There are no issues attached to these commits, so there is no
> telling what version they belong to.

The goal is for the fielddata refactoring and Lucene 4.1 integration
to appear in 0.21.0.  Much of the work is already in master.

-Drew

--

Few notes here:

  1. execution_hint for the terms facet does not change the amount of memory required for the field data itself.
  2. If you were using the soft field cache (which I really don't recommend), then your tests are problematic, because you might hit cases where more or less of the field data gets loaded as it is evicted. As long as the soft cache is enabled, your performance tests are effectively meaningless...
  3. The idea of execution_hint is to provide an additional, internal execution method that we were playing with while testing. In general, it will be slower than the regular execution method; the only case where it might be faster is with fields that have many, many unique values. Btw, it's OK to have undocumented flags in a project: when it's something we use internally but don't yet wish to properly expose, it's *intentionally* not documented.

Last, and this is important: please don't use the soft field data cache. If you do, you are probably doing it wrong. I see the advice for it being thrown around in many places, but it's not a magic-bullet solution if you don't have enough memory for the facets; it will just make things worse in 99.9% of cases. We are working on reducing the memory requirements for facets, but for now, just make sure you have enough memory for your faceting/sorting needs.

On Jan 25, 2013, at 11:39 AM, Leonardo Menezes leonardo.menezess@gmail.com wrote:

We tried using soft caches, but it made no difference because the entries were never actually being invalidated. That was only for testing purposes, and we are not using them anymore. We also don't have much of a memory problem: we are running the JVM with a 30 GB heap, which is plenty for our needs at the moment.

--

hey kimchy,
thanks for the explanation. In our particular case, soft caches made no
difference at all, since the cache was never being evicted. We are now back
to our original setup (which didn't include soft caches), and things are
running just fine. Without the execution hint, our cluster was unable to
keep working for more than 5 minutes. Of course, it might just be a really
specific case (very high cardinality, few documents matched by the average
query...).
Anyway, it might be worth documenting this flag, since someone else may run
into the same problem...

Leonardo Menezes
(+34) 688907766
http://lmenezes.com

http://es.linkedin.com/in/leonardomenezess
http://twitter.com/leonardomenezes

On Sun, Jan 27, 2013 at 11:30 AM, kimchy@gmail.com wrote:

Few notes here:

  1. execution_hint for the terms facet does not change the amount of memory
    required for the field data itself.
  2. If you were using the soft field cache (which I really don't recommend),
    then your tests are problematic, because you might hit cases where more
    or less of the field data gets loaded as it is evicted. As long as the
    soft cache is enabled, your performance tests are effectively meaningless...
  3. The idea of execution_hint is to provide an additional, internal execution
    method that we were playing with while testing. In general, it will be
    slower than the regular execution method; the only case where it might be
    faster is with fields that have many, many unique values. Btw, it's OK to
    have undocumented flags in a project: when it's something we use internally
    but don't yet wish to properly expose, it's *intentionally* not documented.

Last, and this is important: please don't use the soft field data cache.
If you do, you are probably doing it wrong. I see the advice for it being
thrown around in many places, but it's not a magic-bullet solution if you
don't have enough memory for the facets; it will just make things worse in
99.9% of cases. We are working on reducing the memory requirements for
facets, but for now, just make sure you have enough memory for your
faceting/sorting needs.

--

I would feel more comfortable documenting it once we decide to "officially" expose it. To be honest, we will expose it as an option; it's just that explaining when to use it is tricky. What I always hope is that these won't really be options the user has to decide on, but that we can make the decision automatically. The new field data refactoring that is now in master will let us implement that type of decision more easily. We will get there :)

On Jan 27, 2013, at 11:41 AM, Leonardo Menezes leonardo.menezess@gmail.com wrote:

hey kimchy,
thanks for the explanation. In our particular case, soft caches made no difference at all, since the cache was never being evicted. We are now back to our original setup (which didn't include soft caches), and things are running just fine. Without the execution hint, our cluster was unable to keep working for more than 5 minutes. Of course, it might just be a really specific case (very high cardinality, few documents matched by the average query...).
Anyway, it might be worth documenting this flag, since someone else may run into the same problem...

--

Hi!

On Jan 27, 2013, at 12:31 PM, kimchy@gmail.com wrote:

> I would feel more comfortable documenting it once we decide to "officially" expose it. To be honest, we will expose it as an option; it's just that explaining when to use it is tricky. What I always hope is that these won't really be options the user has to decide on, but that we can make the decision automatically. The new field data refactoring that is now in master will let us implement that type of decision more easily. We will get there :)

I can understand the reasons behind not exposing it, a little bit at least.
However, for us this problem almost meant not being able to put this into production. And what was worse, no one seemed to know about it, even though we didn't seem to be the only ones to have run into it.

A different point: we did not primarily run into memory problems; that was more of a secondary issue. The default TermStringOrdinals implementation was completely hot in its PriorityQueue usage, which eventually led to almost all threads being busy iterating terms.
That led to excessive heap usage, which stalled all nodes.
We've not seen any OOM errors, so we have never actually been limited by memory alone. And I strongly believe that we could've used up all the memory we'd care to put in ;)
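
To make the cost difference concrete, here is a simplified Python model of the two counting strategies discussed in this thread. It is a sketch under stated assumptions, not the Elasticsearch code: the function names, the doc_ordinals layout, and the sample data are all made up for illustration. The only point it tries to show is that an ordinal-based ranking step has to visit every unique term, while a map-based one only touches terms that occur in matched documents.

```python
import heapq
from collections import Counter


def top_terms_by_ordinals(matched_docs, doc_ordinals, terms, size):
    """Count into an array with one slot per unique term, then rank.

    The ranking step visits every unique term, even those with a zero
    count, so its cost grows with field cardinality rather than with
    the number of matched documents.
    """
    counts = [0] * len(terms)
    for doc in matched_docs:
        for ordinal in doc_ordinals[doc]:
            counts[ordinal] += 1
    ranked = heapq.nlargest(
        size, range(len(terms)), key=lambda o: (counts[o], terms[o])
    )
    return [(terms[o], counts[o]) for o in ranked if counts[o] > 0]


def top_terms_by_map(matched_docs, doc_ordinals, terms, size):
    """Count into a hash map that only holds terms seen in matched docs."""
    counts = Counter()
    for doc in matched_docs:
        for ordinal in doc_ordinals[doc]:
            counts[terms[ordinal]] += 1
    return heapq.nlargest(size, counts.items(), key=lambda item: (item[1], item[0]))


# Hypothetical toy data: four unique terms, four documents, three matched.
terms = ["acme", "globex", "initech", "umbrella"]
doc_ordinals = {0: [0, 2], 1: [2], 2: [1, 3], 3: [2, 0]}
matched = [0, 1, 3]

by_ordinals = top_terms_by_ordinals(matched, doc_ordinals, terms, 2)
by_map = top_terms_by_map(matched, doc_ordinals, terms, 2)
print(by_ordinals)
print(by_map)
```

Both functions return the same top terms. In this model the map variant wins when few documents match a very high cardinality field, and loses on match all, where it pays hashing overhead for every value.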

Ultimately it's not important for us to have this documented, obviously, but it would've saved us quite a few headaches if there had been any indication of its existence.
Basically, it took the decision to flat out replace the facet implementation for us to discover that there's actually a different implementation we could try without patching anything. You can imagine our faces once we saw its effect.
What surprises me a bit is the age of the change that introduced this parameter. Have you had any actual use for it in all that time?

best,
-k

--

The priority queue problems you ran into mainly relate to how many concurrent requests you allow on a node. It's recommended to control that by configuring the search thread pool anyhow.

Regarding the flag, there's not much to say except that, at this time, it is considered internal, though it might be publicly documented in the future. Obviously, we are working on improving facet execution in any case.
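
As a rough illustration of why a bounded pool helps, here is a Python sketch in which a fixed-size pool caps how many expensive requests run at once. This is a generic model, not Elasticsearch configuration: fake_search, the limit of 2, and the sleep are all assumptions for the demo, and the actual thread pool setting names should be taken from the documentation for your version.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_SEARCHES = 2  # hypothetical limit, chosen for the demo

lock = threading.Lock()
state = {"running": 0, "peak": 0}


def fake_search(request_id):
    """Stand-in for an expensive facet request; tracks peak concurrency."""
    with lock:
        state["running"] += 1
        state["peak"] = max(state["peak"], state["running"])
    time.sleep(0.01)  # simulate work
    with lock:
        state["running"] -= 1
    return request_id


# Ten requests arrive, but at most MAX_CONCURRENT_SEARCHES run at a time;
# the rest wait in the pool's queue instead of competing for CPU and heap.
with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_SEARCHES) as pool:
    results = list(pool.map(fake_search, range(10)))

print(f"completed {len(results)} requests, peak concurrency {state['peak']}")
```

The trade-off is latency under load: queued requests wait longer, but the node does not end up with every thread iterating terms at the same time.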

On Jan 27, 2013, at 4:37 PM, Kay Röpke kroepke@gmail.com wrote:

Hi!

On Jan 27, 2013, at 12:31 PM, kimchy@gmail.com wrote:

> I would feel more comfortable documenting it once we decide to "officially" expose it. To be honest, we will expose it as an option; it's just that explaining when to use it is tricky. What I always hope is that these won't really be options the user has to decide on, but that we can make the decision automatically. The new field data refactoring that is now in master will let us implement that type of decision more easily. We will get there :)

I can understand the reasons behind not exposing it, a little bit at least.
However, for us this problem almost meant not being able to put this into production. And what was worse, no one seemed to know about it, even though we didn't seem to be the only ones to have run into it.

A different point: we did not primarily run into memory problems; that was more of a secondary issue. The default TermStringOrdinals implementation was completely hot in its PriorityQueue usage, which eventually led to almost all threads being busy iterating terms.
That led to excessive heap usage, which stalled all nodes.
We've not seen any OOM errors, so we have never actually been limited by memory alone. And I strongly believe that we could've used up all the memory we'd care to put in ;)

Ultimately it's not important for us to have this documented, obviously, but it would've saved us quite a few headaches if there had been any indication of its existence.
Basically, it took the decision to flat out replace the facet implementation for us to discover that there's actually a different implementation we could try without patching anything. You can imagine our faces once we saw its effect.
What surprises me a bit is the age of the change that introduced this parameter. Have you had any actual use for it in all that time?

best,
-k

--
