Bool filters vs and/or/not, field cache

Hi,

We have a cluster of 2 nodes, each with 32GB RAM, of which 16GB is allocated to the Java heap. We're running Oracle Java 1.7.0_17 and ES 0.19.8.

We've seen organic growth of our heap usage (via bigdesk) up to around 12GB. A recent release of our application seems to have bumped this up to 15GB, so it's time (some would say 'overdue') we looked at our memory usage. The lion's share of this memory is field cache, which is about 11GB on each node.

I found this thread from a while ago where Clinton talks about bool filters vs and/or/not:

https://groups.google.com/forum/#!msg/elasticsearch/PS12RcyNSWc/I1PX1r0RfFcJ

In particular:

Bool filter vs and/or/not:

The bool filter consumes bitsets. Most filters produce bitsets, eg a
filter like { term: { status: "active" }} will examine every document in
the index and create a bitset for the entire index (one bit per
document) which contains '1' if the document matches, and '0' if it
doesn't.

[snip]

and/or/not filters don't demand bitsets. They work doc-by-doc, so
they're a good fit for geo filters. They also short-circuit. If a doc
has already been excluded by an earlier filter, it won't run the later
filters.

So to put it all together, combine the bitset filters with a bool
filter, and then combine the bool filter with the geo filter using an
'and' clause, with the geo-filter after the and (see example below)
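
The example mentioned in that post isn't quoted here. As a rough sketch of the shape I understand it to describe, assuming the filtered-query DSL of this ES generation and purely hypothetical field names and values (status, type, location, 10km), the request body could be built in Python like this:

```python
import json

# Filters that produce bitsets are grouped in a single bool filter.
# Field names and values here are made up for illustration.
bitset_filters = {
    "bool": {
        "must": [
            {"term": {"status": "active"}},
            {"term": {"type": "venue"}},
        ]
    }
}

# The geo filter works doc-by-doc, so it is kept out of the bool filter.
geo_filter = {
    "geo_distance": {
        "distance": "10km",
        "location": {"lat": 51.5, "lon": -0.12},
    }
}

# The 'and' filter combines the two, with the geo filter last so that it
# only runs on docs the bool filter has not already excluded.
query = {
    "query": {
        "filtered": {
            "query": {"match_all": {}},
            "filter": {"and": [bitset_filters, geo_filter]},
        }
    }
}

print(json.dumps(query, indent=2))
```

The only point of the sketch is the ordering inside the 'and' list: bool first, geo last.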

Our application currently uses and/or/not extensively. If we converted those to be bool filters, would it be reasonable to expect to see field cache usage drop - as those bitsets would start being used, instead of doc-by-doc processing? Or am I misunderstanding this, and actually, I should just add servers?

Cheers,
Dan

Dan Fairs | dan.fairs@gmail.com | @danfairs | secondsync.com

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

> Our application currently uses and/or/not extensively. If we converted
> those to be bool filters, would it be reasonable to expect to see
> field cache usage drop - as those bitsets would start being used,
> instead of doc-by-doc processing? Or am I misunderstanding this, and
> actually, I should just add servers?

Unfortunately not. If a request needs access to the values of a field
(field data), then the values for ALL docs in the index are loaded into
memory. The logic is that, even if you only need the values for docs
1..10 on this request, you'll probably need values for other docs on the
next request.

Field data is loaded in these situations:

  • sorting
  • faceting
  • script (doc['field'])
  • numeric_range or geo filters
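
To find out which of these apply to a given request, something like the following Python helper could be run over the request bodies the application sends. It is only a rough heuristic, not anything ES provides: the function name, the set of filter names (my reading of "numeric_range or geo filters") and the sample request are all assumptions, and it flags every script whether or not it actually uses doc['field'].

```python
# Filter names assumed to need field data, per the list above.
FIELD_DATA_FILTERS = {
    "numeric_range",
    "geo_distance",
    "geo_bounding_box",
    "geo_polygon",
}


def field_data_triggers(body):
    """Return a sorted list of reasons a request body might load field data.

    Walks the nested dicts/lists of the request and looks at key names only,
    so a document field that happens to be called "sort" or "script" would
    be reported as a false positive.
    """
    reasons = set()

    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "sort":
                    reasons.add("sorting")
                elif key == "facets":
                    reasons.add("faceting")
                elif key == "script":
                    reasons.add("script")
                elif key in FIELD_DATA_FILTERS:
                    reasons.add(key + " filter")
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(body)
    return sorted(reasons)


# Hypothetical request: a term filter, a polygon filter and a sort.
request = {
    "query": {
        "filtered": {
            "query": {"match_all": {}},
            "filter": {
                "and": [
                    {"term": {"status": "active"}},
                    {"geo_polygon": {"location": {"points": [[0, 0], [0, 1], [1, 1]]}}},
                ]
            },
        }
    },
    "sort": [{"created": "desc"}],
}

print(field_data_triggers(request))
```

For this sample it reports the polygon filter and the sort, but not the term filter.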

Bitsets vs and/or is a different matter: it concerns how filters are
applied, not what is loaded into memory. All filters other than
numeric_range and geo* produce bitsets, so it is best to combine these
using bool filters.

A geo calculation is relatively expensive, so you don't want to run it
on all docs (as you would for a bitset). Instead, use it as the last
clause in an and/or so that it doesn't perform the calculation on docs
that have already been excluded. (However, even if you only do the
calculation on one doc, it still loads all values for the geo_point into
memory in case they are needed later on.)
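
A toy Python model of the difference, to make the short-circuiting concrete. This is not how ES is implemented internally; the docs, the is_active and near_london functions and the call counter are all invented for illustration, with near_london standing in for an expensive geo calculation.

```python
# Four made-up docs; two are active, one of those is near London.
docs = [
    {"status": "active", "lat": 51.50, "lon": -0.12},
    {"status": "inactive", "lat": 51.51, "lon": -0.10},
    {"status": "active", "lat": 48.85, "lon": 2.35},
    {"status": "inactive", "lat": 40.71, "lon": -74.00},
]

calls = {"geo": 0}  # counts how often the "expensive" check runs


def is_active(doc):
    return doc["status"] == "active"


def near_london(doc):
    """Stand-in for a costly geo calculation."""
    calls["geo"] += 1
    return abs(doc["lat"] - 51.5) < 0.5 and abs(doc["lon"] + 0.12) < 0.5


# Bitset style: every filter is evaluated for every doc, producing one
# bit per doc, and the bitsets are then combined.
calls["geo"] = 0
active_bits = [is_active(d) for d in docs]
geo_bits = [near_london(d) for d in docs]
bitset_hits = [a and g for a, g in zip(active_bits, geo_bits)]
bitset_geo_calls = calls["geo"]

# and style: filters are applied doc-by-doc, and Python's 'and'
# short-circuits, so the geo check is skipped for excluded docs.
calls["geo"] = 0
and_hits = [is_active(d) and near_london(d) for d in docs]
and_geo_calls = calls["geo"]

print(bitset_hits == and_hits, bitset_geo_calls, and_geo_calls)
```

Both approaches match the same docs, but the geo check runs for all four docs in the bitset style and only for the two active docs in the and style.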

clint


Hi Clint,

On 21 Mar 2013, at 10:51, Clinton Gormley clint@traveljury.com wrote:

> > Our application currently uses and/or/not extensively. If we converted
> > those to be bool filters, would it be reasonable to expect to see
> > field cache usage drop - as those bitsets would start being used,
> > instead of doc-by-doc processing? Or am I misunderstanding this, and
> > actually, I should just add servers?
>
> Unfortunately, not. If you need access to the values for a field (field
> data) then it loads values into memory for ALL docs in the index.

[snip]

Right, OK - thanks for clarifying.

> The bitsets vs and/or is a different matter: how filters are applied.
> All filters other than numeric_range and geo* produce bitsets, so it is
> best to combine these using bool filters.
>
> A geo calculation is relatively expensive, so you don't want to run it
> on all docs (as you would for a bitset). Instead, use it as the last
> clause in an and/or so that it doesn't perform the calculation on docs
> that have already been excluded. (However, even if you only do the
> calculation on one doc, it still loads all values for the geo_point into
> memory in case they are needed later on.)

OK, that's an interesting insight. We currently have a set of 'base' filters, to which other filters are generally applied on a per-request basis, and that base filter does actually include a polygon filter. Sounds like I need to try to move things around to make sure that particular filter is applied last of all.
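
Roughly what I have in mind, as a Python sketch rather than our actual code. The build_filter helper, the set of filter names treated as doc-by-doc, and the sample field names (location, channel, status) are all hypothetical. It splits a flat list of filter clauses into the bitset-producing ones, which go into a bool filter, and the geo/numeric_range ones, which are appended after it in an 'and'.

```python
# Filter types assumed NOT to produce bitsets, per Clint's description.
DOC_BY_DOC = {"numeric_range", "geo_distance", "geo_bounding_box", "geo_polygon"}


def is_doc_by_doc(clause):
    # A filter clause is a dict whose key names the filter type.
    return any(name in DOC_BY_DOC for name in clause)


def build_filter(clauses):
    """Combine bitset filters in a bool, then 'and' the doc-by-doc ones last."""
    bitset = [c for c in clauses if not is_doc_by_doc(c)]
    doc_by_doc = [c for c in clauses if is_doc_by_doc(c)]
    if not doc_by_doc:
        return {"bool": {"must": bitset}}
    if not bitset:
        return {"and": doc_by_doc}
    return {"and": [{"bool": {"must": bitset}}] + doc_by_doc}


# Base filters (including the polygon) plus filters added per request.
base_filters = [
    {"geo_polygon": {"location": {"points": [[-70, 40], [-80, 30], [-90, 20]]}}},
    {"term": {"channel": "bbc1"}},
]
per_request = [{"term": {"status": "active"}}]

combined = build_filter(base_filters + per_request)
print(combined)
```

With that, the polygon ends up as the final clause regardless of where it appears in the base filters.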

Thanks,
Dan

Dan Fairs | dan.fairs@gmail.com | @danfairs | secondsync.com
