This is the third time I've tried to write this reply, thanks Google Groups.
I think you'd have to iterate over the field data twice, once to construct
the estimate, and once again to load the data, so it might slow things
down. And really it'd be meaningless unless you ran a GC first, as there's
no way to know how much memory is potentially available until after a GC.
So you'd have to have a user-specified limit.
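Roughly what I'm imagining, with completely made-up names (this isn't the real FieldDataLoader code, just a sketch of the estimate-then-load flow against a user-specified limit):

    import java.io.IOException;

    // Purely hypothetical sketch -- class and method names are made up, not the
    // real FieldDataLoader API. It just shows the two-pass idea: estimate first,
    // check against a user-specified limit, then load for real.
    class LimitedFieldDataLoader {

        private final long limitBytes; // user-specified, since free heap is unknowable without a GC

        LimitedFieldDataLoader(long limitBytes) {
            this.limitBytes = limitBytes;
        }

        long[] load(String field) throws IOException {
            long estimate = estimateSizeInBytes(field); // first pass over the field data
            if (estimate > limitBytes) {
                throw new IOException("Field data for [" + field + "] estimated at "
                        + estimate + " bytes, over the limit of " + limitBytes + " bytes");
            }
            return doLoad(field); // second pass, the actual load
        }

        // Stand-ins for the real estimation and loading logic.
        private long estimateSizeInBytes(String field) { return 0L; }

        private long[] doLoad(String field) { return new long[0]; }
    }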
Would this be a really silly idea:
Wrap the whole FieldDataLoader#load method in a try/catch for
OutOfMemoryError.
Then if you get one, do an immediate GC (in the catch block so all the
local variables are out of scope).
Then throw an IOException: "Unable to load field data: out of heap space"
instead.
Is that crazy? It kinda sounds crazy, but no worse than being able to take
down a node with a single bad facet.
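In code terms it would look something like this (same made-up names as the sketch above, so again not the actual FieldDataLoader internals):

    // This would replace (or wrap) the plain load() in the sketch above.
    long[] load(String field) throws IOException {
        try {
            return doLoad(field); // existing load logic, unchanged
        } catch (OutOfMemoryError e) {
            // All of doLoad()'s locals are out of scope by now, so an explicit GC
            // can reclaim the half-built field data before we report the failure.
            System.gc();
            throw new IOException("Unable to load field data for [" + field
                    + "]: out of heap space");
        }
    }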
In answer to my own questions 1 and 4: I'm now 99% sure that filters and
document type are irrelevant when loading field data into the cache, so
faceting really will cause you to load all the field values across all the
types in your index.
(Can anyone confirm/deny please?)
On Monday, 8 October 2012 09:29:58 UTC+1, Jörg Prante wrote:
Would love to see answers to these questions too.
An important feature for ES would be a graceful rejection of faceting over
a field by precomputing the memory consumption, to prevent OOMs. Right now
ES throws an OOM if faceting fails, but will not automatically recover the
index from that state (only a manual cluster restart helps).
Jörg
On Sunday, October 7, 2012 11:45:44 AM UTC+2, Andrew Clegg wrote:
I want to do some planning around how much cache memory it will take to
facet over potentially a lot of records (millions, eventually billions).
These are mainly date histograms and term facets.
So, I have a few questions.
1. Is it correct to say that running a facet on a field causes every
shard to load all the values for that field into memory? Before any facet
filters are applied?

2. What factors affect the memory consumed when this happens? Is it:
number of documents in the shard, number of distinct values in that field,
something else?

3. Is there a formula for calculating/estimating the overall usage?
(FieldDataLoader is a bit opaque if you're not a Lucene specialist.)

4. Is the document type taken into account anywhere in this process? Or
is the data loading done across all types in the index?
Let me go into 4 in a little more detail. Our index contains a large
number of different types (around a hundred I think) which have most of the
same fields in common. If someone does a facet on one type, will the data
for that field across all types get loaded?
If that's the case, are we perhaps better off having a separate index
for each type?
Thanks in advance,
A.