On 9/5/2012 1:48 PM, nikolai.alex@me.com wrote:
Like many others, I have stumbled over memory issues during facet searches. The problem is addressed in many scattered threads. Maybe it is time to consolidate this knowledge in a single place, like a blog post.
It is still unclear to me what it means when it is said that all field values are loaded into memory. Are really only the values loaded? For example, I have 1 million documents with a field 'tag' that contains one of the values 'a', 'b', ... 'z'. Does that mean that 26 strings (a to z) are loaded into memory? Or is the whole dictionary loaded into memory? This leads to the question: does only the number of distinct values count, or is the number of documents also important?
Do I have to calculate my memory usage per shard or per node?
Maybe we can aggregate all relevant information about this topic in this or another thread, and then come up with a blog post. I think it is worth the effort and I would really love to help with that.
Thanks in advance!
I agree, it would be great to try to discuss memory usage as a topic.
"all field values are loaded into memory"
That sounds like the Lucene-level fieldCache, which is a very interesting structure.
I believe when you do doc['fieldName'] in a script you are actually
accessing one of these caches.
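As a rough, hedged illustration of where such a script shows up, here is a sketch against the Java client API of that era (index name, field name and host are made up for the example):

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.facet.FacetBuilders;

public class ScriptFieldSketch {
    public static void main(String[] args) {
        TransportClient client = new TransportClient()
                .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));

        // Both the terms facet and the doc['tag'] script field end up reading
        // the per-segment field data for the 'tag' field.
        SearchResponse response = client.prepareSearch("myindex")
                .setQuery(QueryBuilders.matchAllQuery())
                .addFacet(FacetBuilders.termsFacet("tags").field("tag"))
                .addScriptField("tag_again", "doc['tag'].value")
                .execute().actionGet();

        System.out.println(response);
        client.close();
    }
}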
A field cache uses memory in one ES shard, but it does not hold all values for all fields in the whole shard: any one cache covers just one Lucene segment (a physical sub-division of a Lucene index) and only one field's values (0 or more values per document).
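As a sketch (assuming the Lucene 3.x FieldCache API that ES used at the time; the index path and field name are made up), loading one of those per-segment caches looks roughly like this:

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.store.FSDirectory;

public class FieldCacheSketch {
    public static void main(String[] args) throws Exception {
        IndexReader topReader = IndexReader.open(FSDirectory.open(new File("/path/to/index")));

        // The cache is built per segment, not for the whole index.
        IndexReader[] segments = topReader.getSequentialSubReaders();
        if (segments == null) {
            segments = new IndexReader[] { topReader }; // single-segment case
        }
        for (IndexReader segmentReader : segments) {
            // One entry per document in this segment, for just the 'tag' field.
            String[] tags = FieldCache.DEFAULT.getStrings(segmentReader, "tag");
            System.out.println("cached " + tags.length + " values for 'tag' in one segment");
        }
        topReader.close();
    }
}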
Walking through a result set to record/gather/count something, in this
case a facet, is done by a Lucene Collector.
Typically collectors need to look at only a field or two when moving between segments; consider a Terms Stats facet as an example of something that needs only two fields.
A Collector is told when it moves between segments. It can load the
fieldCache for just the field(s) it is looking for as the index changes
segments.
This is how an index, even a single Lucene index, can support collecting across millions of documents: it doesn't load all documents, or even all stored values of a particular document, just the values of the fields you ask for, one segment at a time.
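To make that concrete, here is a minimal, hedged sketch of such a Collector against the Lucene 3.x API (class and field names are made up; the real ES facet collectors are more elaborate). It reloads the field cache for exactly one field each time the search moves to a new segment, and counts value occurrences:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.Scorer;

/** Counts how often each value of one field occurs among the matching docs. */
public class TermsCountingCollector extends Collector {
    private final String field;
    private final Map<String, Integer> counts = new HashMap<String, Integer>();
    private String[] values; // per-segment cache for the single field we need

    public TermsCountingCollector(String field) {
        this.field = field;
    }

    @Override
    public void setScorer(Scorer scorer) {
        // scores are not needed for counting
    }

    @Override
    public void setNextReader(IndexReader reader, int docBase) throws IOException {
        // called when the search moves to a new segment:
        // load (or reuse) the field cache for just this field and this segment
        values = FieldCache.DEFAULT.getStrings(reader, field);
    }

    @Override
    public void collect(int doc) {
        String value = values[doc]; // doc id is segment-relative here
        if (value != null) {
            Integer current = counts.get(value);
            counts.put(value, current == null ? 1 : current + 1);
        }
    }

    @Override
    public boolean acceptsDocsOutOfOrder() {
        return true;
    }

    public Map<String, Integer> getCounts() {
        return counts;
    }
}

You would pass an instance to IndexSearcher.search(query, collector) and read getCounts() afterwards.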
But there is a serious trick here: the values for one of these field caches DO NOT TAKE UP JVM HEAP MEMORY. They are mapped directly onto the file system buffer (yes, I didn't know you could do that either). What you see (or what the folks who write Lucene and the FS directory implementations actually see) points to memory mapped directly into a java.nio MappedByteBuffer, with subsequences, direct char buffers, CharSequences and all the various other ways to look at the bytes, never copying the bytes, but continuing to point directly into the memory-mapped file system buffer.
That means that accessing an index uses lots of file system memory and much less heap memory.
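A plain-JDK sketch of what that mapping looks like (the file name is made up; Lucene's MMapDirectory does essentially this, with more care about unmapping and buffer sizes):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapSketch {
    public static void main(String[] args) throws Exception {
        RandomAccessFile raf = new RandomAccessFile("/path/to/index/_0.frq", "r");
        FileChannel channel = raf.getChannel();

        // Map the file into the process address space: the bytes stay in the
        // OS file system cache and are not copied onto the JVM heap.
        MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());

        // Views such as slice() or asCharBuffer() still point into the same
        // mapped region; nothing is copied.
        System.out.println("mapped " + buffer.capacity() + " bytes, first byte: " + buffer.get(0));

        channel.close();
        raf.close();
    }
}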
If you overdo the heap in Lucene and don't leave anything for the file system, you can hurt Lucene performance.
Yes, that complicates memory calculations.
-Paul