Consolidate facet search knowledge about memory usage

Like many others, I have stumbled over memory issues during facet searches. This problem comes up in many threads, so maybe it is time to consolidate the scattered knowledge in a single place, like a blog post.

It is still unclear to me what it means when it is said that all field values are loaded into memory. Are really only the values loaded? For example, say I have 1 million documents with a field tag that contains one of the values 'a', 'b', ..., 'z'. Does that mean that 26 strings (a to z) are loaded into memory, or is the whole dictionary loaded? This leads to the question: does only the number of distinct values count, or does the number of documents matter too?
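
To make the question concrete, here is the back-of-the-envelope arithmetic I have in mind. It is only a sketch: the per-document ordinal layout is my assumption about how a single-valued string field might be un-inverted, not something I have verified.

    // Hypothetical estimate: 1 million docs, field "tag" with 26 distinct values.
    // Assumes (unverified) one ordinal per document plus the distinct values.
    public class FacetMemoryEstimate {
        public static void main(String[] args) {
            long numDocs = 1000000L;     // documents in one shard
            long numDistinct = 26L;      // 'a' .. 'z'
            long avgValueBytes = 2L;     // rough size of each short tag value

            long ordinalsBytes = numDocs * 4L;              // one int ordinal per doc
            long valuesBytes = numDistinct * avgValueBytes; // the distinct strings

            System.out.println("ordinals: " + ordinalsBytes + " bytes"); // ~4 MB
            System.out.println("values:   " + valuesBytes + " bytes");   // ~52 bytes
        }
    }

If only the 26 distinct values were loaded, memory usage would be trivial; if there is a per-document structure like the ordinals above, the document count dominates. Which of the two it is, is exactly my question.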

Do I have to calculate my memory usage per shard or per node?

Maybe we can aggregate all relevant information about this topic in this or another thread and then come up with a blog post. I think it is worth the effort, and I would really love to help with that.

Thanks in advance!

--

Hi,

Not sure if you saw it, but a few months ago we wrote a post on this topic over on Sematext Blogs - http://Blog.sematext.com

Otis

--

@Nikolai +1 on having a chance for the community to aggregate and contribute content by topic.

@Otis, can you point me to the specific blog post?

Regards,
Lukas


--

Hello!

Lukáš, I think Otis was thinking about http://blog.sematext.com/2012/05/17/elasticsearch-cache-usage/

--

Regards,

Rafał Kuć

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch


--

Great idea, Nikolai. I find myself wondering about potential memory issues with facets all the time, based on what I read in the forums. I haven't gone into production yet, so all looks good right now, but when I do I'll have Elasticsearch running on a data server with about 30 sites hitting it, each with their own index. Facets are one of those features that is so important to have for many users (and a must for mine), but I'm not sure whether people run into memory issues only when dealing with hundreds of millions of documents that generate dozens of facets with hundreds of items per facet, or whether it can also happen with a few thousand documents with 5 facets and 50 results per facet, for example.


--


I agree, it would be great to try to discuss memory usage as a topic.

"all field values are loaded into memory"

That sounds like the Lucene-level FieldCache, which is a very interesting structure. I believe that when you use doc['fieldName'] in a script you are actually accessing one of these caches.
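
Here is a tiny self-contained sketch of that structure at the Lucene level, using the Lucene 3.x-era FieldCache API (the field name and values are made up for the demo):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.search.FieldCache;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class FieldCacheDemo {
        public static void main(String[] args) throws Exception {
            RAMDirectory dir = new RAMDirectory();
            IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(
                    Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36)));
            for (char c = 'a'; c <= 'z'; c++) {
                Document doc = new Document();
                doc.add(new Field("tag", String.valueOf(c),
                        Field.Store.NO, Field.Index.NOT_ANALYZED));
                writer.addDocument(doc);
            }
            writer.close();

            IndexReader reader = IndexReader.open(dir);
            // Un-inverts the "tag" field into a per-document String array;
            // this array is what doc['tag'] in a script ends up reading.
            String[] tags = FieldCache.DEFAULT.getStrings(reader, "tag");
            System.out.println("doc 0 tag = " + tags[0]);
            reader.close();
        }
    }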

A field cache uses memory within one ES shard, but it does not hold all values for all fields of the whole shard: each cache covers just one Lucene segment (a physical subdivision of a Lucene index), and any one cache holds only one field's values (0 or more values per document).

Walking through a result set to record/gather/count something, in this case a facet, is done by a Lucene Collector. Typically a collector needs to look at only a field or two; consider a terms stats facet as an example of something that needs only two fields. A Collector is notified when it moves between segments, so it can load the field cache for just the field(s) it needs as collection moves from segment to segment.
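
A minimal sketch of such a collector against the Lucene 3.x Collector API (counting the values of one field, which is roughly what a terms facet does; the class and field names are mine):

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.FieldCache;
    import org.apache.lucene.search.Scorer;

    public class TermsFacetCollector extends Collector {
        private final String field;
        private final Map<String, Integer> counts = new HashMap<String, Integer>();
        private String[] segmentValues; // current segment's un-inverted values

        public TermsFacetCollector(String field) {
            this.field = field;
        }

        @Override
        public void setScorer(Scorer scorer) {
            // scores are not needed just to count values
        }

        @Override
        public void setNextReader(IndexReader segmentReader, int docBase)
                throws IOException {
            // Called once per segment: load only this segment's cache
            // for the one field we care about.
            segmentValues = FieldCache.DEFAULT.getStrings(segmentReader, field);
        }

        @Override
        public void collect(int doc) {
            String value = segmentValues[doc]; // doc is segment-relative here
            if (value != null) {
                Integer old = counts.get(value);
                counts.put(value, old == null ? 1 : old + 1);
            }
        }

        @Override
        public boolean acceptsDocsOutOfOrder() {
            return true; // order doesn't matter for counting
        }

        public Map<String, Integer> getCounts() {
            return counts;
        }
    }

You would hand this to IndexSearcher.search(query, collector); the point is that setNextReader only ever loads the current segment's values for the one field.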

This is how an index, even a single Lucene index, can support collecting across millions of documents: it doesn't load all documents, or even all stored values of a particular document, just the values of the fields you ask for, one segment at a time.

But there is a serious trick here: the values backing one of these field caches DO NOT have to take up JVM heap memory. They can be mapped directly onto the file system cache (yes, I didn't know you could do that either). What you see (or what the folks who write Lucene actually see) points into memory mapped via a java.nio.MappedByteBuffer, with subsequences, direct char buffers, CharSequence views, and all the various other ways to look at the bytes, never copying them, but continuing to point directly into the memory-mapped file.
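
For example (just a sketch; the index path is hypothetical), opening an index through Lucene's MMapDirectory is what gets you those memory-mapped buffers:

    import java.io.File;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.MMapDirectory;

    public class MMapDemo {
        public static void main(String[] args) throws Exception {
            // Index files are mapped into virtual memory (java.nio
            // MappedByteBuffer under the hood); reads go through the OS
            // page cache instead of copying bytes onto the JVM heap.
            MMapDirectory dir = new MMapDirectory(new File("/path/to/index"));
            IndexReader reader = IndexReader.open(dir);
            System.out.println("docs in index: " + reader.maxDoc());
            reader.close();
        }
    }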

That means that accessing an index uses lots of file system memory and much less heap memory. If you overdo the heap and leave nothing for the file system cache, you can hurt Lucene performance.

Yes, that complicates memory calculations.
-Paul

--