How does the memory usage for terms facets work?

Hi everyone

I am still a bit unclear on how terms facets load values into memory.

What people have said is that it loads all the values into memory. Does
that mean it loads all the unique values of the field into memory, or the
values of the field per document?

For example

Suppose I have documents:
{
"id" : "1",
"tags" : ["foo", "bar"]
}

{
"id" : "2",
"tags" : ["foo", "bar"]
}

Will "foo" and "bar" be loaded once or twice into memory?

Thank you!
Jieren

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Unique ones.

So faceting on a field with few unique values will scale really easily.
But if you facet on a comment field, for example, it will load (too) many terms into memory.

HTH

--
David ;)
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs
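
To make "unique ones" concrete, here is a minimal Python sketch using the two example documents from the question. It is only an illustration of deduplication: the function name unique_terms is made up for this example, and this is not how Elasticsearch or Lucene actually builds the cache.

```python
# Illustration only: a set keeps each distinct term once,
# no matter how many documents contain it.
docs = [
    {"id": "1", "tags": ["foo", "bar"]},
    {"id": "2", "tags": ["foo", "bar"]},
]


def unique_terms(documents, field):
    """Collect every distinct value of `field` across the documents."""
    terms = set()
    for doc in documents:
        terms.update(doc.get(field, []))
    return terms


terms = unique_terms(docs, "tags")
print(sorted(terms))  # ['bar', 'foo']
```

Both documents carry the same two tags, so only two terms are kept.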

Thanks for the fast answer!

This is not clearly documented anywhere, so this description of how the field
cache uses memory should help everyone.

  1. Estimates (in bytes) of the field cache for a single Lucene segment, by
    field type:

    => numbers (including datetime formats)
    48 (Java structures for the docs list) + 4 * max_doc_id * max_array_size
    +
    8 (Java structures for the unique term list) + unique_terms_count * 4

    => strings
    48 (Java structures for the docs list) + 4 * max_doc_id * max_array_size
    +
    8 (Java structures for the unique term list) + unique_terms_count * (4 +
    string_size_in_bytes)

    where:
    max_doc_id - the highest Lucene doc id + 1 (in the corresponding segment)
    string_size_in_bytes - at most 4 * string_len (UTF-8)
    max_array_size - the maximum number of elements the multivalued field has
    in any single document of the segment

  2. Since the field cache is per segment, the unique terms array is kept per
    segment too.
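
The estimates above can be turned into a small calculator. This is only a sketch of the formulas as posted, assuming the docs-list part and the unique-terms part are added together for numbers the same way they are for strings. The function names are made up for this example, and the string variant sums the actual UTF-8 size of each term instead of using one fixed string_size_in_bytes.

```python
def docs_list_bytes(max_doc_id, max_array_size):
    """Docs list part: 48 bytes of Java structures plus 4 bytes per slot."""
    return 48 + 4 * max_doc_id * max_array_size


def numeric_field_cache_bytes(max_doc_id, max_array_size, unique_terms_count):
    """Per-segment estimate for numeric (and date) fields."""
    terms = 8 + unique_terms_count * 4
    return docs_list_bytes(max_doc_id, max_array_size) + terms


def string_field_cache_bytes(max_doc_id, max_array_size, unique_terms):
    """Per-segment estimate for string fields.

    `unique_terms` is an iterable of the distinct strings in the segment;
    each one costs 4 bytes plus its UTF-8 encoded size.
    """
    terms = 8 + sum(4 + len(t.encode("utf-8")) for t in unique_terms)
    return docs_list_bytes(max_doc_id, max_array_size) + terms


# The two example documents: 2 docs, at most 2 tags each, terms "foo" and "bar".
print(string_field_cache_bytes(2, 2, ["foo", "bar"]))  # 86

# A numeric field: 1,000,000 docs, single-valued, 1,000 distinct values.
print(numeric_field_cache_bytes(1_000_000, 1, 1_000))  # 4004056
```

In the numeric example almost all of the estimate comes from the docs list (4 bytes per document slot), not from the unique values.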

Note that you are using a multivalued field for tags. Even if only 1
document has, e.g., 10 elements in tags and all the other documents have 1
element (still within a single Lucene segment), the field cache uses a
two-dimensional array for the document list with Y-size = 10, so it takes
the same amount of memory as if every document had 10 values in tags.

So one thing is the unique terms, which are very simple to estimate. The
second thing is the array of document pointers, which can be very heavy. I
strongly do NOT recommend using facets on multivalued fields. In that case
use nested instead: each element of the field then becomes a separate
document, and this situation does not occur.
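
Here is a rough sketch of why one long array is so expensive. It only models the docs-list part of the estimate (48 + 4 * max_doc_id * max_array_size) and assumes doc ids run from 0 to n-1 without gaps, so max_doc_id equals the number of documents. The function name and the sample segments are invented for illustration.

```python
def docs_list_bytes(values_per_doc):
    """Estimate the docs-list size for one segment.

    `values_per_doc` holds, for each document, how many values it has in
    the multivalued field. Every document gets as many slots as the
    largest document needs.
    """
    max_doc_id = len(values_per_doc)
    max_array_size = max(values_per_doc, default=0)
    return 48 + 4 * max_doc_id * max_array_size


single_valued = [1] * 1000       # every doc has 1 tag
one_outlier = [1] * 999 + [10]   # a single doc has 10 tags
all_ten = [10] * 1000            # every doc has 10 tags

print(docs_list_bytes(single_valued))  # 4048
print(docs_list_bytes(one_outlier))    # 40048
print(docs_list_bytes(all_ten))        # 40048
```

A single document with 10 tags costs as much as if all 1000 documents had 10 tags.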

In my case, optimizing multivalued fields and switching to nested brought
field cache usage down to about 2GB instead of 17GB :)

Remember that this estimate applies to a single segment. Each shard
consists of 10-20 segments (with default ES settings). The maximum segment
size (by default) is 5GB, and the merge policy keeps a few big segments
(up to 5GB) while most segments are small (this depends on the shard size,
of course). You can check segment sizes by getting localhost:9200//_segments.

I hope that this will solve your problems with the field cache exploding :)
It solved mine :)

Best regards.
Marcin Dojwa

Hi,

Thanks a lot for your detailed explanation...

I have two questions:

  1. Which ES version are those terms facet formulas based on?

  2. What does multivalued mean: a field that contains multiple values, or
    the multi field type offered by the ES mapping?

Thanks in advance

  1. I believe Marcin is referring to the field cache structure found in
    versions pre-0.90. The latest version does improve on the use case with
    high-cardinality fields, but how and by how much I do not know (still on
    0.20).

  2. The former (field contains multivalues). Multi-field essentially creates
    two or more fields under the hood (which in turn can also be multi-valued),
    so the original field is not sharing the same Lucene field.

--
Ivan

Hi,

  1. Yes, these calculations refer to 0.20.6
  2. Like Ivan said, a field that contains multiple values, e.g. arrays.

Best regards.
Marcin Dojwa
