Estimating field cache size for facets in advance

I want to do some planning around how much cache memory it will take to
facet over potentially a lot of records (millions, eventually billions).

These are mainly date histograms and term facets.

So, I have a few questions.

  1. Is it correct to say that running a facet on a field causes every shard
    to load all the values for that field into memory, before any facet
    filters are applied?

  2. What factors affect the memory consumed when this happens? Is it: number
    of documents in the shard, number of distinct values in that field,
    something else?

  3. Is there a formula for calculating/estimating the overall usage?
    (FieldDataLoader is a bit opaque if you're not a Lucene specialist.)

  4. Is the document type taken into account anywhere in this process? Or is
    the data loading done across all types in the index?

Let me go into question 4 in a little more detail. Our index contains a
large number of different types (around a hundred, I think) which have most
of their fields in common. If someone runs a facet on one type, will the
data for that field across all types get loaded?

If that's the case, are we perhaps better off having a separate index for
each type?

Thanks in advance,

A.

--

Would love to see answers to these questions too.

An important feature for ES would be graceful rejection of a facet request,
by precomputing its memory consumption, to prevent OOMs. Right now ES throws
an OOM if faceting fails, but will not automatically recover the index from
that state (only a manual cluster restart helps).

Jörg


--

Yeah, that would be interesting, especially since OOM is a real problem for
us too at the moment. So knowing whether changing the cache type or heap
size would help would definitely be a benefit (at least as an estimate,
maybe?).

One interesting thing came up while playing with the cache settings. I can
set the expiration time using curl and everything is fine:

curl -XPUT host:port/_settings -d '{ "index" : { "cache.field.expire" : "10m" } }'

But after trying to set the default value again with

curl -XPUT host:port/_settings -d '{ "index" : { "cache.field.expire" : "-1" } }'

I get an error that includes the following message:

Caused by: java.lang.IllegalArgumentException: duration cannot be negative: -1000000 NANOSECONDS

Is there a bug in parsing the argument, or am I doing something wrong?

--

This is the third time I've tried to write this reply, thanks Google Groups.

I think you'd have to iterate over the field data twice, once to construct
the estimate, and once again to load the data, so it might slow things
down. And really it'd be meaningless unless you ran a GC first, as there's
no way to know how much memory is potentially available until after a GC.
So you'd have to have a user-specified limit.

Would this be a really silly idea:

Wrap the whole FieldDataLoader#load method in a try/catch for
OutOfMemoryError.

Then if you get one, do an immediate GC (in the catch block so all the
local variables are out of scope).

Then throw an IOException: "Unable to load field data: out of heap space"
instead.
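
Something like this, just as a sketch (the class and method names here are
made up, not the real FieldDataLoader API):

import java.io.IOException;

public class SafeFieldDataLoading {

    // Hypothetical wrapper: fail the request instead of killing the node.
    public static Object loadFieldData(String field) throws IOException {
        try {
            return loadAllValuesIntoCache(field); // the expensive part
        } catch (OutOfMemoryError e) {
            // By the time we get here, the local references created inside
            // the failed load are out of scope, so a GC has a chance to
            // reclaim the partially built arrays before we rethrow.
            System.gc();
            throw new IOException("Unable to load field data for [" + field
                    + "]: out of heap space", e);
        }
    }

    // Placeholder for whatever actually builds the per-field value arrays.
    private static Object loadAllValuesIntoCache(String field) {
        throw new UnsupportedOperationException("placeholder");
    }
}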

Is that crazy? It kinda sounds crazy, but no worse than being able to take
down a node with a single bad facet.

In answer to my own questions 1 and 4: I'm now 99% sure that filters and
document type are irrelevant when loading field data into the cache, so
faceting really will cause you to load all the field values across all the
types in your index.

(Can anyone confirm/deny please?)


--

Would love to see an answer for this. Thanks for the detailed question.


--

We discussed this briefly after the ES training course in London a
couple of months ago.

If I understood Shay correctly, here's the rough memory usage (in bytes).

For single-valued fields:

4m + (4n * avg(term length in chars)) [string fields]

4m + (n * term size in bytes) [numeric fields]

For multi-valued fields:

(4m * max(num terms in doc)) + (4n * avg(term length in chars)) [string fields]

(4m * max(num terms in doc)) + (n * term size in bytes) [numeric fields]

Where m is the number of documents and n is the number of terms.
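
To make that concrete (my own made-up numbers, and assuming m and n are
counted per shard):

A single-valued string field with m = 10 million docs, n = 50,000 unique
terms and an average term length of 12 chars comes to roughly
4 * 10,000,000 + 4 * 50,000 * 12 = 42,400,000 bytes, i.e. about 40 MB per
shard.

A single-valued date field stored as a long is 4m + 8n. At second
resolution nearly every doc can have a distinct value, so n approaches m
and you pay roughly 12 bytes per doc (~120 MB for those 10 million docs).
Truncated to the minute, n collapses and you're left with little more than
the 4 bytes per doc of the first term, i.e. ~40 MB.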


--

http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg

--

Thanks for the formula!

So here "n" is the number of unique values that the field can take?
If that's so, then I would imagine that storing second-level resolution for
a date field would take a lot of memory when building the field cache. Do
you have any experience with performance and capacity for storing dates?

Vinay


--

Sorry, catching up on backlog here...

n is indeed the number of unique terms in the field you're caching.

And yes, you wouldn't want to load second-level resolution into the
field data cache if you can avoid it (e.g. for sorting or faceting).

If we're planning to facet on a datetime field, we truncate it to the
minute before indexing. (No reason you can't index two copies of the
field: one for sorting/faceting and one for querying.)
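
The truncation itself is just integer arithmetic on the epoch-millis value
before you send the document, something like this (field names are made up
for the example):

public class TruncateToMinute {
    public static void main(String[] args) {
        // Hypothetical document timestamp, in epoch milliseconds.
        long createdMillis = System.currentTimeMillis();
        // Drop everything below minute resolution: 60,000 ms per minute.
        long createdMinute = createdMillis - (createdMillis % 60000L);
        // Index createdMillis into e.g. a "created" field (full resolution,
        // for range queries) and createdMinute into e.g. "created_minute"
        // (the one you sort and facet on).
        System.out.println(createdMillis + " -> " + createdMinute);
    }
}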


--

Hi,


Hm, and I thought truncating dates like that was the trick from before
trie-based date/time fields. I just quickly grepped the ES code and didn't
see them. Maybe ES doesn't support them yet?

Otis

ELASTICSEARCH Performance Monitoring - Sematext


--

I was curious if anyone knows if this formula is still valid for
Elasticsearch 0.90.x?


--

To answer the original question 1: each shard will load the field you want
to facet on into memory. See the memory considerations section of the terms
facet documentation:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-facets-terms-facet.html#_memory_considerations_2