Estimating field cache size for facets in advance

I want to do some planning around how much cache memory it will take to
facet over potentially a lot of records (millions, eventually billions).

These are mainly date histograms and term facets.

So, I have a few questions.

  1. Is it correct to say that running a facet on a field causes every shard
    to load all the values for that field into memory, before any facet
    filters are applied?

  2. What factors affect the memory consumed when this happens? Is it: number
    of documents in the shard, number of distinct values in that field,
    something else?

  3. Is there a formula for calculating/estimating the overall usage?
    (FieldDataLoader is a bit opaque if you're not a Lucene specialist.)

  4. Is the document type taken into account anywhere in this process? Or is
    the data loading done across all types in the index?

Let me go into question 4 in a little more detail. Our index contains a
large number of different types (around a hundred, I think) which have most
of their fields in common. If someone runs a facet on one type, will the
data for that field across all types get loaded?

If that's the case, are we perhaps better off having a separate index for
each type?

Thanks in advance,

A.

--

Would love to see answers to these questions too.

An important feature for ES would be graceful rejection of a facet request,
by precomputing its memory consumption, to prevent OOMs. Right now ES throws
an OOM if faceting fails, but will not automatically recover the index from
that state (only a manual cluster restart helps).

Jörg


--

Yeah, that would be interesting, especially since OOM is a real problem for
us too at the moment. So knowing whether changing the cache type or heap
size would help would definitely be a benefit (at least as an estimate,
maybe?).

One interesting thing came up while playing with the cache settings. I can
set the expiration time using curl and everything is fine:

curl -XPUT host:port/_settings -d '{ "index" : { "cache.field.expire" : "10m" } }'

But after trying to set the default value again with

curl -XPUT host:port/_settings -d '{ "index" : { "cache.field.expire" : "-1" } }'

I get an error that includes the following message:

Caused by: java.lang.IllegalArgumentException: duration cannot be negative: -1000000 NANOSECONDS

Is there a bug in parsing the argument, or am I doing something wrong?

--

This is the third time I've tried to write this reply, thanks Google Groups.

I think you'd have to iterate over the field data twice, once to construct
the estimate, and once again to load the data, so it might slow things
down. And really it'd be meaningless unless you ran a GC first, as there's
no way to know how much memory is potentially available until after a GC.
So you'd have to have a user-specified limit.

Would this be a really silly idea:

Wrap the whole FieldDataLoader#load method in a try/catch for
OutOfMemoryError.

Then if you get one, do an immediate GC (in the catch block so all the
local variables are out of scope).

Then throw an IOException: "Unable to load field data: out of heap space"
instead.
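
Something like this, just as a sketch (the class and method names here are
made up, not the real FieldDataLoader API):

import java.io.IOException;

public class SafeFieldDataLoading {

    // Hypothetical wrapper: fail the request instead of killing the node.
    public static Object loadFieldData(String field) throws IOException {
        try {
            return loadAllValuesIntoCache(field); // the expensive part
        } catch (OutOfMemoryError e) {
            // By the time we get here, the local references created inside
            // the failed load are out of scope, so a GC has a chance to
            // reclaim the partially built arrays before we rethrow.
            System.gc();
            throw new IOException("Unable to load field data for [" + field
                    + "]: out of heap space", e);
        }
    }

    // Placeholder for whatever actually builds the per-field value arrays.
    private static Object loadAllValuesIntoCache(String field) {
        throw new UnsupportedOperationException("placeholder");
    }
}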

Is that crazy? It kinda sounds crazy, but no worse than being able to take
down a node with a single bad facet.

In answer to my own questions 1 and 4: I'm now 99% sure that filters and
document type are irrelevant when loading field data into the cache, so
faceting really will cause you to load all the field values across all the
types in your index.

(Can anyone confirm/deny please?)


--

Would love to see an answer for this. Thanks for the detailed question.


--

We discussed this briefly after the ES training course in London a
couple of months ago.

If I understood Shay correctly, here's the rough memory usage (in bytes).

For single-valued fields:

4m + (4n * avg(term length in chars)) [string fields]

4m + (n * term size in bytes) [numeric fields]

For multi-valued fields:

(4m * max(num terms in doc)) + (4n * avg(term length in chars)) [string fields]

(4m * max(num terms in doc)) + (n * term size in bytes) [numeric fields]

Where m is the number of documents and n is the number of terms.
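
To make that concrete (my own made-up numbers, and assuming m and n are
counted per shard):

A single-valued string field with m = 10 million docs, n = 50,000 unique
terms and an average term length of 12 chars comes to roughly
4 * 10,000,000 + 4 * 50,000 * 12 = 42,400,000 bytes, i.e. about 40 MB per
shard.

A single-valued date field stored as a long is 4m + 8n. At second
resolution nearly every doc can have a distinct value, so n approaches m
and you pay roughly 12 bytes per doc (~120 MB for those 10 million docs).
Truncated to the minute, n collapses and you're left with little more than
the 4 bytes per doc of the first term, i.e. ~40 MB.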


--

http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg

--

Thanks for the formula!

So here "n" is the number of unique values that the field can take?
If that's so, then I would imagine that storing second-level resolution for
a date field would take a lot of memory when building the field cache. Do
you have any experience with performance and capacity for storing dates?

Vinay


--

Sorry, catching up on backlog here...

n is indeed the number of unique terms in the field you're caching.

And yes, you wouldn't want to load second-level resolution into the
field data cache if you can avoid it (e.g. for sorting or faceting).

If we're planning to facet on a datetime field, we truncate it to the
minute before indexing. (No reason you can't index two copies of the
field: one for sorting/faceting and one for querying.)
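
The truncation itself is just integer arithmetic on the epoch-millis value
before you send the document, something like this (field names are made up
for the example):

public class TruncateToMinute {
    public static void main(String[] args) {
        // Hypothetical document timestamp, in epoch milliseconds.
        long createdMillis = System.currentTimeMillis();
        // Drop everything below minute resolution: 60,000 ms per minute.
        long createdMinute = createdMillis - (createdMillis % 60000L);
        // Index createdMillis into e.g. a "created" field (full resolution,
        // for range queries) and createdMinute into e.g. "created_minute"
        // (the one you sort and facet on).
        System.out.println(createdMillis + " -> " + createdMinute);
    }
}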


--

Hi,


Hm, and I thought truncating dates like that was the trick from before
trie-based date/time fields. I just quickly grepped the ES code and didn't
see them. Maybe ES doesn't support them yet?

Otis

ELASTICSEARCH Performance Monitoring - Sematext


--

I was curious if anyone knows if this formula is still valid for
Elasticsearch 0.90.x?


--

To answer the original question 1: each shard will load the field you want
to facet on into memory. See the memory considerations section of the terms
facet documentation:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-facets-terms-facet.html#_memory_considerations_2