Faceting on a field with very many unique values, on a very large index


(Mark MacGillivray) #1

Hi there, I am using elasticsearch (which is totally brilliant, by the way)
for an index of approximately 21 million records. One particular field in
those records holds between 1 and perhaps 10 values per record, and those
values are often unique to that record. The values are text strings: names
of people. I am using a dynamic mapping.

I would like to be able to facet on this field, but whatever I do, I just
crash my index. So I am looking for further suggestions.

I have stored this field unanalysed. I have tried the field cache type
(index.cache.field.type) both set to soft and left unset, and the field
cache max size (index.cache.field.max_size) set to various values ranging
from 1 to 10,000,000.

I have run this on a single machine with 60 GB of memory reserved for
elasticsearch. It eventually fails with an Out of Memory (OOM) error and
tries to dump the heap.

I have also tried running it on a cluster of 8 machines with 6 GB for
elasticsearch on each, with between 1 and 16 shards and between 1 and 8
replicas, and on a cluster of 4 machines with 12 GB each. It again fails
with OOM, a bit sooner than on the one big machine.

Are other people running facets on fields with this many potentially unique
values - on the order of 70,000,000? Am I just pushing elasticsearch too
far, or is it worth trying with more machines / one even bigger machine /
many even bigger machines?

Any feedback from people faceting at this scale, or any other suggestions
for settings, would be appreciated, so that I can judge whether it is worth
trying further or whether I should just give up faceting on this field.

Thanks!

--


(David Pilato) #2

Hi Mark,

As you have seen, faceting on a huge dataset in 0.19.x is a problem.
Although Shay wrote earlier that 0.20 will bring some memory usage
optimizations for facets, I'm not sure you will be able to facet on
70,000,000 unique values (I have not tested a 0.20 SNAPSHOT myself yet).

But I'm wondering about your use case. What are you trying to achieve here? If
all values are unique, why do you need to group and count them, given that you
will probably get a count of 1 per value?

In the past, the only way I found to avoid OOM was to use filters/queries to
restrict the number of documents to facet on.

So, don't you have a field somewhere in your documents that could help you
reduce the number of documents (a date field, for example)?
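
A minimal sketch of what such a restricted request might look like, built here in Python as a plain dictionary. The field names "published" and "authors", the facet name, and the date bounds are assumptions for illustration only; the body uses the filtered query, range filter and terms facet from the 0.19-era query DSL. Whether narrowing the document set is enough to avoid OOM in a given setup is not something this sketch can show.

```python
import json


def build_restricted_facet_request(date_field, start, end, facet_field, size=10):
    """Build a search body that facets only over documents in a date range.

    All field names passed in are hypothetical; adapt them to your mapping.
    """
    return {
        "query": {
            "filtered": {
                # Match everything, then narrow with the range filter.
                "query": {"match_all": {}},
                "filter": {
                    "range": {date_field: {"from": start, "to": end}}
                },
            }
        },
        "facets": {
            "authors_facet": {
                # Only the top `size` terms are returned.
                "terms": {"field": facet_field, "size": size}
            }
        },
    }


body = build_restricted_facet_request(
    "published", "2012-01-01", "2012-02-01", "authors"
)
print(json.dumps(body, indent=2))
```

The printed JSON could then be sent as the body of a search request against the index.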

Not sure my answer helps... :(
David.

--
David Pilato
http://www.scrutmydocs.org/
http://dev.david.pilato.fr/
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

--


(Igor Motov) #3

Hi Mark,

If your field is stored, you can try using script_field instead of the
standard field-based terms facet. Assuming that your field is called
"authors", you can try something like this:

"facets": {
    "authors_facet": {
        "terms": {
            "script_field": "_fields.authors.values"
        }
    }
}

Start with a really small result set. Running this request across all 21
million records (some 70 million values) will take a really long time.
However, if your result sets are relatively small, you might get acceptable
performance out of it.

Igor

--


(Otis Gospodnetić) #4

Hi,

I was going to suggest limiting faceting to only the top N results.
There is an open issue for this, but no implementation. If that is an
option (it means you don't get exact counts), you could try doing this
in the client, too: just fetch the results including authors, split up
the authors field, and count the names.

Maybe limiting to top N is something that could be accomplished through the
script_field Igor mentioned?
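
A rough sketch of the client-side counting idea, in Python. It assumes each hit has already been reduced to a dictionary whose "authors" entry is either a single string or a list of strings; that shape, the function name and the sample data are all assumptions for illustration, not something taken from the elasticsearch response format.

```python
from collections import Counter


def count_authors(hits, field="authors", top_n=10):
    """Count author names across the fetched hits and return the top N.

    `hits` is assumed to be a list of dicts; a hit without the field is skipped.
    """
    counts = Counter()
    for hit in hits:
        values = hit.get(field, [])
        if isinstance(values, str):
            # A single-valued field arrives as a bare string.
            values = [values]
        counts.update(values)
    return counts.most_common(top_n)


# Hypothetical sample data standing in for the top N search results.
sample_hits = [
    {"authors": ["A. Smith", "B. Jones"]},
    {"authors": "A. Smith"},
    {"authors": ["C. Lee"]},
    {},
]
top = count_authors(sample_hits, top_n=2)
print(top)
```

The counts are exact only for the hits that were fetched, which is the trade-off mentioned above.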

Otis

Search Analytics - http://sematext.com/search-analytics/index.html
Performance Monitoring - http://sematext.com/spm/index.html

--


(Ivan Brusic) #5

Are the names of people constantly changing, or are they static? I facet
only on numeric keys (ints and longs) and convert the keys to the proper
text values using a mapping on the client side.
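
As an illustration only, the client-side conversion could look something like the Python sketch below. The ID-to-name table, the function name and the sample facet entries are hypothetical; the entries follow the term/count shape of a terms facet response.

```python
# Hypothetical lookup table from numeric author ID to display name.
AUTHOR_NAMES = {101: "A. Smith", 102: "B. Jones"}


def label_facet_terms(terms, names):
    """Replace numeric facet terms with their text values.

    IDs missing from the lookup table are kept visible rather than dropped.
    """
    labelled = []
    for entry in terms:
        name = names.get(entry["term"], "unknown id %s" % entry["term"])
        labelled.append({"name": name, "count": entry["count"]})
    return labelled


# Sample facet entries as they might come back for a numeric field.
facet_terms = [
    {"term": 101, "count": 7},
    {"term": 102, "count": 3},
    {"term": 999, "count": 1},
]
labelled = label_facet_terms(facet_terms, AUTHOR_NAMES)
print(labelled)
```

This only works well if the set of names is stable enough for the client to keep the lookup table up to date.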

--
Ivan

--

