Faceting too memory-hungry

Hi all,

I want to return facet counts on a field that has many possible
values. It is an "author" field in a large database of documents, so
there are potentially even more distinct values than there are
documents.

The field mapping:

"author": {
"type": "multi_field",
"fields": {
"author": {
"type" : "string",
},
"untouched" : {
"index" : "not_analyzed",
"type": "string",
}
}
}

(The reason for the multi_field is that I want to be able to search the
field for "Jan Ramon" as well as "ramon jan", i.e. analyzed, but I also
want to return facet counts on the whole field value, i.e. not analyzed.)
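
For the analyzed side, the kind of request I have in mind looks roughly
like this (just a sketch - query_string is only one way to express it,
and the field names are the ones from the mapping above):

{
    "query": {
        "query_string": {
            "default_field": "author",
            "query": "ramon jan"
        }
    }
}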

Trouble is, ES runs out of heap space as soon as I ask for this facet,
for example with

"facets": {
"author": {
"terms": {
"field": "author.untouched",
"size": 10
}
}
}

Is there a way to make faceting less memory-hungry? What is a good way
to facet on the author field?

Many thanks in advance.

Hello!

Faceting on high-cardinality fields can be expensive in terms of the
memory needed. How much heap memory do your Elasticsearch nodes have
assigned?

--
Regards,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Elasticsearch


Hello Rafał,

On Jun 12, 5:40 pm, Rafał Kuć r....@solr.pl wrote:

Faceting on high-cardinality fields can be expensive in terms of the
memory needed. How much heap memory do your Elasticsearch nodes have
assigned?

Right now it's 4GB, with memlock turned on.
But if the memory needed for faceting grows linearly with the number of
distinct values (does it?), it's bound to exceed the heap size sooner
or later, no matter what the absolute value is.

Is there a way/workaround/hack to get faceting working on such a nasty
"author" field? I imagine faceting on author is a very common
scenario?

Btw the facet counts don't need to be 100% accurate, but if they are
inaccurate, they must at least be "consistently inaccurate" across
queries.

Cheers.

Hello!

You may try a different field data cache type than the default
resident one, or set a maximum size for the cache. I hope nobody will
mind if I point you here - ElasticSearch Cache Usage - Sematext. It
should give you some information on what to change.

However, if you use a limited field data cache or the soft field data
cache type, please be aware that rebuilding the field data cache is
expensive, so your faceting performance will suffer because of it.
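
For illustration only, the settings described there would go into
elasticsearch.yml along these lines (a rough sketch - please verify the
exact names and sensible values for your version against that post):

# field data cache settings (0.19.x era)
index.cache.field.type: soft        # soft references instead of the default resident cache
index.cache.field.max_size: 50000   # maximum number of cached entries
index.cache.field.expire: 10m       # evict entries not accessed for 10 minutes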

--
Regards,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Elasticsearch


Thanks for the link, Rafał, very useful.

Counting the frequencies of unique strings is such a fundamental
operation (well researched and optimized to death) that I am surprised
it is such a big deal in ES.

Maybe I'm better off using a separate application to keep track of the
string facet counts and return the top-N on request? I'm thinking of a
simple string->count mapping backed by a DB of sorts, like SQLite or
Berkeley DB. Then I'd merge the search results from ES with the facet
results from this app. The downside is that I'd need to index things
twice (in two separate places), at least for the author field, which
adds a lot of system complexity.

What do you think?

Ah no, scratch that, I forgot the facet counts are query-dependent :)

I'll have to give this more thought, and in the meantime, turn off
author faceting.

Also have a look
at https://groups.google.com/d/msg/elasticsearch/ePJgCtBpyrs/39pzczoRokoJ

Otis

On Tuesday, June 12, 2012 7:33:59 PM UTC+2, Otis Gospodnetic wrote:

Also have a look at
https://groups.google.com/d/msg/elasticsearch/ePJgCtBpyrs/39pzczoRokoJ

Excellent info, thanks Otis. Solr's FieldCache looks exactly like what we
need.

I'll keep an eye on the "improved faceting" progress in ES, and on the
"improved NRT" progress in Solr -- because I need them both! Whichever
engine gets there first :)