On Tuesday, June 12, 2012 11:32:26 AM UTC-4, Crwe wrote:
Hi all,
I want to return facet counts on a field that has many possible
values. It is an "author" field in a large database of documents;
there are potentially even more distinct values than there are total
documents.
(The reason for the multi-field is that I want to be able to search the
field for "Jan Ramon" as well as "ramon jan" (analyzed), but I also
want to return facet counts for the whole field (unanalyzed).)
Trouble is, ES runs out of heap space as soon as I ask for this facet,
such as with:
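(The original request body was lost from the archive; a sketch of the kind of terms facet request meant here, where the `author.exact` sub-field name is purely illustrative of the unanalyzed half of the multi-field, would look roughly like:)

```json
{
  "query": { "match_all": {} },
  "facets": {
    "authors": {
      "terms": { "field": "author.exact", "size": 10 }
    }
  }
}
```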
On Jun 12, 5:40 pm, Rafał Kuć (r....@solr.pl) wrote:
Faceting on high-cardinality fields can be expensive in terms of the
memory needed. How much heap memory do your Elasticsearch nodes
have assigned?
right now, it's 4GB, with memlock turned on.
But if the memory for faceting grows linearly with the number of
distinct values (does it?), it's bound to exceed the heap size sooner
or later, no matter the absolute value.
Is there a way/workaround/hack to get faceting working on such a nasty
"author" field? I imagine faceting on author is a very common
scenario?
Btw, the facet counts don't need to be 100% accurate, but if they are
inaccurate, they must be at least "consistently inaccurate" across
queries.
You may try a different field data cache type than the default
resident one, or set a maximum size for the cache. I hope nobody will
get mad if I point you here: ElasticSearch Cache Usage - Sematext. There
you can find some information on what to change.
However, if you use a limited field data cache or the soft field
data cache type, please be aware that rebuilding the field data cache is
expensive, and your faceting performance will suffer because of that.
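(For example, a sketch of the 0.19-era node settings being described; the exact values are illustrative and should be tuned, in `elasticsearch.yml`:)

```yaml
# Use a soft-reference field data cache instead of the default
# "resident" one, so entries can be evicted under memory pressure:
index.cache.field.type: soft
# Or cap the number of cached entries and let old ones expire:
index.cache.field.max_size: 50000
index.cache.field.expire: 10m
```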
Counting the frequency of unique strings is such a fundamental operation
(well researched and optimized to death) that I am surprised it is
such a big deal in ES.
Maybe I'm better off using a separate application to keep track of the
string facet counts and return the top-N on request? I'm thinking of a
simple string->count mapping, backed by a DB of sorts, like SQLite or
Berkeley DB. Then I'd merge the search results from ES with the facet
results from this app. The downside is I'd need to index things twice (in
two separate places), at least for the author field, which adds a lot
of system complexity.
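(A minimal sketch of what such an external string->count store could look like with SQLite, assuming the application sees every document as it is indexed; the table and function names here are made up for illustration:)

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path in practice
conn.execute(
    "CREATE TABLE IF NOT EXISTS author_counts ("
    "author TEXT PRIMARY KEY, n INTEGER NOT NULL DEFAULT 0)"
)

def record_author(author: str) -> None:
    """Increment the count for one author (call once per indexed doc)."""
    conn.execute(
        "INSERT INTO author_counts (author, n) VALUES (?, 1) "
        "ON CONFLICT(author) DO UPDATE SET n = n + 1",
        (author,),
    )
    conn.commit()

def top_n(limit: int):
    """Return the most frequent authors as (author, count) pairs."""
    return conn.execute(
        "SELECT author, n FROM author_counts "
        "ORDER BY n DESC, author LIMIT ?",
        (limit,),
    ).fetchall()

for a in ["Jan Ramon", "Jan Ramon", "Ada Lovelace"]:
    record_author(a)
print(top_n(2))  # [('Jan Ramon', 2), ('Ada Lovelace', 1)]
```

(The upsert syntax requires SQLite 3.24+; note this only gives global top-N counts, not per-query counts, which is one more reason it adds complexity next to ES.)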
Excellent info, thanks Otis. Solr's FieldCache looks exactly like what we
need.
I'll keep an eye on the "improved faceting" progress in ES, while also
watching the "improved NRT" progress in Solr -- because I need them both!
Whichever engine gets there first