Hi, we're thinking of implementing a new feature on top of terms facets.
For each document we have a list of "keywords", similar to tags. Each
document may have from a few (3-5) keywords to more than 20, although it
won't go much beyond that.
We want to use facets to show a "keyword" counter alongside the results,
so the user can see the most popular keywords in the results of that
query.
More information:
The number of possible keywords is fixed: we use a closed set of around
8 million different keywords! But in practice our whole dataset holds
around 1.5 million distinct keywords, so the 8 million figure isn't a
concern.
Our index has ~40 million documents and is 64.8GB in size (including _source).
The indexing rate is low, roughly 100 docs/minute (a few documents per second at peak).
The index has 8 shards across 4 nodes. Each node has ~7GB of RAM, with
4GB allocated to ES's JVM heap.
My questions:
(1) Is it feasible to facet on such a field? I'm wondering whether a field
with that many terms will fit in memory (in < 4GB).
(2) I can easily store the keywords as strings or integers (and retrieve
the keyword string from the DB). Would it make a great difference to store
integers instead of strings? Would ES use something like an Integer
FieldCache instead of a String FieldCache? (keywords may have multiple
words, so integers would at least win by having fewer characters)
(3) We also have the keyword-document relations stored in the DB (MySQL).
So another possibility would be to submit the query as usual but somehow
retrieve the complete list of document ids that match it. With that list
of documents in hand I would do the "faceting" on the MySQL side.
My questions:
(1) Is it feasible to facet on such a field? I'm wondering whether a field
with that many terms will fit in memory (in < 4GB).
This might be an issue for you. In general, adding more memory or more
nodes (but then also increasing the number of shards) will fix this
problem. If you can't add more memory or nodes, then you have a problem
that can't be solved easily. What you can do is use a terms facet script
with the _source notation in your script. Facet values are then read
from disk each time a document is evaluated. This will lower your memory
usage in the heap, but your search requests will be many times slower.
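For illustration, a minimal sketch of such a request, assuming a field
named keywords on a hypothetical index myindex (the script_field option
makes the terms facet read values via _source instead of the field data
cache):

# terms facet that reads keyword values from _source for every evaluated doc
curl -XPOST 'localhost:9200/myindex/_search' -d '{
  "query": { "match_all": {} },
  "facets": {
    "keywords": {
      "terms": {
        "script_field": "_source.keywords",
        "size": 10
      }
    }
  }
}'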
(2) I can easily store the keywords as strings or integers (and retrieve
the keyword string from the DB). Would it make a great difference to store
integers instead of strings? Would ES use something like an Integer
FieldCache instead of a String FieldCache? (keywords may have multiple
words, so integers would at least win by having fewer characters)
Yes, this does make a difference. ES has an integer-based implementation
of the field cache. There are also short- and byte-based implementations.
If you can use those types in your mapping, this lowers the memory usage
even further. By default a number gets the long type, so make sure you
define the appropriate type in your mapping.
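For example, something along these lines (hypothetical index/type names;
note that with ~8 million possible keyword ids, integer is the narrowest
numeric type that can hold them, since short and byte top out far lower):

# map the keywords field as integer instead of the default long
curl -XPUT 'localhost:9200/myindex/mytype/_mapping' -d '{
  "mytype": {
    "properties": {
      "keywords": { "type": "integer" }
    }
  }
}'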
(3) We also have the keyword-document relations stored in the DB (MySQL).
So another possibility would be to submit the query as usual but somehow
retrieve the complete list of document ids that match it. With that list
of documents in hand I would do the "faceting" on the MySQL side.
I don't have much experience with this solution, but you can use
search_type=scan in your search request and then easily stream the
document ids out of ES with minimal overhead.
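Roughly like this, if memory serves (myindex is a placeholder; the empty
fields array makes ES return just the ids, size is per shard, and you
page through with the returned _scroll_id until no more hits come back):

# open the scan; the response contains a _scroll_id but no hits yet
curl -XGET 'localhost:9200/myindex/_search?search_type=scan&scroll=1m&size=500' -d '{
  "query": { "match_all": {} },
  "fields": []
}'

# fetch the next batch, repeating with each new _scroll_id
curl -XGET 'localhost:9200/_search/scroll?scroll=1m' -d '<_scroll_id from the previous response>'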
Do you, or anyone else, know how multivalued fields are cached in ES?
For single-valued integer fields, for example, there is a FieldCache that
is basically a large array of ints, where array indexes represent the
docIds and array values the field values.
How does that work for a field that holds an integer array, for example?
I'm trying a back-of-the-envelope calculation to estimate how much memory
we would need.
If you take a look at strings/MultiValueStringFieldData, you will see
that it contains a values array holding the actual unique values and an
int ordinals array. The ordinals (a multi-dimensional array) point to
values in the values array. The size of the ordinals array is defined by:
[maxDoc][highest_mv_count]
The maxDoc is the highest Lucene docId + 1. This is the highest docId
in the segment that this particular cache entry gets loaded for.
The highest_mv_count is the highest number of values for that field on
any document inside this segment.
The same applies to ints/MultiValueIntFieldData, only here the values
array is an int array instead of a String array.
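To make that concrete with the figures from this thread (a rough estimate
under my assumptions: 40M docs spread over 8 shards ≈ 5M docs per shard,
a highest_mv_count of about 20, and 4-byte int ordinals):

5,000,000 docs x 20 ordinals x 4 bytes ≈ 400MB per shard

With 2 shards per node, that is on the order of 800MB of heap per node
for the ordinals alone, plus the values array for the ~1.5M distinct
keywords, which is small by comparison.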
I hope this helps with your back-of-the-envelope calculation. You could
also just try to load it (by faceting on that field) and use the node
stats API to tell you the size of the field data cache.
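If it helps, the call looks roughly like this on the versions I've used
(the field cache size shows up under the indices cache stats; the exact
endpoint and field names vary by version):

# per-node stats, including field cache size under indices -> cache
curl -XGET 'localhost:9200/_cluster/nodes/stats?pretty=true'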