Hi, we're thinking of implementing a new feature on top of terms facets.
For each document we have a list of "keywords", similar to tags. Each
document may have anywhere from a few keywords (3-5) to a bit more than 20,
but it won't go much beyond that.
We want to use facets to show a "keyword" counter alongside the results,
so the user can see the most popular keywords among the results of that
query.
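Just to make it concrete, the kind of request I have in mind looks roughly
like this (a minimal sketch over plain HTTP; the index name "docs", the
field name "keywords" and the query itself are placeholders):

import requests  # plain HTTP client, nothing ES-specific

body = {
    "query": {"match": {"title": "some user query"}},  # whatever the user searched for
    "facets": {
        "keywords": {
            "terms": {"field": "keywords", "size": 20}  # top keywords in the result set
        }
    },
    "size": 10,
}

resp = requests.post("http://localhost:9200/docs/_search", json=body).json()
for entry in resp["facets"]["keywords"]["terms"]:
    print(entry["term"], entry["count"])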
More information:
- The number of possible keywords is fixed. We use a closed set of around 8
million different keywords! But in practice our whole dataset holds around
1.5 million distinct keywords, so the 8 million figure isn't a concern.
- Our index has ~40 million documents and is 64.8 GB in size (with _source).
- The indexing rate is a few documents per second (~100 docs/minute).
- The index has 8 shards across 4 nodes. Each node has ~7 GB of RAM, with
4 GB allocated to the ES JVM.
My questions:
(1) Is it feasible to facet on such a field? I'm worried that a field with
that many distinct terms won't fit in memory (in < 4 GB).
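To make (1) concrete, my own back-of-envelope estimate looks like this (the
per-entry byte costs are assumptions on my part, not measured numbers):

# Rough guess at what faceting on "keywords" might load into the field cache.
num_docs = 40_000_000           # documents in the index
avg_keywords_per_doc = 10       # somewhere between "a few" and 20+
distinct_keywords = 1_500_000   # distinct terms actually present

bytes_per_ordinal = 4           # assumed: one int per doc-keyword pair
avg_term_bytes = 30             # assumed: multi-word keywords as UTF-8

ordinals = num_docs * avg_keywords_per_doc * bytes_per_ordinal
terms = distinct_keywords * avg_term_bytes
print(ordinals / 2**30, "GB for doc->term ordinals")   # ~1.5 GB
print(terms / 2**20, "MB for the term dictionary")     # ~43 MB

If that guess is anywhere near reality it should fit, split across 8 shards
on 4 nodes, but I have no idea whether the real layout is close to this.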
(2) I can easily store the keywords as strings or as integers (and retrieve
the keyword string from the DB). Would it make a big difference to store
integers instead of strings? Would ES use something like an Integer
FieldCache instead of a String FieldCache? (Keywords may contain multiple
words, so at the very least integers would win by having fewer characters.)
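In case it matters, the two mapping variants I'm considering would look
roughly like this (just a sketch; the type name "doc" is a placeholder, and
the integer variant assumes I translate IDs back to strings via the DB):

string_mapping = {
    "doc": {
        "properties": {
            # whole keywords, not analyzed, so a multi-word keyword
            # stays a single facet term
            "keywords": {"type": "string", "index": "not_analyzed"}
        }
    }
}

integer_mapping = {
    "doc": {
        "properties": {
            # keyword IDs only; display strings looked up in our DB
            "keywords": {"type": "integer"}
        }
    }
}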
(3) We also have the keyword-document relations stored in the DB (MySQL).
So another possibility would be to submit the query as usual but somehow
retrieve the complete list of document IDs that match that query. With that
list of documents in hand, I would do the "faceting" on the MySQL side.
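What I mean by (3) is roughly the sketch below: pull only the matching IDs
out of ES and do the counting in MySQL (the table and column names are made
up, and for result sets with millions of hits this would of course need
scan/scroll and batching instead of a single request):

import requests
import mysql.connector  # assumed driver; any MySQL client would do

# 1) Get the matching document IDs from ES (simplified: no scrolling).
body = {
    "query": {"match": {"title": "some user query"}},
    "fields": [],      # we only need the _id of each hit
    "size": 10000,
}
hits = requests.post("http://localhost:9200/docs/_search", json=body).json()["hits"]["hits"]
doc_ids = [int(h["_id"]) for h in hits]

# 2) Count keywords for those documents on the MySQL side (hypothetical schema).
conn = mysql.connector.connect(host="localhost", user="app", password="...", database="app")
cur = conn.cursor()
placeholders = ",".join(["%s"] * len(doc_ids))
cur.execute(
    "SELECT keyword_id, COUNT(*) AS cnt "
    "FROM document_keywords "
    "WHERE doc_id IN (" + placeholders + ") "
    "GROUP BY keyword_id ORDER BY cnt DESC LIMIT 20",
    doc_ids,
)
for keyword_id, cnt in cur:
    print(keyword_id, cnt)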
Any thoughts?
Thanks!!
Felipe Hummel