Hi All,
In my use case we need to facet fields with lots of possible unique values.
For example, you can think of the author field on Amazon. Lots of books
written by lots of people. Still, it is useful to facet it, as some authors
may be very active in a certain area. If the query is focused, an author
facet can be very useful.
For these kinds of fields, the current faceting implementation in
ElasticSearch is problematic. The reason is that it (pre) loads all
possible values of the field into memory. On our index this means 30GB for
a single field.
This problem has been discussed before on the list, but without a proper
solution. The only thing that comes close is an idea by Andy Wick
( https://groups.google.com/forum/#!msg/elasticsearch/xsMmFDuSVCM/gPzXVBzLlBkJ
) summarized nicely by Otis
on https://groups.google.com/d/msg/elasticsearch/ePJgCtBpyrs/pJ5REcVPPhkJ
. The basic idea of the solution is that instead of loading strings into
memory, you load numeric hashes of the terms. The downside is that you have
to use a secondary index, which is used while indexing and faceting, to map
the hashes back to their string values.
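To make the hashing part concrete, here is a minimal sketch in Java.
hashTerm() is a hypothetical helper of my own, not Andy's actual code;
collisions and the hash-to-string lookup would have to go through that
secondary index.

import java.nio.charset.StandardCharsets;

public class HashedFacetSketch {
    // Turn a term into a fixed-size long so the field cache holds
    // numbers instead of strings. Hash collisions and the reverse
    // hash -> string lookup are handled by the secondary index.
    static long hashTerm(String term) {
        long h = 1125899906842597L; // arbitrary odd seed
        for (byte b : term.getBytes(StandardCharsets.UTF_8)) {
            h = 31 * h + b;
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hashTerm("Jules Verne"));
    }
}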
I would like to propose another, related idea and ask if people see
problems down the road that I should be aware of. I'm fairly new to
ElasticSearch and
Lucene and would appreciate experienced advice. Of course, if someone else
has a working solution I would love to hear about it too.
The idea is based on the following assumptions:
- While it is infeasible to load all unique strings into memory, it's OK
to load an equal amount of numbers.
- For facets you typically need the string representation of a far smaller
number of terms compared to the total, much as in searching.
- For every document there is only one term. In the author example this
means every book is written by just one author. This restriction simplifies
things. I have some ideas for how it can be removed, if this works out fine.
The implementation consists of building a new FieldCache which, instead of
mapping every document id in a segment to a string, maps it to another
document id which has the same term. Using the books example again, say
Twenty Thousand Leagues Under the Sea by Jules Verne has id 1 and Journey
to the Center of the Earth has id 2. The cache will contain an entry
mapping 1 to itself and 2 to 1 (which has the same author).
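In case it clarifies the structure, here is a minimal sketch of such a
cache using plain Java arrays rather than real Lucene types; the walk over
the segment's terms (e.g. via a TermEnum/TermDocs loop) is left to the
caller.

public class RepresentativeDocCache {
    // representative[doc] holds the id of the first document in the
    // segment that carries doc's term; in the books example above,
    // representative[1] == 1 and representative[2] == 1.
    final int[] representative;

    RepresentativeDocCache(int maxDoc) {
        representative = new int[maxDoc];
    }

    // Called once per unique term with the ids of all documents that
    // contain it; the first doc becomes the representative for all
    // of them, so no strings are ever held in memory.
    void addTerm(int[] docsWithTerm) {
        int rep = docsWithTerm[0];
        for (int doc : docsWithTerm) {
            representative[doc] = rep;
        }
    }
}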
Using that field cache, faceting would work as follows (a rough sketch
follows the list):
- Extract the top n terms (each represented by a document id), using the
same classes the numerical facets use.
- Once we have the top n terms, look up the relevant documents and extract
the actual field values. This can be done during the facet() phase of
collection.
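Here is a minimal sketch of those two phases, assuming the
representative[] array from the cache sketch above; resolve() is just a
placeholder for a stored-field lookup through the segment's IndexReader.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntFunction;

public class RepresentativeFacetSketch {
    private final int[] representative; // the per-segment cache from above
    private final Map<Integer, Integer> counts = new HashMap<>();

    RepresentativeFacetSketch(int[] representative) {
        this.representative = representative;
    }

    // Phase 1: for every matching document, bump the counter of its
    // representative doc id -- numbers only, no strings in memory,
    // just like the numerical facets count per number.
    void collect(int doc) {
        counts.merge(representative[doc], 1, Integer::sum);
    }

    // Phase 2 (the facet() phase): sort the counters, keep the top n,
    // and resolve only those few doc ids to their string values.
    List<String> topN(int n, IntFunction<String> resolve) {
        List<Map.Entry<Integer, Integer>> entries =
            new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        List<String> top = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e
                : entries.subList(0, Math.min(n, entries.size()))) {
            top.add(resolve.apply(e.getKey()) + " (" + e.getValue() + ")");
        }
        return top;
    }
}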
As far as I can tell, this should all work but for a couple of potential
problems:
- To retrieve the documents during the facet() phase you need an
IndexReader. The FieldCache is created for every segment and can store such
an IndexReader. However, I'm not sure whether you can safely store a copy
in the cache. What happens when the segment is merged? What happens when
the FieldCache is evicted from memory?
- The FieldCache might need to refer to documents which have been deleted
in the meantime. Is it a problem to access deleted documents via an
IndexReader?
- A couple more things I probably missed.
I would really appreciate it if people with more experience would comment
on this and give feedback. In return (and anyway :) ) I will keep the list
up to date on how things work out.
Thanks in advance,
Boaz