Faceting high cardinality fields - idea review


(Boaz Leskes) #1

Hi All,

In my use case we need to facet fields with lots of possible unique values.
For example, you can think of the author field in Amazon. Lots of books
written by lots of people. Still, it is useful to facet it, as some authors
may be very active in a certain area. If the query is focused, an author
facet can be very useful.

For these kinds of fields, the current faceting implementation in
ElasticSearch is problematic. The reason is that it (pre) loads all
possible values of the field into memory. On our index this means 30GB for
a single field.

This problem has been discussed before on the list, but without a proper
solution. The only thing that comes close is an idea by Andy Wick
( https://groups.google.com/forum/#!msg/elasticsearch/xsMmFDuSVCM/gPzXVBzLlBkJ
) summarized nicely by Otis
on https://groups.google.com/d/msg/elasticsearch/ePJgCtBpyrs/pJ5REcVPPhkJ
. The basic idea of that solution is to load numeric representations of the
values into memory instead of the strings themselves. The downside is that
you have to maintain a secondary index which is used while indexing and
faceting.

I would like to propose another, related idea and ask if there are problems
down the road I should be aware of. I'm fairly new to ElasticSearch and
Lucene and would appreciate experienced advice. Of course, if someone else
has a working solution I would love to hear about it too.

The idea is based on the following three assumptions:

  1. While it is infeasible to load all unique strings into memory, it's OK
    to load an equal amount of numbers.
  2. For facets you typically need the string representation of a far smaller
    number of terms than the total, much as with searching.
  3. For every document there is only one term. In the author example this
    means every book is written by just one author. This restriction simplifies
    things. I have some ideas for how it can be removed, if the basic approach
    works fine.

The implementation consists of building a new FieldCache which, instead of
mapping every document id in a segment to a string, maps it to another
document id which has the same term. Using the books example again, say
Twenty Thousand Leagues Under the Sea by Jules Verne has id 1 and Journey
to the Center of the Earth has id 2. The cache will contain an entry
mapping 1 to itself and 2 to 1 (which has the same author).
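As a minimal sketch of this representative-document cache (illustrative Python, not Lucene code; the per-segment term enumeration is assumed to be available as a plain doc-id-to-term mapping):

```python
def build_representative_cache(doc_terms):
    """Map each doc id to the first doc id seen with the same term.

    doc_terms: dict of doc_id -> term (one term per doc, per assumption 3).
    Returns dict of doc_id -> representative doc_id; only the
    representative's string ever needs to be resolved later.
    """
    first_doc_for_term = {}
    cache = {}
    for doc_id in sorted(doc_terms):
        term = doc_terms[doc_id]
        # First doc seen with this term becomes the representative.
        rep = first_doc_for_term.setdefault(term, doc_id)
        cache[doc_id] = rep
    return cache

# Books example: docs 1 and 2 are both by Jules Verne.
cache = build_representative_cache({1: "Jules Verne", 2: "Jules Verne"})
# cache maps 1 to itself and 2 to 1
```

The cache holds only integers, so its footprint scales with the number of documents rather than with the total size of the unique strings.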

Using that field cache, faceting would work as follows:

  1. Extract the top n terms (each represented by a document id), using the
    same classes the numerical facets use.
  2. Once we have the top n terms, look up the relevant documents and extract
    the actual field values. This can be done during the facet() phase of
    collection.
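The two steps above can be sketched as follows (illustrative Python; `resolve_term` is a hypothetical stand-in for the stored-field lookup that would need an IndexReader):

```python
from collections import Counter

def facet_top_n(matching_docs, cache, resolve_term, n):
    """Count facet hits per representative doc id, then resolve only
    the top-n representatives back to their string values.

    matching_docs: doc ids matched by the query.
    cache: doc_id -> representative doc_id (the proposed FieldCache).
    resolve_term: callback doc_id -> string (hypothetical; stands in
    for fetching the stored field via an IndexReader).
    """
    # Step 1: count purely on integers, as the numerical facets do.
    counts = Counter(cache[d] for d in matching_docs)
    # Step 2: resolve strings for the top n representatives only.
    return [(resolve_term(rep), count)
            for rep, count in counts.most_common(n)]

cache = {1: 1, 2: 1, 3: 3}
terms = {1: "Jules Verne", 3: "H. G. Wells"}
top = facet_top_n([1, 2, 3], cache, terms.__getitem__, 2)
```

Only n string lookups happen per facet request, regardless of how many unique values the field has.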

As far as I can tell, this should all work, but for a couple of potential
problems:

  1. To retrieve the documents during the facet() phase you need an
    IndexReader. The FieldCache is created for every segment and can store such
    an IndexReader. However, I'm not sure whether you can safely store a copy
    in the cache. What happens when the segment is merged? What happens when
    the FieldCache is evicted from memory?
  2. The FieldCache might need to refer to documents which have been deleted
    in the meantime. Is it a problem to access deleted documents via an
    IndexReader?
  3. A couple more things I probably missed :)

I would really appreciate it if people with more experience would comment on
this and give feedback. In return (and anyway :) ) I will keep the list up to
date on how things work out.

Thanks in advance,
Boaz


(Jörg Prante) #2

Hi Boaz,

I will try sketching the idea how libraries deal with faceting when they
need to construct catalogs. The problem statement is nearly the same, for
example, for library union catalogs with millions of authors and millions
of titles.

(In fact, you mean faceting low cardinality - you find only a few titles for
every author on average - because high cardinality is just the case where
inverted index engines really shine.)

The idea of mapping author names to discrete numbers (index positions) has
been well known for decades. This is known as discretization. Libraries use
authority record IDs. In Germany we have a file named GND; in the US there
are the LC authorities (http://authorities.loc.gov/) - other national
libraries have similar solutions. These IDs are shorter than author names
and they are unique, so you can save a lot of space. If IDs can be mapped to
longs, you can save even more space. For facet visualization, you need to
map the IDs back to the descriptive names.
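A minimal sketch of this discretization, assuming you build the ID file yourself rather than using authority records (illustrative Python, names hypothetical):

```python
class TermDictionary:
    """Assign each distinct author name a small integer id (discretization).

    Faceting then operates on ints; only the ids actually shown to the
    user are mapped back to descriptive names.
    """

    def __init__(self):
        self._ids = {}    # name -> id
        self._names = []  # id -> name

    def id_for(self, name):
        # Assign the next free id on first sight, reuse it afterwards.
        if name not in self._ids:
            self._ids[name] = len(self._names)
            self._names.append(name)
        return self._ids[name]

    def name_for(self, term_id):
        return self._names[term_id]

d = TermDictionary()
verne = d.id_for("Jules Verne")   # first name gets id 0
wells = d.id_for("H. G. Wells")   # next distinct name gets id 1
```

Each author name is stored once, however many titles reference it; the per-title cost drops to one integer.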

If you don't have access to authority records - there is a big move to open
data going on at this time in the library community - you can build such
files yourself while harvesting the title data. Before online
computerization, library staff invented abbreviation techniques for author
names, to scan faster over endless paper printouts looking for names. Think
of using only the first four (ASCII) characters of a name (which fits in
32 bits), or other algorithms. If you can create unique abbreviation forms,
this will also save space for faceting.
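The four-character trick can be sketched like this (illustrative Python; note the caveat that it only works as a key if the prefixes happen to be unique):

```python
def abbrev32(name):
    """Pack the first four ASCII characters of a name into one 32-bit int.

    This mirrors the paper-era abbreviation technique described above.
    It is only usable as a facet key when the four-character prefixes
    are unique across your author list - collisions are possible.
    """
    # Pad short names to four characters, force ASCII.
    prefix = name[:4].ljust(4).encode("ascii", "replace")
    value = 0
    for byte in prefix:
        value = (value << 8) | byte  # one byte per character
    return value

collision = abbrev32("Verne") == abbrev32("Vernon")  # both start "Vern"
```

The collision above shows why unique abbreviation forms must be verified against the actual data before relying on this scheme.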

Don't assume you only have one author per title.

I wouldn't mess with the field cache unless you really fail at reducing the
number of values for the facet field and can't afford enough RAM for
faceting. The reason is that the field cache is very performance-sensitive.
You have already noticed that deletes will invalidate the cache, and
rebuilding the field cache is expensive.

Best regards,

Jörg

On Saturday, June 23, 2012 9:47:00 AM UTC+2, Boaz Leskes wrote:



(Radim) #3

+1 on improving faceting in ElasticSearch. This also includes fixing
the "wrong facet counts bug", https://github.com/elasticsearch/elasticsearch/issues/1305
.

I don't really understand Boaz's solution here, and I don't know the
insides of ES enough to architect an alternative solution. But
faceting on authors is a very important use case for us, so if a
solution is found and blessed by Shay, I am willing to help with
implementing/testing it.

Best,
Radim

On Jun 23, 9:47 am, Boaz Leskes b.les...@gmail.com wrote:



(Otis Gospodnetić) #4

I think Robert Muir is on this list. While at Berlin Buzzwords I chatted
with Robert while procrastinating on applying some final touches to our
presentations, and somehow faceting was brought up. Robert has an idea
(or a plan?) for changing how faceting is done at the Lucene level
altogether, and the approach he has in mind should use a lot less memory,
but will require one to specify at index time which fields will be used for
faceting. You can smell a bit of the desire to change this stuff in this
one-year-old email: http://search-lucene.com/m/EOsg6urICZ1 (the whole
thread seems relevant).

Maybe it's time...

Otis

Search Analytics - http://sematext.com/search-analytics/index.html
Scalable Performance Monitoring - http://sematext.com/spm/index.html

On Saturday, June 23, 2012 3:47:00 AM UTC-4, Boaz Leskes wrote:



(Radim) #5

That fills me with hope, thanks Otis :slight_smile:

FST storage in Lucene sounds great, if a bit sci-fi. But this seems
hopeful and already implemented: http://blog.mikemccandless.com/2010/07/lucenes-ram-usage-for-searching.html
(from that same thread).

Apparently the new Lucene (4.0) is much more memory-friendly in this
respect, compared to the 3.x that ships with ElasticSearch.

Radim

On Jun 25, 6:23 am, Otis Gospodnetic otis.gospodne...@gmail.com
wrote:



(Radim) #6

Bump.

Fixing and improving faceting seems to be critical for many people
(including us) -- is this on the roadmap somewhere? Any pointers on where
to start if I want to do this myself?

Thanks,
Radim

On Jun 24, 9:58 pm, Radim m...@radimrehurek.com wrote:



(Boaz Leskes) #7

Hi,

I'm also interested to hear what's on the roadmap for faceting. It would be
a shame if energy were wasted on duplicate work.

@Otis - thanks for the link, interesting read. I'm not sure I have the
time (nor enough understanding) to do that right now.

Can anyone (Shay?) please help me with the following:

  1. To retrieve the documents during the facet() phase you need
    an IndexReader. Can I cache the readers I get during the collect phase in a
    custom field cache object? What happens when the segment is merged? What
    happens when the FieldCache is evicted from memory?

  2. The FieldCache might need to refer to documents which have been
    deleted in the meantime. Is it a problem to access deleted documents via
    an IndexReader?

Thx!
Boaz

On Fri, Jun 29, 2012 at 8:46 AM, Radim me@radimrehurek.com wrote:



(Radim) #8

Bump.

The bug in faceting is fairly critical, and our offer (from at least two
developers?) to help improve faceting still stands. But we need some
assistance from the old hands.

Or at least an answer...

-rr

On Jun 30, 1:41 pm, Boaz Leskes b.les...@gmail.com wrote:

Hi,

I'm also interested to hear what's on the roadmap for faceting. It would be
a shame if energy is wasted on duplicate work.

@Otis - thanks for the link. Interesting read. I'm not sure I have the
time (nor enough understanding) to do that right now.

Can anyone (Shay?) please help me with the following:

  1. To retrieve the documents during the facet() phase you need an
     IndexReader. Can I cache the readers I get during the collect phase in
     a custom field cache object? What happens when the segment is merged?
     What happens when the FieldCache is evicted from memory?

  2. The FieldCache might need to refer to documents which have been
     deleted in the meantime. Is it a problem to access deleted documents
     via an IndexReader?

Thx!
Boaz

On Fri, Jun 29, 2012 at 8:46 AM, Radim m...@radimrehurek.com wrote:

Bump.

Fixing & improving faceting seems to be critical for many people
(including us) -- is this on the roadmap somewhere? Pointers on where to
start if I want to do this myself?

Thanks,
Radim

On Jun 24, 9:58 pm, Radim m...@radimrehurek.com wrote:

+1 on improving faceting in ElasticSearch. This also includes fixing
the "wrong facet counts" bug,
https://github.com/elasticsearch/elasticsearch/issues/1305 .

I don't really understand Boaz's solution here, and I don't know the
insides of ES enough to architect an alternative solution. But
faceting on authors is a very important use case for us, so if a
solution is found and blessed by Shay, I am willing to help with
implementing/testing it.

Best,
Radim

On Jun 23, 9:47 am, Boaz Leskes b.les...@gmail.com wrote:

[original proposal quoted in full; see the first message in this thread]

(Boaz Leskes) #9

Hi Radim,

I don't know if it's still relevant for you, but I've started working on a
facet that mimics the standard terms facet but with a much lower memory
footprint (and slightly more IO overhead). It would be great if you (and
anyone else interested :) ) could help me test it. It's available
at https://github.com/bleskes/elasticfacets#hashed-strings-facet . To make
things easy, I've packaged a development version which can be installed on
version 0.19.9 or higher with bin/plugin -install
bleskes/elasticfacets/0.2.1-SNAPSHOT .

Please let me know if this is what you need,
Boaz

On Monday, July 16, 2012 10:19:34 AM UTC+2, Radim wrote:

[earlier thread quoted in full; see the messages above]


(system) #10