Term faceting without tokenizing and without not_analyzed


(tsuna) #1

Hi,
I have a field I want to facet on, and it happens that this field is
often made of 2 or 3 words. When I use term faceting on this field,
it works but the results I get back are tokenized, so if my field was
"a b c" I'm going to get 3 things back (like "a": 1, "b": 1, "c": 1 –
instead of "a b c": 1).

In a previous post [1], Shay said "You need to have the field you run
the terms facet on marked as not_analyzed in its mapping. You will
need to reindex the data in order for that to take effect (and of
course, create the mappings before you index data)."

Is there a way to work around this limitation? My mapping is created
dynamically and I like it this way. I've already indexed several GB of
data, and I'm not sure how I would go about re-indexing everything.
(I only have a copy of the data in ES because it isn't very important
data, but it's kind of a bummer that I'd have to re-index it or nuke
it just because the field I want to facet on is not not_analyzed.)

[1] http://elasticsearch-users.115913.n3.nabble.com/terms-facet-is-tokenizing-a-field-with-special-characters-tp2658666p2659338.html

--
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com


(Shay Banon) #2

You have several options now:

  1. Upgrade the mapping of the specific field to a multi_field mapping with an additional internal mapping that is index: not_analyzed. This will apply only to newly created docs.
  2. If you want, you can use dynamic templates to apply that to all string fields, though that will require more memory per shard. It's better to control which fields you really want to do it for.
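As a rough illustration of option 2, here is a minimal sketch of a dynamic template that maps every string field as a multi_field with an extra not_analyzed sub-field. It assumes the dynamic_templates and multi_field syntax from the 0.x-era mapping docs; the template name `strings_with_raw_copy`, the sub-field name `untouched`, and the type name `my_type` are made up for the example.

```python
import json


def string_multi_field_template(sub_field="untouched"):
    """Build a root-object mapping fragment with one dynamic template.

    Assumption: syntax follows the 0.x-era dynamic_templates / multi_field
    docs. `sub_field` is a hypothetical name chosen for this example.
    """
    return {
        "dynamic_templates": [
            {
                "strings_with_raw_copy": {  # template name is arbitrary
                    "match": "*",
                    "match_mapping_type": "string",
                    "mapping": {
                        "type": "multi_field",
                        "fields": {
                            # "{name}" is expanded to the matched field's name
                            "{name}": {"type": "string", "index": "analyzed"},
                            # extra copy kept as a single, untokenized term
                            sub_field: {"type": "string", "index": "not_analyzed"},
                        },
                    },
                }
            }
        ]
    }


# "my_type" is a hypothetical document type.
mapping = {"my_type": string_multi_field_template()}
print(json.dumps(mapping, indent=2))
```

Since every string field gets a second indexed copy, this is where the extra memory per shard mentioned above comes from.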

In the future, there will be an option to build the field cache values (the field cache is used to do faceting) from the _source. The values will then behave as if the field were not_analyzed, though the field cache will take more time to build.

-shay.banon


(tsuna) #3

On Fri, Apr 22, 2011 at 12:53 AM, Shay Banon
shay.banon@elasticsearch.com wrote:

You have several options now:

  1. Upgrade specific mapping types create to multi_field mapping with
    additional internal mapping that is index : not_analyzed. This will apply to
    newly created docs.
  2. If you want, you can use dynamic templates to apply that to all string
    fields, though that will require more memory per shard. It's better to
    control which fields you really want to do it for.

Does option 2 allow me to do what I want without re-indexing, and with
old docs? If yes, can you please explain in a little more detail
how I go about doing that?

Otherwise, sorry if this is a FAQ, but I still don't see how to
re-index everything with ES. If my data is only stored in ES, how can
I dump it out and send it back in, in an efficient way? Can I create
a new index and have ES do the work automagically for me from all the
_source fields?

--
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com


(Clinton Gormley) #4

Hi Benoit

  2. If you want, you can use dynamic templates to apply that to all string
    fields, though that will require more memory per shard. It's better to
    control which fields you really want to do it for.

Does option 2 allow me to do what I want without re-indexing, and with
old docs? If yes, can you please explain in a little more detail
how I go about doing that?

No. For any mapping change to take effect, you need to reindex existing
documents.

Otherwise, sorry if this is a FAQ, but I still don't see how to
re-index everything with ES. If my data is only stored in ES, how can
I dump it out and send it back in, in an efficient way? Can I create
a new index and have ES do the work automagically for me from all the
_source fields?

First, use index aliases instead of using indices directly, so you can
do:

  1. Create MyAlias to point to OldIndex
  2. Have your application use MyAlias
  3. Create NewIndex
  4. Index from OldIndex to NewIndex
  5. Update MyAlias to point to NewIndex
  6. Delete OldIndex

For more on aliases, see:
http://www.elasticsearch.org/guide/reference/api/admin-indices-aliases.html
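
The alias switch can be sketched as a single request body for the aliases API, which takes a list of add/remove actions. This is only an illustration: the lowercase names `myalias`, `oldindex` and `newindex` are placeholders, and the body would be POSTed to `/_aliases` on your cluster.

```python
import json


def alias_swap_actions(alias, old_index, new_index):
    """Build the body that moves `alias` from old_index to new_index.

    Both actions travel in one request, so the application never sees
    the alias pointing at nothing.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Placeholder names; POST the printed JSON to /_aliases.
body = alias_swap_actions("myalias", "oldindex", "newindex")
print(json.dumps(body))
```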

You will need to write a script to read from your old index and bulk
index to your new index.
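
One piece of such a script, sketched under assumptions: turning search hits read from the old index into a bulk request body for the new one. The function name `hits_to_bulk_body` is made up for this example; it assumes each hit carries `_type`, `_id` and `_source`, and that the result is POSTed to `/_bulk`.

```python
import json


def hits_to_bulk_body(hits, new_index):
    """Convert search hits into a newline-delimited bulk body.

    Each document becomes two lines: an "index" action line naming the
    target index, type and id, followed by the original _source.
    """
    lines = []
    for hit in hits:
        action = {
            "index": {
                "_index": new_index,
                "_type": hit["_type"],
                "_id": hit["_id"],
            }
        }
        lines.append(json.dumps(action))
        lines.append(json.dumps(hit["_source"]))
    # The bulk format requires a trailing newline after the last line.
    return ("\n".join(lines) + "\n") if lines else ""


# Hypothetical hit, shaped like an entry of hits.hits in a search response.
sample_hits = [{"_type": "page", "_id": "1", "_source": {"tag": "a b c"}}]
print(hits_to_bulk_body(sample_hits, "newindex"), end="")
```

Keeping the original `_id` means re-running the script overwrites documents instead of duplicating them.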

clint


(Shay Banon) #5

I was talking about the multi_field option: upgrading the existing mapping for a field to be multi_field. See http://www.elasticsearch.org/guide/reference/mapping/multi-field-type.html. I suggest you play with this a bit locally to see how it can be done. Basically, you just need to put_mapping the delta change (in your case, the mapping for the specific field you want to upgrade to a multi_field mapping). Note that when doing so, only new data will be indexed using the new mapping.
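
A minimal sketch of what that delta and the follow-up facet request might look like, assuming the multi_field syntax from the page linked above. The index `myindex`, type `page`, field `tag`, sub-field `untouched` and the localhost URL are all hypothetical; the request object is only built here, not sent.

```python
import json
from urllib.request import Request


def multi_field_delta(doc_type, field, sub_field="untouched"):
    """Mapping delta that upgrades one string field to multi_field."""
    return {
        doc_type: {
            "properties": {
                field: {
                    "type": "multi_field",
                    "fields": {
                        # default sub-field keeps the analyzed behaviour
                        field: {"type": "string", "index": "analyzed"},
                        # extra sub-field indexed as one untokenized term
                        sub_field: {"type": "string", "index": "not_analyzed"},
                    },
                }
            }
        }
    }


def put_mapping_request(base_url, index, doc_type, mapping):
    """Build (but do not send) the PUT request for the put_mapping API."""
    url = "%s/%s/%s/_mapping" % (base_url.rstrip("/"), index, doc_type)
    return Request(url, data=json.dumps(mapping).encode("utf-8"), method="PUT")


def terms_facet_body(field, sub_field="untouched", size=10):
    """Terms facet that reads the not_analyzed sub-field."""
    return {
        "query": {"match_all": {}},
        "facets": {
            field: {"terms": {"field": "%s.%s" % (field, sub_field), "size": size}}
        },
    }


req = put_mapping_request(
    "http://localhost:9200", "myindex", "page", multi_field_delta("page", "tag")
)
print(req.get_method(), req.full_url)
print(json.dumps(terms_facet_body("tag")))
```

The facet then targets `tag.untouched` rather than `tag`, and it only sees documents indexed after the mapping change.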

As Clinton suggested, you can reindex data from one index to the other. The new scan search type in the upcoming 0.16 has good support for that (http://www.elasticsearch.org/guide/reference/api/search/search-type.html). There is no built-in ability to reindex data from one index to another (at least not currently).
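
The read side of such a reindex can be sketched as a loop over scroll pages. This assumes the scan behaviour described in the search-type docs: the initial request returns only a scroll id, and each scroll request returns a batch of hits until an empty batch signals the end. The two callables are hypothetical stand-ins for your HTTP code, and the fake pages below exist only to make the sketch runnable.

```python
def scan_all_hits(start_scan, fetch_page):
    """Yield every hit from a scan-type search.

    start_scan()         -> first response dict (carries "_scroll_id")
    fetch_page(scroll_id) -> next response dict (hits plus a scroll id)
    Both callables are placeholders for real HTTP requests.
    """
    scroll_id = start_scan()["_scroll_id"]
    while True:
        page = fetch_page(scroll_id)
        hits = page["hits"]["hits"]
        if not hits:
            break  # an empty batch means the scan is exhausted
        for hit in hits:
            yield hit
        # keep using the most recent scroll id the server handed back
        scroll_id = page.get("_scroll_id", scroll_id)


# Fake transport for demonstration only.
_fake_pages = [
    {"_scroll_id": "s2", "hits": {"hits": [{"_id": "1"}, {"_id": "2"}]}},
    {"_scroll_id": "s3", "hits": {"hits": [{"_id": "3"}]}},
    {"_scroll_id": "s4", "hits": {"hits": []}},
]


def fake_start_scan():
    return {"_scroll_id": "s1", "hits": {"hits": []}}


def fake_fetch_page(scroll_id):
    return _fake_pages[int(scroll_id[1:]) - 1]


print([hit["_id"] for hit in scan_all_hits(fake_start_scan, fake_fetch_page)])
```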

-shay.banon


(tsuna) #6

OK I think I'm going to delete all the data and start over... It
seems that re-indexing everything is going to take me too much time,
since there's nothing out-of-the-box to dump-and-reupload everything
unchanged (correct me if I'm wrong).

I'm curious though: can anyone explain the technical reason as to why
things work the way they work? Why is ES unable to NOT tokenize the
field when faceting? I'm guessing it has to do with the way Lucene
works and how the data is indexed, but I'd be curious to know the
implementation details that explain this behavior.

--
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com


(Shay Banon) #7

On Friday, April 22, 2011 at 10:08 PM, tsuna wrote:

OK I think I'm going to delete all the data and start over... It
seems that re-indexing everything is going to take me too much time,
since there's nothing out-of-the-box to dump-and-reupload everything
unchanged (correct me if I'm wrong).

Really, up to you. The scan API is pretty simple to use and writing a script to reindex data is quite simple. No, there is nothing out of the box for it.

I'm curious though: can anyone explain the technical reason as to why
things work the way they work? Why is ES unable to NOT tokenize the
field when faceting? I'm guessing it has to do with the way Lucene
works and how the data is indexed, but I'd be curious to know the
implementation details that explain this behavior.

In short, facet calculation (usually) involves uninverting the inverted index: going from an index that maps each token to the list of doc ids it occurs in (very simplified) to one that maps docId -> value (docId is a Lucene internal document id). Then, when processing hits, we can get those values quickly per doc.

The process of uninverting the data for a field is done by iterating over the field's tokens (which are created when indexing the field data). This is a very fast process (think scanning in HBase). For this reason, the way the field was indexed is important, since it controls which tokens are created.

That said, there isn't really a technical reason not to also build the "uninverted" index (in Lucene terms it's called the field cache, though ES has its own version of it) from the _source field, thus not taking the analysis process into account. It will be slower to build, though.
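
To make the uninverting idea concrete, here is a toy model in plain Python. It is not how Lucene or ES implement the field cache; the function names and the whitespace "analyzer" are invented for the example. It only shows why the tokens created at index time decide what a terms facet can return.

```python
from collections import Counter, defaultdict


def whitespace_analyze(text):
    """Stand-in for an analyzed field: one token per word."""
    return text.lower().split()


def keyword_analyze(text):
    """Stand-in for a not_analyzed field: the whole value is one token."""
    return [text]


def build_inverted_index(docs, analyze):
    """token -> set of doc ids containing it."""
    inverted = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in analyze(text):
            inverted[token].add(doc_id)
    return inverted


def uninvert(inverted):
    """Iterate over tokens and flip the index into doc id -> values."""
    field_cache = defaultdict(list)
    for token in sorted(inverted):
        for doc_id in inverted[token]:
            field_cache[doc_id].append(token)
    return field_cache


def terms_facet(field_cache, hit_doc_ids):
    """Count values across the matching docs using the uninverted data."""
    counts = Counter()
    for doc_id in hit_doc_ids:
        counts.update(field_cache.get(doc_id, []))
    return dict(counts)


docs = ["a b c", "a b c", "x y"]
analyzed = uninvert(build_inverted_index(docs, whitespace_analyze))
untouched = uninvert(build_inverted_index(docs, keyword_analyze))
print(terms_facet(analyzed, [0, 1]))   # {'a': 2, 'b': 2, 'c': 2}
print(terms_facet(untouched, [0, 1]))  # {'a b c': 2}
```

Once the field has been tokenized, the original "a b c" no longer exists as a term in the index, so the uninverted data can only hand back the individual tokens.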
