Summary of index types


(James Cook) #1

I don't have a Lucene background, so this may be rudimentary.

The index types are defined here with an explanation:
http://www.elasticsearch.org/guide/reference/mapping/core-types.html

analyzed (default)

Indexed, Searchable, Tokenized

not_analyzed

Indexed, Searchable, Not Tokenized

no

Not Indexed, Not Searchable, Not Tokenized

The first option (analyzed) and the last (no) seem to make a lot of sense,
and I understand them.

However, I do struggle to come up with a use case where not_analyzed should
be used. Since it is not tokenized, I would expect a match only when the
exact same search term is provided in a query. In fact, since it is not
tokenized, it has to match right down to stop words, whitespace and case,
correct? Maybe for matching on a hash code, a case-insensitive username, or
a zip code?

If I have a not_analyzed field containing "XY&Z Company", will I only get a
match if I query for "XY&Z Company"?
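To make the question concrete, here is a rough Python sketch (not actual Lucene code) of why a not_analyzed field only matches the exact stored value, while an analyzed field matches individual tokens. The `analyze` function is a crude stand-in for a standard-style analyzer:

```python
# Rough illustration (not Lucene internals) of analyzed vs. not_analyzed matching.
def analyze(text):
    """Crude stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()

stored = "XY&Z Company"

# Analyzed field: the query term is matched against individual tokens.
analyzed_index = set(analyze(stored))
print("company" in analyzed_index)           # True - "company" is a token

# not_analyzed field: the whole value is one token, case, whitespace and all.
not_analyzed_index = {stored}
print("XY&Z Company" in not_analyzed_index)  # True - exact value
print("xy&z company" in not_analyzed_index)  # False - case differs
```

So yes: with not_analyzed, only the exact value "XY&Z Company" (including case and the ampersand) would match.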


(ppearcy) #2

We use not_analyzed for generating facet results that can be used for
display purposes. Also, for fields that are used as exact-match
filters, this can be appropriate, similar to what you were guessing.

Best Regards,
Paul



(Shay Banon) #3

Just another note on not_analyzed fields: they are exactly the same as
fields analyzed with a keyword tokenizer (the keyword tokenizer simply
treats the whole text as a single token). People often want this behavior
but also want things like lowercasing; in that case, you can create a
custom analyzer that has a keyword tokenizer and a lowercase filter, and
use that as the analyzer for the field (and the field will still be
"analyzed").
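The keyword-tokenizer-plus-lowercase combination Shay describes can be sketched in a few lines of Python (a simulation of the behavior, not the Lucene implementation):

```python
# Sketch of a keyword tokenizer followed by a lowercase token filter.
def keyword_lowercase(text):
    token = text            # keyword tokenizer: the whole text is one token
    return [token.lower()]  # lowercase filter: normalize the single token

print(keyword_lowercase("XY&Z Company"))  # ['xy&z company']
```

Because the same analyzer runs at index time and query time, this gives a case-insensitive exact match on the whole value.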



(James Cook) #4

I had found this Lucene presentation
(http://www.slideshare.net/otisg/lucene-introduction) by Otis Gospodnetic.
He presented the built-in analyzers in a very clear way.

The quick brown fox jumped over the lazy dogs.

WhitespaceAnalyzer:

[The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]

SimpleAnalyzer:

[the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]

StopAnalyzer:

[quick] [brown] [fox] [jumped] [over] [lazy] [dogs] 

StandardAnalyzer:

[quick] [brown] [fox] [jumped] [over] [lazy] [dogs] 

"XY&Z Corporation - xyz@example.com"

WhitespaceAnalyzer:

[XY&Z] [Corporation] [-] [xyz@example.com] 

SimpleAnalyzer:

[xy] [z] [corporation] [xyz] [example] [com] 

StopAnalyzer:

[xy] [z] [corporation] [xyz] [example] [com] 

StandardAnalyzer:

[xy&z] [corporation] [xyz@example.com]

As Shay stated, we use a keyword tokenizer with a lowercase filter for
username lookups. I suppose based on the analyzer example above, we could
use the SimpleAnalyzer to achieve a similar result?

index.analysis.analyzer.lowercase_keyword.type=custom
index.analysis.analyzer.lowercase_keyword.tokenizer=keyword
index.analysis.analyzer.lowercase_keyword.filter.0=lowercase


(Shay Banon) #5

The simple analyzer breaks text into tokens at non-letter characters, so
it's different from the keyword tokenizer, which treats the whole text as
a single token.
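The difference matters for the username use case above. Here is a small Python sketch (an approximation using regex splitting, not Lucene itself) with a hypothetical username value:

```python
import re

# Approximation of Lucene's SimpleAnalyzer: split at non-letters, lowercase.
def simple_analyze(text):
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

# Approximation of a keyword tokenizer + lowercase filter.
def keyword_lowercase(text):
    return [text.lower()]

print(simple_analyze("xy&z_user.42"))     # ['xy', 'z', 'user'] - split apart
print(keyword_lowercase("xy&z_user.42"))  # ['xy&z_user.42'] - kept whole
```

A username containing digits or punctuation would be split into pieces by the simple analyzer, so it is not a substitute for keyword + lowercase when you need exact lookups.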



(James Cook) #6

Great information. Even though I've been using ES for 18 months, it seems
I'm still learning the very basics.

Can't wait for the book! :)

