The first option (analyzed) and the last (no) seem to make a lot of sense,
and I understand them.
However, I do struggle to come up with a use case where not_analyzed should
be used. Since it is not tokenized, I would expect a match only when the
exact same search term is provided in a query. In fact, since it is not
tokenized, it has to match right down to stop words, whitespace and case,
correct? Maybe for matching on a hash code, a case-sensitive username, or
a zip code?
If I have a not_analyzed field containing "XY&Z Company", will I only get a
match if I query for "XY&Z Company"?
We use not_analyzed for generating facet results that can be used for
display purposes. It can also be appropriate for fields used in exact-match
filters, similar to what you were guessing.
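As a rough sketch of those two uses, here is a request body written as a Python dict. It assumes the older filtered-query and facets syntax from the same era as not_analyzed, and a hypothetical not_analyzed field called "company"; the facet name "companies" is also made up for illustration.

```python
import json

# Assumption: "company" is a hypothetical field mapped with index: not_analyzed.
search_body = {
    "query": {
        "filtered": {
            "query": {"match_all": {}},
            # A term filter is not analyzed, so it only matches the stored
            # value exactly, including case, whitespace and punctuation.
            "filter": {"term": {"company": "XY&Z Company"}},
        }
    },
    # A terms facet on the same field returns whole values such as
    # "XY&Z Company", which is what makes it usable for display.
    "facets": {
        "companies": {"terms": {"field": "company"}}
    },
}

print(json.dumps(search_body, indent=2))
```

With an analyzed field, the same facet would return individual tokens rather than the full company name.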
Just another note on not_analyzed fields: they behave exactly the same as
fields that are analyzed with a keyword tokenizer (the keyword tokenizer simply
treats the whole text as a single token). People often want this behavior
but also want things like lowercasing. In that case, one can create
a custom analyzer that has a keyword tokenizer and a lowercase filter, and
use that as the analyzer for the field (and the field will still be
"analyzed").
As Shay stated, we use a keyword tokenizer with a lowercase filter for
username lookups. I suppose based on the analyzer example above, we could
use the SimpleAnalyzer to achieve a similar result?
The simple analyzer breaks text into tokens at non-letter characters, so it's
different from the keyword tokenizer, which treats the whole text as a single
token.
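To make the difference concrete, here is a small plain-Python emulation of the three behaviors. It does not call Elasticsearch or Lucene; the function names are made up, and the regex is only an approximation of "split at non-letters".

```python
import re

def not_analyzed(text):
    # The value is indexed as-is, as one token.
    return [text]

def keyword_lowercase(text):
    # Keyword tokenizer (one token) followed by a lowercase filter.
    return [text.lower()]

def simple_analyzer(text):
    # Approximation of the simple analyzer: split at anything that is
    # not a letter, then lowercase each token.
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]

value = "XY&Z Company"
print(not_analyzed(value))       # ['XY&Z Company']
print(keyword_lowercase(value))  # ['xy&z company']
print(simple_analyzer(value))    # ['xy', 'z', 'company']
```

So for username lookups, the simple analyzer would split a name such as "john.smith" into two tokens, while the keyword tokenizer with a lowercase filter keeps it whole.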