The lowercase tokenizer, like the letter tokenizer, breaks text into terms whenever it encounters a character which is not a letter, but it also lowercases all terms.
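You can watch this behavior directly with the _analyze API. A minimal sketch, assuming a running Elasticsearch cluster (the sample text is invented):

```
POST _analyze
{
  "tokenizer": "lowercase",
  "text": "John's E-Mail: Foo_Bar99"
}
```

This should return the tokens john, s, e, mail, foo, bar: every non-letter character (apostrophe, hyphen, colon, underscore, digits) splits the text, and the remaining letter runs are lowercased.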
Only one tokenizer can be defined per analyzer. Keep in mind that tokenizers and token filters are different things: the tokenizer runs first in the analysis chain, and the token filters then operate on the tokens it produces.
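To make that ordering concrete, here is a hedged sketch using the built-in standard tokenizer plus two token filters (sample text invented; the stop filter defaults to the English stop word list):

```
POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": "The Quick Brown Foxes"
}
```

The tokenizer emits The, Quick, Brown, Foxes; the lowercase filter rewrites them, and the stop filter then drops the now-lowercased stop word the, leaving quick, brown, foxes. Reversing the two filters would change the result, since the stop filter is case-sensitive by default.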
The lowercase tokenizer [1] is based on the letter tokenizer [2], which
simply breaks on non-letter characters. The standard tokenizer [3] is far
more complex, with various rules mostly based on the English language. It
all depends on your corpus and use cases. Data such as names and titles could use a simpler letter tokenizer, but free-form text that might include URLs or email addresses is probably best tokenized by the standard tokenizer.
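As a hedged illustration of that difference (sample text invented; the token output is what I would expect, so verify on your own cluster):

```
POST _analyze
{
  "tokenizer": "letter",
  "text": "Email me at info@example.com"
}

POST _analyze
{
  "tokenizer": "standard",
  "text": "Email me at info@example.com"
}
```

The letter tokenizer should yield Email, me, at, info, example, com, splitting on both the @ and the dot, while the standard tokenizer keeps example.com together (UAX #29 does not break on a dot between letters). Neither keeps the whole address as one token; the uax_url_email tokenizer exists for that.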
As an aside (unrelated to the original question), the English part of that statement is not true: the standard tokenizer is based on the Unicode Text Segmentation algorithm. See UAX #29: Unicode Text Segmentation. The standard analyzer does have some English-specific behavior, namely the default set of English stop words.
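For what it's worth, that English-specific part is configurable. A minimal sketch of a standard analyzer with the English stop word list enabled (index and analyzer names are made up):

```
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_english": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}
```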
Very true, Ryan. I meant to say based on Latin-character-set languages, but even that is false. I hope the OP sees the difference between tokenizers and token filters, especially between the standard tokenizer and the standard token filter: the former does tons of work, the latter does nothing!
Okay, then I am mixing up the terms (I am really confused now). I thought a token filter was made up of one or more tokenizers (that's at least what I made of this text).
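It is the other way around: a token filter is not built from tokenizers, it transforms the token stream that a single tokenizer produces. Part of the confusion is that some names exist as both a tokenizer and a token filter. A hedged sketch showing two equivalent pipelines (sample text invented):

```
POST _analyze
{
  "tokenizer": "lowercase",
  "text": "Quick Brown Fox"
}

POST _analyze
{
  "tokenizer": "letter",
  "filter": ["lowercase"],
  "text": "Quick Brown Fox"
}
```

Both requests should produce quick, brown, fox: the first does the splitting and lowercasing inside one tokenizer, the second splits with the letter tokenizer and lowercases afterwards with a token filter.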
A diagram can be found here: https://www.elastic.co/blog/found-text-analysis-part-1 The concepts come straight from Lucene, so any information sources covering analysis in Lucene/Solr will apply to Elasticsearch if you care to read more.
That diagram does not highlight the fact that you can have several character filters and token filters, but only one tokenizer. A typical chain is: character filters (seldom used, mainly for pattern removal or substitution), then a single tokenizer, followed by several token filters that work on the tokens generated by the tokenizer. Chances are you want to focus on the token filters.
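Putting the whole chain together, a hedged sketch of a custom analyzer (all names and the pattern are invented for illustration) with several character filters and token filters but exactly one tokenizer:

```
PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "strip_digits": {
          "type": "pattern_replace",
          "pattern": "\\d+",
          "replacement": ""
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip", "strip_digits"],
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding", "stop"]
        }
      }
    }
  }
}
```

The character filters rewrite the raw input string, the single tokenizer chops it into tokens, and each token filter transforms the stream in the order listed.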