Setting up a custom analyzer


(Karoline Brynildsen) #1

Hi!
I am creating a custom analyzer for one of my indexes, and I have some questions about the lowercase and standard token filters.

In the documentation it says this about the lowercase tokenizer:

The lowercase tokenizer, like the letter tokenizer breaks text into terms whenever it encounters a character which is not a letter, but it also lowercases all terms.

While this is said about the standard tokenizer:

The standard tokenizer provides grammar based tokenization

Does this mean that there is no point in using them both? Does the lowercase tokenizer overlap with the standard tokenizer?


(Ivan Brusic) #2

Only one tokenizer can be defined per analyzer. Keep in mind that
tokenizers and token filters are different items, with the former being
executed first (of the two) in the analysis chain.
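
To make the ordering concrete, here is a minimal sketch in plain Python. It is only a toy model of the analysis chain, not Elasticsearch or Lucene code, and the names `analyze`, `whitespace_tokenizer`, and `lowercase_filter` are made up for illustration. It shows one tokenizer running first, with any number of token filters applied to its output afterwards.

```python
from typing import Callable, List, Sequence

# Hypothetical type aliases for this sketch only.
CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]


def analyze(
    text: str,
    tokenizer: Tokenizer,
    char_filters: Sequence[CharFilter] = (),
    token_filters: Sequence[TokenFilter] = (),
) -> List[str]:
    """Toy analysis chain: char filters, then ONE tokenizer, then token filters."""
    for char_filter in char_filters:
        text = char_filter(text)          # operates on the raw string
    tokens = tokenizer(text)              # exactly one tokenizer per analyzer
    for token_filter in token_filters:
        tokens = token_filter(tokens)     # operates on the token list
    return tokens


def whitespace_tokenizer(text: str) -> List[str]:
    """Split on whitespace only (stand-in for a real tokenizer)."""
    return text.split()


def lowercase_filter(tokens: List[str]) -> List[str]:
    """Lowercase every token (stand-in for a lowercase token filter)."""
    return [token.lower() for token in tokens]


print(analyze("Quick BROWN Fox", whitespace_tokenizer,
              token_filters=[lowercase_filter]))
# prints ['quick', 'brown', 'fox']
```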

The lowercase tokenizer [1] is based on the letter tokenizer [2], which
simply breaks on non-letter characters. The standard tokenizer [3] is far
more complex, with various rules mostly based on the English language. It
all depends on your corpus and use cases. Data such as names and titles
could use a simpler letter tokenizer, but free-form text that might
include URLs or email addresses is probably best tokenized by the standard
tokenizer.
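
As a rough illustration of the "breaks on non-letter characters" behavior, here is a small Python approximation. It is not the Lucene implementation: the function names are hypothetical, and it assumes that Python's `str.isalpha` is a close enough stand-in for the letter test the real tokenizer uses.

```python
from itertools import groupby
from typing import List


def letter_tokenize(text: str) -> List[str]:
    """Keep maximal runs of letters; everything else is a separator."""
    return [
        "".join(run)
        for is_letter, run in groupby(text, key=str.isalpha)
        if is_letter
    ]


def lowercase_tokenize(text: str) -> List[str]:
    """Same splitting as letter_tokenize, but lowercases each term."""
    return [term.lower() for term in letter_tokenize(text)]


print(letter_tokenize("The 2 QUICK Brown-Foxes"))
# prints ['The', 'QUICK', 'Brown', 'Foxes']
print(lowercase_tokenize("john@Example.com"))
# prints ['john', 'example', 'com']
```

Note how the digit disappears and the email address is split into three terms, which is why free-form text is usually better served by the standard tokenizer.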

[1]


[2]

[3]

Cheers,

Ivan


(Ryan Ernst) #3

As an aside (unrelated to the original question), the English part of this statement is not true. The standard tokenizer is based on the Unicode Text Segmentation algorithm; see http://unicode.org/reports/tr29/. The standard analyzer does have some English-specific behavior, specifically the default set of English stop words.


(Ivan Brusic) #4

Very true, Ryan. I meant to say based on Latin character set languages, but
even that is false. I hope that the OP sees the difference between
tokenizers and token filters, especially for the standard tokenizer/token
filter. The former does tons, the latter does nothing!


(Karoline Brynildsen) #5

Okay, then I am mixing up the terms (I am really confused now). I thought a token filter was made up of one or more tokenizers (that's at least what I made of this text).


(Ivan Brusic) #6

Analyzers are made up of filters and tokenizers as described here
https://www.elastic.co/guide/en/elasticsearch/reference/current/analyzer-anatomy.html

A diagram can be found here:
https://www.elastic.co/blog/found-text-analysis-part-1 The concepts come
straight from Lucene, so any information sources on analysis in
Lucene/Solr will apply to Elasticsearch if you care to read more.

That diagram does not highlight the fact that you can have several
character filters and token filters, but only one tokenizer. In general,
character filters are seldom used (mainly for pattern removal or
substitution). A typical analyzer has a simple tokenizer followed by
several token filters, which work on the tokens generated by the
tokenizer. Chances are you want to focus on the token filters.
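
For reference, here is a sketch of what such an analyzer definition could look like, written as a Python dict that serializes to the JSON body of an index-creation request. The analyzer name `my_custom_analyzer` is made up, and the particular choice of character filter and token filters is only an example, so adjust it to your data. It pairs the standard tokenizer with the lowercase token filter, which is the usual way to get both behaviors from the original question.

```python
import json

# Illustrative index settings; "my_custom_analyzer" is a hypothetical name.
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_custom_analyzer": {
                    "type": "custom",
                    # Zero or more character filters (optional).
                    "char_filter": ["html_strip"],
                    # Exactly one tokenizer.
                    "tokenizer": "standard",
                    # Zero or more token filters, applied in order.
                    "filter": ["lowercase", "asciifolding"],
                }
            }
        }
    }
}

# Print the JSON that would be sent when creating the index.
print(json.dumps(index_settings, indent=2))
```

Note that `tokenizer` takes a single name while `char_filter` and `filter` take lists.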

