Setting up a custom analyzer

Karolinebryn · September 22, 2017, 12:18pm

Hi!
I am creating a custom analyzer for one of my indexes, and I have some questions about the lowercase and standard token filters.

In the documentation it says this about the lowercase tokenizer:

The lowercase tokenizer, like the letter tokenizer breaks text into terms whenever it encounters a character which is not a letter, but it also lowercases all terms.

While this is said about the standard tokenizer:

The standard tokenizer provides grammar based tokenization

Does this mean that there is no point in using them both? Does the lowercase tokenizer overlap with the standard tokenizer?

Ivan · September 22, 2017, 5:21pm

Only one tokenizer can be defined per analyzer. Keep in mind that
tokenizers and token filters are different items, with the former being
executed first (of the two) in the analysis chain.

The lowercase tokenizer [1] is based on the letter tokenizer [2], which
simply breaks on non-letter characters. The standard tokenizer [3] is far
more complex, with various rules mostly based on the English language. It
all depends on your corpus and use cases. Data such as names and titles
could use a simpler letter tokenizer, but free-form text that might
include URLs or email addresses is probably best tokenized by the standard
tokenizer.
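
To make the letter/lowercase behaviour concrete, here is a rough Python sketch. It is not the Lucene code, only an approximation that assumes "letter" means whatever Python's `str.isalpha` accepts, and the function names are made up for illustration:

```python
from itertools import groupby


def letter_tokenize(text):
    """Approximate a letter tokenizer: keep maximal runs of letters, drop the rest."""
    return [
        "".join(run)
        for is_letter, run in groupby(text, key=str.isalpha)
        if is_letter
    ]


def lowercase_tokenize(text):
    """Approximate a lowercase tokenizer: letter tokenization plus lowercasing."""
    return [token.lower() for token in letter_tokenize(text)]


print(letter_tokenize("The 2 QUICK Brown-Foxes"))
# ['The', 'QUICK', 'Brown', 'Foxes']
print(lowercase_tokenize("The 2 QUICK Brown-Foxes"))
# ['the', 'quick', 'brown', 'foxes']
```

Note how the digit and the hyphen simply disappear: anything that is not a letter acts as a separator.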

[1] https://github.com/apache/lucene-solr/blob/branch_6x/lucene/analysis/common/src/java/org/apache/lucene/analysis/core/LowerCaseTokenizer.java

[2] https://github.com/apache/lucene-solr/blob/branch_6x/lucene/analysis/common/src/java/org/apache/lucene/analysis/core/LetterTokenizer.java

[3] https://github.com/apache/lucene-solr/blob/branch_6x/lucene/core/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java

Cheers,

Ivan

rjernst · September 22, 2017, 7:09pm

As an aside (unrelated to the original question), the claim that the standard tokenizer's rules are mostly based on English is not true. The tokenizer is based on the Unicode Text Segmentation algorithm. See UAX #29: Unicode Text Segmentation. The standard analyzer does have some English-specific behavior, specifically the default set of English stop words.

Ivan · September 22, 2017, 8:57pm

Very true Ryan. I meant to say based on Latin character set languages, but
even that is false. I hope that the OP sees the difference between
tokenizers and token filters, especially for the standard tokenizer/token
filter. The former does tons, the latter does nothing!
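
A rough way to picture the difference, as a plain Python sketch (this is not Elasticsearch or Lucene code, and all function names here are hypothetical): a tokenizer turns one string into a list of tokens, while a token filter takes a list of tokens and returns a modified list.

```python
def whitespace_tokenizer(text):
    """Tokenizer: consumes raw text and produces tokens."""
    return text.split()


def lowercase_filter(tokens):
    """Token filter: consumes tokens and produces modified tokens."""
    return [token.lower() for token in tokens]


def stop_filter(tokens, stopwords=frozenset({"the", "a", "an"})):
    """Token filter: drops tokens that appear in a small stop word set."""
    return [token for token in tokens if token not in stopwords]


def analyze(text, tokenizer, token_filters):
    """Run exactly one tokenizer, then each token filter in order."""
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


print(analyze("The Quick Brown Fox", whitespace_tokenizer,
              [lowercase_filter, stop_filter]))
# ['quick', 'brown', 'fox']
```

The order of the filters matters here: the stop word filter only matches "the" because the lowercase filter ran before it.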

Karolinebryn · September 25, 2017, 7:36am

Okay, then I am mixing up the terms (I am really confused now). I thought a token filter was made up of one or more tokenizers (that's at least what I made of this text).

Ivan · September 25, 2017, 2:50pm

Analyzers are made up of filters and tokenizers as described here
https://www.elastic.co/guide/en/elasticsearch/reference/current/analyzer-anatomy.html

A diagram can be found here:
https://www.elastic.co/blog/found-text-analysis-part-1 The concepts come
straight from Lucene, so any information sources about analysis in
Lucene/Solr will apply to Elasticsearch if you care to read more.

That diagram does not highlight the fact that you can have several
character filters and token filters, but only one tokenizer. In general,
character filters are seldom used (mainly for pattern removal or
substitution). A typical analyzer has a simple tokenizer, followed by
several token filters which work on the tokens generated by the tokenizer.
Chances are you want to focus on the token filters.
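
As a minimal sketch of what that looks like in index settings, here is the shape of a custom analyzer written as a Python dict and printed as JSON. The analyzer name `my_custom_analyzer` and the particular choice of filters are assumptions for illustration only; pick whatever fits your data.

```python
import json

# Hypothetical index settings: one optional list of character filters,
# exactly one tokenizer, and an ordered list of token filters.
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_custom_analyzer": {  # name chosen for this example
                    "type": "custom",
                    "char_filter": ["html_strip"],            # zero or more
                    "tokenizer": "standard",                  # exactly one
                    "filter": ["lowercase", "asciifolding"],  # zero or more, applied in order
                }
            }
        }
    }
}

# This JSON could be used as the request body when creating the index.
print(json.dumps(index_settings, indent=2))
```

Here the standard tokenizer does the splitting and the lowercase token filter does the lowercasing, so there is no need for the lowercase tokenizer at all.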

