Kuromoji analyzer default character/token filters

Luke3113 · December 5, 2016, 3:21am

Hi, I'm using Elasticsearch 5.0.1 with the Kuromoji plugin.
Until now I've only used its default configuration, via the following mapping:

"analyzer": "kuromoji"

However, as stated in the docs (Kuromoji analyzer), it consists of a character filter, a tokenizer and various token filters. Some of these seem to be applied in the analyzer's default settings (e.g. kuromoji_baseform), others not (e.g. the kuromoji_number token filter).

I would like to know which filters are used in Kuromoji's default settings. Is there any way to find out with an API call?
Since this doesn't seem to be documented, I've tried looking into the plugin sources: KuromojiAnalyzerProvider.java
However, there don't seem to be any character filters or tokenizers defined there.

Thank you!

dadoonet · December 5, 2016, 5:18am

Maybe look here:

https://github.com/apache/lucene-solr/blob/master/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseAnalyzer.java


Luke3113 · December 5, 2016, 7:29am

Oh great, thanks! I didn't know the JapaneseAnalyzer is also part of Kuromoji.
So it seems from this file that the following components are used by default:

kuromoji_tokenizer - tokenizer
kuromoji_baseform - token filter
kuromoji_part_of_speech - token filter
cjk_width - token filter
ja_stop - token filter
kuromoji_stemmer - token filter
lowercase - token filter
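
For anyone who wants to spell this chain out as a custom analyzer (for example, to add or remove a filter later), here is a minimal sketch in Python that only builds the index-settings body. It assumes the components are applied in the order listed above; the analyzer name `my_kuromoji` is a hypothetical name picked for the example, and nothing is sent to a cluster.

```python
import json

# Components from the list above, assumed to be applied in this order.
DEFAULT_TOKENIZER = "kuromoji_tokenizer"
DEFAULT_TOKEN_FILTERS = [
    "kuromoji_baseform",
    "kuromoji_part_of_speech",
    "cjk_width",
    "ja_stop",
    "kuromoji_stemmer",
    "lowercase",
]


def build_index_settings(analyzer_name="my_kuromoji"):
    """Return an index-settings body that defines a custom analyzer.

    `analyzer_name` is a hypothetical name chosen for this example.
    """
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    analyzer_name: {
                        "type": "custom",
                        "tokenizer": DEFAULT_TOKENIZER,
                        # Copy the list so callers can edit it safely.
                        "filter": list(DEFAULT_TOKEN_FILTERS),
                    }
                }
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON body that could be used when creating an index.
    print(json.dumps(build_index_settings(), indent=2))
```

The printed JSON could be used as the body of an index-creation request, with `"analyzer": "my_kuromoji"` in the mapping instead of `"analyzer": "kuromoji"`.
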

These don't seem to be used by default:

kuromoji_iteration_mark - character filter
kuromoji_number - token filter
kuromoji_readingform - token filter
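
One way to sanity-check this reading might be to send the same text to the `_analyze` API twice, once with the built-in `kuromoji` analyzer and once with the tokenizer and token filters spelled out, and compare the tokens that come back. The sketch below only builds the two request bodies in Python; the sample text is arbitrary, the helper names are made up for this example, and the filter order is assumed to be the one listed above.

```python
import json

# Arbitrary sample text, chosen only for illustration.
SAMPLE_TEXT = "東京スカイツリー"

# Token filters assumed (from the list above) to make up the default chain.
ASSUMED_DEFAULT_FILTERS = [
    "kuromoji_baseform",
    "kuromoji_part_of_speech",
    "cjk_width",
    "ja_stop",
    "kuromoji_stemmer",
    "lowercase",
]


def builtin_request(text):
    """Body for _analyze using the built-in kuromoji analyzer."""
    return {"analyzer": "kuromoji", "text": text}


def explicit_request(text):
    """Body for _analyze using the explicitly listed tokenizer and filters."""
    return {
        "tokenizer": "kuromoji_tokenizer",
        "filter": list(ASSUMED_DEFAULT_FILTERS),
        "text": text,
    }


if __name__ == "__main__":
    # Each body would be sent to the _analyze endpoint of a running cluster.
    for body in (builtin_request(SAMPLE_TEXT), explicit_request(SAMPLE_TEXT)):
        print(json.dumps(body, ensure_ascii=False))
```

If both requests return the same tokens for a range of inputs, that would support the list above. Adding `kuromoji_number` or `kuromoji_readingform` to the explicit chain would then be a way to see what those filters change.
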

johtani · December 5, 2016, 8:03am

As additional information, the stoptags and stopwords are here:

https://github.com/apache/lucene-solr/tree/master/lucene/analysis/kuromoji/src/resources/org/apache/lucene/analysis/ja

system · January 2, 2017, 8:03am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
