Kuromoji analyzer default character/token filters


(Lukas Schmid) #1

Hi, I'm using Elasticsearch 5.0.1 with the Kuromoji plugin.
Until now I've merely used its default configuration by using the following mapping:

"analyzer": "kuromoji"

However, as stated in the docs (Kuromoji analyzer), it consists of a character filter, a tokenizer and various token filters. Some of these seem to be applied in the analyzer's default settings (e.g. kuromoji_baseform), others not (e.g. the kuromoji_number token filter).

I would like to know which filters are used in Kuromoji's default settings. Is there any way to find out with an API call?
I've tried looking into the plugin sources, as this doesn't seem to be documented: KuromojiAnalyzerProvider.java
However, there don't seem to be any character filters or tokenizers defined there.

Thank you!


(David Pilato) #2

Maybe look here:


(Lukas Schmid) #3

Oh great thanks, I didn't know the JapaneseAnalyzer is also part of Kuromoji!
So it seems from this file that the following components are used by default:

  • kuromoji_tokenizer - tokenizer
  • kuromoji_baseform - token filter
  • kuromoji_part_of_speech - token filter
  • cjk_width - token filter
  • ja_stop - token filter
  • kuromoji_stemmer - token filter
  • lowercase - token filter

These don't seem to be used by default:

  • kuromoji_iteration_mark - character filter
  • kuromoji_number - token filter
  • kuromoji_readingform - token filter
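
For reference, the default chain listed above could probably be spelled out as a custom analyzer, which makes it easier to add or drop individual components. Below is a minimal sketch in Python that only builds the index settings body (nothing is sent to a cluster). The analyzer name my_kuromoji and the helper build_kuromoji_settings are made up for illustration, and it assumes the default settings of each component match what the built-in analyzer uses:

```python
import json

# Token filter order as read from JapaneseAnalyzer (see the list above).
DEFAULT_TOKEN_FILTERS = [
    "kuromoji_baseform",
    "kuromoji_part_of_speech",
    "cjk_width",
    "ja_stop",
    "kuromoji_stemmer",
    "lowercase",
]


def build_kuromoji_settings(extra_char_filters=()):
    """Return index settings for a custom analyzer mirroring the default chain.

    "my_kuromoji" is a hypothetical analyzer name chosen for this example.
    """
    analyzer = {
        "type": "custom",
        "tokenizer": "kuromoji_tokenizer",
        "filter": list(DEFAULT_TOKEN_FILTERS),
    }
    # Character filters are only added on request, since none are on by default.
    if extra_char_filters:
        analyzer["char_filter"] = list(extra_char_filters)
    return {"settings": {"analysis": {"analyzer": {"my_kuromoji": analyzer}}}}


# Example: the default chain plus the iteration mark character filter.
settings = build_kuromoji_settings(
    extra_char_filters=["kuromoji_iteration_mark"],
)
print(json.dumps(settings, indent=2))
```

The printed body would go into an index creation request. With a custom analyzer like this, the _analyze API with "explain": true should, as far as I know, show the output of each stage separately.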

(Jun Ohtani) #4

As additional information, the default stoptags and stopwords are here:

https://github.com/apache/lucene-solr/tree/master/lucene/analysis/kuromoji/src/resources/org/apache/lucene/analysis/ja
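
If the bundled lists need adjusting, the two filters can, as far as I know, be redefined with their own stoptags and stopwords settings. Below is a minimal Python sketch that builds such filter definitions. The filter names my_posfilter and my_ja_stop and the sample values are only illustrative, and "_japanese_" is assumed to refer to the bundled stopword list:

```python
import json

# Hypothetical filter names; the tag and word values are sample data only.
custom_filters = {
    "my_posfilter": {
        "type": "kuromoji_part_of_speech",
        # Part-of-speech tags to drop, written as in stoptags.txt.
        "stoptags": ["助詞-格助詞-一般", "助詞-終助詞"],
    },
    "my_ja_stop": {
        "type": "ja_stop",
        # "_japanese_" is assumed to pull in the bundled list; extra words follow.
        "stopwords": ["_japanese_", "ストップ"],
    },
}

settings = {"settings": {"analysis": {"filter": custom_filters}}}
print(json.dumps(settings, indent=2, ensure_ascii=False))
```

These definitions would then be referenced by name in a custom analyzer's filter list, in place of kuromoji_part_of_speech and ja_stop.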

