Using kuromoji_stemmer and kuromoji_readingform together

tatsuyaoiw · May 11, 2017, 1:51am

Elasticsearch Version: 5.3.1

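I am trying to use the kuromoji_stemmer token filter and the kuromoji_readingform token filter together, so that the trailing prolonged sound mark (ー) is removed and tokens are converted to their katakana reading. However, when kuromoji_readingform is placed after kuromoji_stemmer, the prolonged sound mark that kuromoji_stemmer removed is added back by kuromoji_readingform.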

Example settings:

PUT http://localhost:9200/my_index
{
	"settings": {
		"analysis": {
			"filter": {
				"katakana_readingform": {
					"type": "kuromoji_readingform",
					"use_romaji": false
				}
			},
			"analyzer": {
				"stemmer": {
					"type": "custom",
					"tokenizer": "kuromoji_tokenizer",
					"filter": [
						"kuromoji_stemmer"
					]
				},
				"reading": {
					"type": "custom",
					"tokenizer": "kuromoji_tokenizer",
					"filter": [
						"katakana_readingform"
					]
				},
				"stemmer_reading": {
					"type": "custom",
					"tokenizer": "kuromoji_tokenizer",
					"filter": [
						"kuromoji_stemmer",
						"katakana_readingform"
					]
				},
				"reading_stemmer": {
					"type": "custom",
					"tokenizer": "kuromoji_tokenizer",
					"filter": [
						"katakana_readingform",
						"kuromoji_stemmer"
					]
				}
			}
		}
	}
}

Example query:

POST http://localhost:9200/my_index/_analyze
{
	"analyzer": "stemmer_reading",
	"text": "パーカー"
}

Query result:

{
  "tokens": [
    {
      "token": "パーカー",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    }
  ]
}

Example query (explain=true):

POST http://localhost:9200/my_index/_analyze
{
	"analyzer": "stemmer_reading",
	"text": "パーカー",
	"explain": true
}

Query result (explain=true):

{
  "detail": {
    "custom_analyzer": true,
    "charfilters": [],
    "tokenizer": {
      "name": "kuromoji_tokenizer",
      "tokens": [
        {
          "token": "パーカー",
          "start_offset": 0,
          "end_offset": 4,
          "type": "word",
          "position": 0,
          "baseForm": null,
          "bytes": "[e3 83 91 e3 83 bc e3 82 ab e3 83 bc]",
          "inflectionForm": null,
          "inflectionForm (en)": null,
          "inflectionType": null,
          "inflectionType (en)": null,
          "partOfSpeech": "名詞-一般",
          "partOfSpeech (en)": "noun-common",
          "positionLength": 1,
          "pronunciation": "パーカー",
          "pronunciation (en)": "paka",
          "reading": "パーカー",
          "reading (en)": "paka"
        }
      ]
    },
    "tokenfilters": [
      {
        "name": "kuromoji_stemmer",
        "tokens": [
          {
            "token": "パーカ",
            "start_offset": 0,
            "end_offset": 4,
            "type": "word",
            "position": 0,
            "baseForm": null,
            "bytes": "[e3 83 91 e3 83 bc e3 82 ab]",
            "inflectionForm": null,
            "inflectionForm (en)": null,
            "inflectionType": null,
            "inflectionType (en)": null,
            "keyword": false,
            "partOfSpeech": "名詞-一般",
            "partOfSpeech (en)": "noun-common",
            "positionLength": 1,
            "pronunciation": "パーカー",
            "pronunciation (en)": "paka",
            "reading": "パーカー",
            "reading (en)": "paka"
          }
        ]
      },
      {
        "name": "katakana_readingform",
        "tokens": [
          {
            "token": "パーカー",
            "start_offset": 0,
            "end_offset": 4,
            "type": "word",
            "position": 0,
            "baseForm": null,
            "bytes": "[e3 83 91 e3 83 bc e3 82 ab e3 83 bc]",
            "inflectionForm": null,
            "inflectionForm (en)": null,
            "inflectionType": null,
            "inflectionType (en)": null,
            "keyword": false,
            "partOfSpeech": "名詞-一般",
            "partOfSpeech (en)": "noun-common",
            "positionLength": 1,
            "pronunciation": "パーカー",
            "pronunciation (en)": "paka",
            "reading": "パーカー",
            "reading (en)": "paka"
          }
        ]
      }
    ]
  }
}

Reversing the order of kuromoji_stemmer and kuromoji_readingform seems to work around the problem for now, but I would appreciate an explanation of the cause if anyone knows it.

POST http://localhost:9200/my_index/_analyze
{
	"analyzer": "reading_stemmer",
	"text": "パーカー"
}

{
  "tokens": [
    {
      "token": "パーカ",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    }
  ]
}

Thank you in advance.

johtani · May 11, 2017, 2:08am

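It comes down to ordering.
Filters are applied in the order in which they are specified.
kuromoji_stemmer operates only on the token text; it leaves the reading attribute untouched.
kuromoji_readingform replaces the token text with the string held in the reading attribute.
So when kuromoji_readingform runs after the stemmer, it overwrites the stemmed token with the reading, which restores the prolonged sound mark that the stemmer had removed.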
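The ordering effect can be reproduced with a small Python model. This is only a sketch under simplifying assumptions, not the actual Lucene implementation: the `Token` class and the `stem`, `reading_form`, and `analyze` functions are hypothetical names made up for illustration, the stemmer model skips the real filter's check that the term is katakana, and `minimum_length=4` is assumed to mirror the documented default of kuromoji_stemmer.

```python
from dataclasses import dataclass, replace

PROLONGED_SOUND_MARK = "\u30fc"  # the katakana prolonged sound mark "ー"


@dataclass(frozen=True)
class Token:
    term: str     # token text, i.e. what ends up in the index
    reading: str  # katakana reading attached by the tokenizer


def stem(token, minimum_length=4):
    """Simplified model of kuromoji_stemmer: edits `term` only, never `reading`."""
    if len(token.term) >= minimum_length and token.term.endswith(PROLONGED_SOUND_MARK):
        return replace(token, term=token.term[:-1])
    return token


def reading_form(token):
    """Simplified model of kuromoji_readingform (use_romaji=false):
    overwrites `term` with whatever is stored in `reading`."""
    return replace(token, term=token.reading)


def analyze(token, filters):
    """Apply the filters in the order given, like an analyzer's filter chain."""
    for token_filter in filters:
        token = token_filter(token)
    return token


parka = Token(term="パーカー", reading="パーカー")

# Stemmer first: the stemmed term is overwritten by the untouched reading.
stemmer_reading = analyze(parka, [stem, reading_form])
# Reading form first: the stemmer gets the last word on the term.
reading_stemmer = analyze(parka, [reading_form, stem])

print(stemmer_reading.term)  # パーカー
print(reading_stemmer.term)  # パーカ
```

In this model, as in the explain output above, the reading attribute still contains the full パーカー after stemming, which is why placing the reading form filter last brings the prolonged sound mark back.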

tatsuyaoiw · May 11, 2017, 2:18am

I see, that makes sense. Thank you!

