How can I correctly index @screen_name, #hashtag and url in Japanese text?

mpyw · September 10, 2018, 11:13am

What I need to do

Use kuromoji as the tokenizer and analyzer for Japanese full-text search
Recognize tokens that match /\b@\w{1,30}\b/ (screen_name)
Recognize tokens that match a URL pattern
Recognize tokens that match /\b[#＃][^\p{Zs}]{1,50}(?=\p{Zs}|$)/ (hashtag)
Highlight matches

How can I achieve all of them at once?
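To make the three token classes concrete, here is a minimal Python sketch that extracts them from a sample sentence outside Elasticsearch. Several things in it are assumptions rather than part of the setup above: the URL pattern is a simple placeholder (no URL pattern is given in the list), the `\p{Zs}` class is spelled out by hand because Python's `re` module has no `\p{...}` support, and the leading `\b` is replaced with a negative lookbehind, since `\b` directly before `@` or `#` only matches when a word character precedes it. `re.ASCII` is used so that `\w` stays ASCII-only instead of also matching Japanese characters.

```python
import re

# Unicode "Space Separator" (Zs) code points, written out by hand because
# Python's built-in `re` does not support \p{Zs}.
ZS = r"\u0020\u00a0\u1680\u2000-\u200a\u202f\u205f\u3000"

# Assumption: (?<!\w) stands in for the leading \b of the original patterns.
SCREEN_NAME = re.compile(r"(?<!\w)@\w{1,30}\b", re.ASCII)
HASHTAG = re.compile(
    r"(?<!\w)[#＃][^" + ZS + r"]{1,50}(?=[" + ZS + r"]|$)", re.ASCII
)
# Assumption: placeholder URL pattern (scheme plus printable ASCII, no spaces).
URL = re.compile(r"https?://[!-~]+")


def extract_entities(text):
    """Return the screen names, hashtags and URLs found in `text`."""
    return {
        "screen_names": SCREEN_NAME.findall(text),
        "hashtags": HASHTAG.findall(text),
        "urls": URL.findall(text),
    }


sample = "@mpyw さんの記事 https://example.com/a?b=1 を読んだ #日本語 #検索"
entities = extract_entities(sample)
print(entities)
```

These are the strings that would ideally survive analysis as single tokens, rather than being split by the kuromoji tokenizer.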

Current index definition

Roughly like this:

{
  "app": {
    "mappings": {
      "_doc": {
        "dynamic": "false",
        "_source": {
          "enabled": false
        },
        "properties": {
          "type": {
            "type": "keyword"
          },
          "text": {
            "type": "text",
            "store": true,
            "analyzer": "kuromoji_analyzer"
          }
        }
      }
    },
    "settings": {
      "index": {
        "number_of_shards": "1",
        "provided_name": "app",
        "creation_date": "1536550044056",
        "analysis": {
          "analyzer": {
            "kuromoji_analyzer": {
              "filter": [
                "kuromoji_baseform",
                "kuromoji_part_of_speech",
                "cjk_width",
                "stop",
                "ja_stop",
                "kuromoji_stemmer",
                "lowercase"
              ],
              "type": "custom",
              "tokenizer": "kuromoji_tokenizer_search"
            }
          },
          "tokenizer": {
            "kuromoji_tokenizer_search": {
              "mode": "search",
              "type": "kuromoji_tokenizer",
              "discard_punctuation": "true"
            }
          }
        },
        "number_of_replicas": "0",
        "uuid": "AYdPqQSZTjCBJkopeG1_5Q",
        "version": {
          "created": "6030299"
        }
      }
    }
  }
}
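One way to see how this analyzer currently breaks up a screen name, hashtag, or URL is the index-scoped `_analyze` API. The sketch below only builds the request and does not send it. The host `http://localhost:9200`, the helper name `build_analyze_request`, and the sample text are assumptions for illustration; the index name `app` and analyzer name `kuromoji_analyzer` come from the definition above.

```python
import json
import urllib.request


def build_analyze_request(base_url, index, analyzer, text):
    """Build (but do not send) a POST request to <index>/_analyze."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumption: a local single-node cluster; adjust the URL for your setup.
req = build_analyze_request(
    "http://localhost:9200",
    "app",
    "kuromoji_analyzer",
    "@mpyw さんの #日本語 https://example.com/",
)

# To run it against a live cluster, uncomment the lines below. Each entry
# in "tokens" carries the token text and its character offsets.
# with urllib.request.urlopen(req) as resp:
#     for tok in json.load(resp)["tokens"]:
#         print(tok["token"], tok["start_offset"], tok["end_offset"])
```

Comparing the returned tokens with the three patterns in the requirements shows which entities are being split apart or dropped, for example by `discard_punctuation`.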
