I'm currently using the thai tokenizer and want to preserve hashtag words, but the tokenizer keeps removing the hashtag symbol.
ES version: 7.16.2
GET /_analyze
{
"tokenizer": "thai",
"text": "#รายการพ #hashtag"
}
Response
{
"tokens": [
{
"token": "รายการ",
"start_offset": 1,
"end_offset": 7,
"type": "word",
"position": 0
},
{
"token": "พ",
"start_offset": 7,
"end_offset": 8,
"type": "word",
"position": 1
},
{
"token": "hashtag",
"start_offset": 10,
"end_offset": 17,
"type": "word",
"position": 2
}
]
}
I expect it to be like
{
"tokens": [
{
"token": "#รายการพ",
"start_offset": 0,
"end_offset": 8,
"type": "word",
"position": 0
},
{
"token": "#hashtag",
"start_offset": 9,
"end_offset": 17,
"type": "word",
"position": 1
}
]
}
I tried using a char_filter to replace each hashtag symbol with a placeholder before tokenization, and then a token filter to replace the placeholder back with the hashtag symbol. But the result is not what I expected.
{
"tokenizer": "thai",
"filter": [
{
"pattern": "hashtagplaceholder([^\\s*]+)",
"type": "pattern_replace",
"replacement": "#$1"
}
],
"char_filter": [
{
"pattern": "#([^\\s*]+)",
"type": "pattern_replace",
"replacement": "hashtagplaceholder$1"
}
],
"text": "#รายการพ #hashtag"
}
Response:
{
"tokens": [
{
"token": "hashtagplaceholder",
"start_offset": 0,
"end_offset": 7,
"type": "word",
"position": 0
},
{
"token": "รายการ",
"start_offset": 7,
"end_offset": 7,
"type": "word",
"position": 1
},
{
"token": "พ",
"start_offset": 7,
"end_offset": 8,
"type": "word",
"position": 2
},
{
"token": "#hashtag",
"start_offset": 9,
"end_offset": 17,
"type": "word",
"position": 3
}
]
}