I am working on building a custom analyzer that needs to implement a unique text processing workflow. Here’s a breakdown of the required steps:
- Tokenize the input text based on whitespace.
- Stem each token using an English stemmer.
- Detokenize the tokens by merging them back into a single string.
- Re-tokenize the resulting string based on the presence of hash ('#') characters.
Example Process:
- Input Text: "athletic shoes # running shoes"
- After Step 1 (Tokenize by whitespace): ["athletic", "shoes", "#", "running", "shoes"]
- After Step 2 (Stemming): ["athlet", "shoe", "#", "run", "shoe"]
- After Step 3 (Detokenize): "athlet shoe # run shoe"
- After Step 4 (Tokenize by '#'): ["athlet shoe ", "run shoe"]
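To make the target behavior concrete, here is a minimal sketch of the four-step workflow outside Elasticsearch. The `toy_stem` function is a hypothetical stand-in for a real English stemmer (it only strips a few suffixes so the example above works); a production version would use an actual stemming library.

```python
def toy_stem(token: str) -> str:
    # Toy stand-in for an English stemmer -- NOT a real Porter stemmer.
    # Strips a few suffixes just to reproduce the example in this post.
    for suffix in ("ning", "ing", "ic", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def process(text: str) -> list[str]:
    # Step 1: tokenize on whitespace
    tokens = text.split()
    # Step 2: stem each token
    stemmed = [toy_stem(t) for t in tokens]
    # Step 3: detokenize back into a single string
    joined = " ".join(stemmed)
    # Step 4: re-tokenize on '#' (stripping surrounding whitespace)
    return [part.strip() for part in joined.split("#")]


print(process("athletic shoes # running shoes"))
# ['athlet shoe', 'run shoe']
```

This is easy as a pre-processing step; the open question is whether steps 3 and 4 can be expressed inside an Elasticsearch analyzer chain, since analyzers normally flow one way from tokenizer to token filters.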
Is it possible to achieve step 3 (detokenize) using Elasticsearch's built-in functionality?
Since what I'm trying to do is non-linear text processing, I'm not sure whether it can be done with Elasticsearch's built-in functionality. I'm asking here before I start developing a custom plugin or pre-processing the text outside of Elasticsearch.