I am working on building a custom analyzer that needs to implement a unique text processing workflow. Here’s a breakdown of the required steps:
- Tokenize the input text based on whitespace.
- Stem each token using an English stemmer.
- Detokenize the tokens by merging them back into a single string.
- Re-tokenize the resulting string based on the presence of hash ('#') characters.
Example Process:
- Input Text: "athletic shoes # running shoes"
- After Step 1 (Tokenize by whitespace): ["athletic", "shoes", "#", "running", "shoes"]
- After Step 2 (Stemming): ["athlet", "shoe", "#", "run", "shoe"]
- After Step 3 (Detokenize): "athlet shoe # run shoe"
- After Step 4 (Tokenize by '#'): ["athlet shoe ", "run shoe"]
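To make the target behavior concrete, here is a minimal sketch of the four-step workflow outside Elasticsearch. The `toy_stem` function is a hypothetical stand-in for a real English stemmer (it only strips a few suffixes so the example above works); a production version would use an actual stemming library.

```python
def toy_stem(token: str) -> str:
    # Toy stand-in for an English stemmer -- NOT a real Porter stemmer.
    # Strips a few suffixes just to reproduce the example in this post.
    for suffix in ("ning", "ing", "ic", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def process(text: str) -> list[str]:
    # Step 1: tokenize on whitespace
    tokens = text.split()
    # Step 2: stem each token
    stemmed = [toy_stem(t) for t in tokens]
    # Step 3: detokenize back into a single string
    joined = " ".join(stemmed)
    # Step 4: re-tokenize on '#' (stripping surrounding whitespace)
    return [part.strip() for part in joined.split("#")]


print(process("athletic shoes # running shoes"))
# ['athlet shoe', 'run shoe']
```

This is easy as a pre-processing step; the open question is whether steps 3 and 4 can be expressed inside an Elasticsearch analyzer chain, since analyzers normally flow one way from tokenizer to token filters.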
Is it possible to achieve step 3 (detokenize) using Elasticsearch's built-in functionality?
Since what I'm trying to do is non-linear text processing, I'm not sure whether it can be done with Elasticsearch's built-in functionality. I'm asking here before I start developing a custom plugin or pre-processing the text outside of Elasticsearch.