Custom Elasticsearch Analyzer: Tokenize and Detokenize Text Processing

I am working on building a custom analyzer that needs to implement a unique text processing workflow. Here’s a breakdown of the required steps:

  1. Tokenize the input text based on whitespace.

  2. Stem each token using an English stemmer.

  3. Detokenize the tokens by merging them back into a single string.

  4. Re-tokenize the resulting string based on the presence of hash ('#') characters.

Example Process:

  • Input Text: "athletic shoes # running shoes"

  • After Step 1 (Tokenize by whitespace): ["athletic", "shoes", "#", "running", "shoes"]

  • After Step 2 (Stemming): ["athlet", "shoe", "#", "run", "shoe"]

  • After Step 3 (Detokenize): "athlet shoe # run shoe"

  • After Step 4 (Tokenize by '#'): ["athlet shoe ", "run shoe"]
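For context, steps 1 and 2 on their own seem to map onto built-in pieces (a whitespace tokenizer plus an english stemmer filter). A rough sketch of the index settings I have in mind is below; the index and analyzer names are just placeholders. It's step 3 that I can't find a token filter for.

```python
import json

# Rough sketch of index settings covering only steps 1 and 2:
# whitespace tokenization followed by English stemming.
# "stem_analyzer" / "english_stemmer" are placeholder names.
settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "stem_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",      # step 1
                    "filter": ["english_stemmer"],  # step 2
                }
            },
            "filter": {
                "english_stemmer": {
                    "type": "stemmer",
                    "language": "english",
                }
            },
        }
    }
}

# These settings would go in the body of the index-creation request, e.g.:
#   requests.put("http://localhost:9200/my-index", json=settings)
print(json.dumps(settings, indent=2))
```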

Is it possible to achieve step 3 (detokenization) using Elasticsearch's built-in functionality?

Since what I'm trying to do is non-linear text processing, I'm not sure whether it can be done with Elasticsearch's built-in analysis components. I'm asking here before I start developing a custom plugin or pre-processing the text outside of Elasticsearch.
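To make that fallback concrete, the out-of-Elasticsearch pre-processing I have in mind would look roughly like the sketch below. It uses NLTK's Snowball English stemmer as a stand-in for Elasticsearch's english stemmer, which is an assumption on my part; I haven't verified that the two always produce identical tokens.

```python
from nltk.stem.snowball import SnowballStemmer

def process(text: str) -> list[str]:
    stemmer = SnowballStemmer("english")  # stand-in for ES's "english" stemmer

    # Step 1: tokenize on whitespace.
    tokens = text.split()

    # Step 2: stem each token ("#" passes through unchanged).
    stemmed = [stemmer.stem(tok) for tok in tokens]

    # Step 3: detokenize back into a single string.
    rejoined = " ".join(stemmed)

    # Step 4: re-tokenize on '#' (surrounding whitespace is kept as-is here).
    return rejoined.split("#")

print(process("athletic shoes # running shoes"))
# -> ['athlet shoe ', ' run shoe']
```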