Configuring tokenization in Elasticsearch

Hi all, I am having difficulty getting ES to tokenize text in a specific way.

Suppose I have a text like "Young Donor Plasma Transfusion and Age-Related Biomarkers".

In my synonyms file I have:

plasma transfusions,transfusion plasma,plasma transfusion,transfusions plasma => OTHERS_ID_996981,
plasma donor,donors plasma,donor plasma => OTHERS_ID_922024,
transfusion,transfusing blood products,transfusions,transfusion blood,blood transfusion => OTHERS_ID_1383248

The output I get from elyzer is this:

TOKENIZER: iplexus_tokenizer
{0:Young}	{1:Donor}	{2:Plasma}	{3:Transfusion}	{4:and}	{5:Age}	{6:Related} {7:Biomarkers}	

TOKEN_FILTER: lowercase
{0:young}	{1:donor}	{2:plasma}	{3:transfusion}	{4:and}	{5:age}	{6:related}	{7:biomarkers}	

TOKEN_FILTER: synonym_rule
{0:young}	{1:OTHERS_ID_922024,OTHERSCLASS}	{2:OTHERS_ID_1383248,OTHERSCLASS}	{3:and}	{4:age}	{5:related}	{6:OTHERS_ID_418498,OTHERSCLASS}

Here we see that the keywords "donor plasma" and "plasma transfusion" overlap: both contain the token "plasma".

Elasticsearch only matches "donor plasma" and "transfusion"; "plasma transfusion" is never matched.

Is there a way to make ES emit tokens for overlapping keywords when they match a synonym rule?

Here I expect the tokens to be:
"OTHERS_ID_922024" (donor plasma), "OTHERS_ID_996981" (plasma transfusion), "OTHERS_ID_1383248" (transfusion)
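To make the difference concrete, here is a small Python sketch of the two behaviours. It is only a toy model and not Lucene's actual synonym filter code: it assumes the filter matches left to right, prefers the longest rule, and consumes the tokens it matched, which is consistent with the elyzer output above. The rule table is a hypothetical subset containing only the singular forms from my synonyms file, and the function names are made up for illustration.

```python
# Toy model for illustration only; NOT Lucene's SynonymFilter implementation.
# Assumption: matching is greedy (longest rule first) and non-overlapping.
RULES = {
    ("plasma", "transfusion"): "OTHERS_ID_996981",
    ("donor", "plasma"): "OTHERS_ID_922024",
    ("transfusion",): "OTHERS_ID_1383248",
}
MAX_LEN = max(len(key) for key in RULES)


def greedy_match(tokens):
    """Left to right, longest rule first; matched tokens are consumed."""
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in RULES:
                out.append(RULES[key])
                i += n  # skip past the consumed tokens
                break
        else:
            out.append(tokens[i])  # no rule starts here, keep the token
            i += 1
    return out


def overlapping_match(tokens):
    """Collect every rule that matches at any position, overlaps included."""
    hits = []
    for i in range(len(tokens)):
        for n in range(1, MAX_LEN + 1):
            key = tuple(tokens[i:i + n])
            if len(key) == n and key in RULES:
                hits.append(RULES[key])
    return hits


tokens = "young donor plasma transfusion and age related biomarkers".split()
print(greedy_match(tokens))       # what I currently get (for these rules)
print(overlapping_match(tokens))  # what I would like to get
```

In the greedy version "plasma" is consumed by "donor plasma", so "plasma transfusion" can never match; the overlapping version reports all three IDs.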

The analyzer settings are as follows; the synonym file contains the rules shown above:

{"settings": {
    "analysis": {
     "analyzer": {
        "analyzer_search": {
           "type": "custom",
           "tokenizer": "iplexus_tokenizer",
           "filter": [
              "lowercase",
              "synonym_rule"
           ]
        },
        "analyzer_q": {
           "type": "custom",
           "tokenizer": "iplexus_tokenizer",
           "filter": [
              "lowercase",
              "synonym_rule_q"
           ]
        },
        "analyzer_summary": {
           "type": "custom",
           "tokenizer": "iplexus_tokenizer",
           "filter": [
              "lowercase",
              "synonym_rule",
              "biomedical_concept"
           ]
        }
     },
     "tokenizer": {
        "iplexus_tokenizer": {
           "type": "pattern",
           "pattern": "[^a-zA-Z0-9\\p{InGreek}\\p{No}\\p{Lm}\\+\\−]",
           "max_token_length": "256"
        }
     },
     "filter": {
        "synonym_rule": {
           "type": "synonym",
           "synonyms_path": "synonyms_iplexus_index_v18.txt"
        },
        "synonym_rule_q": {
           "type": "synonym",
           "synonyms_path": "synonyms_iplexus_query_v18.txt"
        }
     }
  }
 }
}
