Synonym_graph token filter and synonyms containing commas

I am trying to create an index that uses a synonym_graph token filter to search for synonyms of chemical compounds.

For instance, when I search for "benzene", I also want to find sentences containing the following:
"benzol, polystream, benzin, carbon oil, cyclohex-1,3,5-triene"

To accomplish that, I've created the following index:

PUT /test_index_compounds
{
   "mappings": {
      "properties": {
        "text": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          },
          "analyzer": "standard",
          "search_analyzer": "synonym"
        }
      }
    },
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "synonym": {
            "tokenizer": "whitespace",
            "filter": [ "synonym" ]
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym_graph",
            "synonyms_path": "analysis/compound_synonyms.txt"
          }
        }
      }
    }
  }
}

The compound_synonyms.txt file contains:

benzene => benzol,polystream,benzin,carbon oil, cyclohex-1,3,5-triene

This works well for sentences containing words like benzol, polystream, carbon oil, etc., but synonyms containing commas cause a problem. In the above example, cyclohex-1, 3 and 5-triene are treated as three separate synonyms. I think the commas should be escaped in one way or another, but I can't figure out how to do it.

I've tried:

cyclohex-1,3,5-triene
cyclohex-1\,3\,5-triene
"cyclohex-1,3,5-triene"

but none of them seems to help. When I add this document:

POST /test_index_compounds/_doc 
{
  "text" : "this is a test with cyclohex-1,3,5-triene"
}

it is not returned when executing this query:

GET /test_index_compounds/_search
{
  "query": {
    "match": {
      "text": {
        "query": "benzene"
      }
    }
  }
}

Is there any other way to get this to work?

Hi @erikNL.

test_index_compounds != test_index_compounds-2/

Try this test:

"synonym": {
  "type": "synonym_graph",
  "synonyms": ["benzene => benzol,polystream,benzin,carbon oil, cyclohex-1\\,3\\,5-triene"]
}
GET test_index_compounds/_analyze
{
  "text": ["benzene"],
  "analyzer": "synonym"
}

Response:

...
{
      "token": "cyclohex-1,3,5-triene",
      "start_offset": 0,
      "end_offset": 7,
      "type": "SYNONYM",
      "position": 0,
      "positionLength": 2
    }
...

test_index_compounds != test_index_compounds-2/

Sorry, that was a typo.

When I use this instead of a file:

"synonym": {
  "type": "synonym_graph",
  "synonyms": ["benzene => benzol,polystream,benzin,carbon oil, cyclohex-1\\,3\\,5-triene"]
}

and run the _analyze command, I indeed see the "cyclohex-1,3,5-triene" synonym.
When I run the _analyze command using the file, I see:

{
      "token" : """cyclohex-1\""",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "SYNONYM",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : """3\""",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "SYNONYM",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "5-triene",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "SYNONYM",
      "position" : 0,
      "positionLength" : 2
    },

So there's something going wrong there. For testing purposes, I can use the array in the create index call, but in the end I would like to use the file.

When I use the array in the create index command, the synonym is returned when running the _analyze command, but when I add the sample sentence from above to the index and execute the search query, the document is still not returned.

Be careful with using different tokenization at index and at search time here. The standard analyzer at index time will index a term like "cyclohex-1,3,5-triene" as three tokens:

{
      "token": "cyclohex",
      "start_offset": 20,
      "end_offset": 28,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "1,3,5",
      "start_offset": 29,
      "end_offset": 34,
      "type": "<NUM>",
      "position": 6
    },
    {
      "token": "triene",
      "start_offset": 35,
      "end_offset": 41,
      "type": "<ALPHANUM>",
      "position": 7
    }

You can check this with:

POST /test_index_compounds/_analyze
{
  "analyzer": "standard", 
  "text" : "this is a test with cyclohex-1,3,5-triene"
}

On the other hand, the whitespace tokenizer alone leaves the commas attached to the words in the search analyzer:

POST /test_index_compounds/_analyze
{
  "analyzer": "synonym", 
  "text" : "this is a test, with benzol, bezin, but also with with cyclohex-1,3,5-triene"
}

will give you tokens like "benzol," or "benzin," with the trailing commas attached, which is not ideal either. It will also keep "cyclohex-1,3,5-triene" as one token, which then won't match the three tokens the standard analyzer produced at index time. I think the tokenization strategy needs to be thought through here before tackling the synonyms.
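
If it helps, you can compare candidate tokenizers directly with the _analyze API, without creating an index first, by naming the tokenizer in the request. A quick sketch (the classic tokenizer is just one candidate to try):

POST /_analyze
{
  "tokenizer": "classic",
  "text" : "this is a test, with benzol, benzin, but also with cyclohex-1,3,5-triene"
}

Swapping "whitespace" or "standard" in for "classic" shows how each one splits the compound name and whether it strips the trailing commas.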

This might work as you expect when run from the Kibana console; in a file, I guess you would need to escape the commas differently to keep them from being interpreted as separators between synonyms:

DELETE /test_index_compounds

PUT /test_index_compounds
{
   "mappings": {
      "properties": {
        "text": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          },
          "analyzer": "standard",
          "search_analyzer": "synonym"
        }
      }
    },
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "synonym": {
            "tokenizer": "standard",
            "filter": [ "synonym" ]
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym_graph",
            "synonyms": ["benzene => benzol,polystream,benzin,carbon oil, cyclohex-\\1\\,3\\,5-triene"]
          }
        }
      }
    }
  }
}

POST /test_index_compounds/_doc 
{
  "text" : "this is a test with cyclohex-1,3,5-triene"
}



POST /test_index_compounds/_search
{
  "query": {"match": {
    "text": "benzene"
  }}
}

POST /test_index_compounds/_analyze
{
  "analyzer": "synonym", 
  "text" : "benzene"
}

Thanks for the suggestions. With the pointers here I was able to get a working solution.

In the synonym file, commas should be escaped with a single backslash, so a rule looks like this, for example:

1-aminopropan-2-ol => 2-propanol\, 1-amino-
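
For comparison, the same rule defined inline in the index settings needs a double backslash, because the JSON string consumes one level of escaping before the synonym parser sees the rule (a sketch using the rule above):

"synonyms": ["1-aminopropan-2-ol => 2-propanol\\, 1-amino-"]

That also explains why the inline array earlier in this thread worked with \\, while the file only needs \,.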

Furthermore, I switched to the classic tokenizer instead of the whitespace tokenizer, so terms directly followed by a comma no longer pose a problem.

I also added a lowercase filter to prevent uppercase/lowercase mismatches.

To top it off, I added a lemmagen filter, but that basically doesn't have anything to do with the synonyms.

My index now looks like this:

PUT /test_index_compounds
{
   "mappings":{
      "properties":{
         "sentence":{
            "type":"text",
            "analyzer":"classic_lowercase_analyser",
            "search_analyzer":"whitespace_with_synonyms_analyser"
         }
      }
   },
   "settings":{
      "index":{
         "analysis":{
            "analyzer":{
              "whitespace_with_synonyms_analyser":{
                  "tokenizer":"classic",
                  "filter":[
                     "synonym_compound_filter",
                     "lemmagen_filter_en"
                  ]
               },
               "classic_lowercase_analyser":{
                  "tokenizer":"classic",
                  "filter":[
                     "lowercase",
                     "lemmagen_filter_en"
                  ]
               }
            },
            "filter":{
              "synonym_compound_filter":{
                  "type":"synonym_graph",
                  "synonyms_path":"analysis/compound_synonyms/compound_and_effect_synonys.txt",
                  "updateable":true,
               },
               "lemmagen_filter_en": {
            "type": "lemmagen",
            "lexicon": "en"
          }
            }
         }
      }
   }
}
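
Since the filter is marked "updateable": true (which also requires it to be used in a search analyzer only, as it is here), changes to the synonyms file can be picked up without reindexing. A sketch of the reload-and-verify round trip, assuming the 1-aminopropan-2-ol rule from above is in the file:

POST /test_index_compounds/_reload_search_analyzers

GET /test_index_compounds/_analyze
{
  "analyzer": "whitespace_with_synonyms_analyser",
  "text": "1-aminopropan-2-ol"
}

The _analyze response should then show the tokens for "2-propanol, 1-amino-" in place of the original term.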