Synonym_graph token filter and synonyms containing commas

I am trying to create an index that uses a synonym_graph token filter to search for synonyms of chemical compounds.

For instance, when I search for "benzene", I also want to find sentences containing the following:
"benzol, polystream, benzin, carbon oil, cyclohex-1,3,5-triene"

To accomplish that, I've created the following index:

PUT /test_index_compounds
{
   "mappings": {
      "properties": {
        "text": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          },
          "analyzer": "standard",
          "search_analyzer": "synonym"
        }
      }
    },
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "synonym": {
            "tokenizer": "whitespace",
            "filter": [ "synonym" ]
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym_graph",
            "synonyms_path": "analysis/compound_synonyms.txt"
          }
        }
      }
    }
  }
}

The compound_synonyms.txt file contains:

benzene => benzol,polystream,benzin,carbon oil, cyclohex-1,3,5-triene

This works well for sentences containing words like benzol, polystream, carbon oil, etc., but synonyms containing commas cause a problem. In the above example, cyclohex-1, 3 and 5-triene are treated as three separate synonyms. I think the commas should be escaped in one way or another, but I can't figure out how to do it.

I've tried:

cyclohex-1,3,5-triene
cyclohex-1\,3\,5-triene
"cyclohex-1,3,5-triene"

but none of them seems to help. When I add this document:

POST /test_index_compounds/_doc 
{
  "text" : "this is a test with cyclohex-1,3,5-triene"
}

it is not returned when executing this query:

GET /test_index_compounds/_search
{
  "query": {
    "match": {
      "text": {
        "query": "benzene"
      }
    }
  }
}

Is there any other way to get this to work?

Hi @erikNL.

test_index_compounds != test_index_compounds-2/

Try this test:

"synonym": {
  "type": "synonym_graph",
  "synonyms": ["benzene => benzol,polystream,benzin,carbon oil, cyclohex-1\\,3\\,5-triene"]
}
GET test_index_compounds/_analyze
{
  "text": ["benzene"],
  "analyzer": "synonym"
}

Response:

...
{
      "token": "cyclohex-1,3,5-triene",
      "start_offset": 0,
      "end_offset": 7,
      "type": "SYNONYM",
      "position": 0,
      "positionLength": 2
    }
...

test_index_compounds != test_index_compounds-2/

Sorry, that was a typo.

When I use this instead of a file:

"synonym": {
  "type": "synonym_graph",
  "synonyms": ["benzene => benzol,polystream,benzin,carbon oil, cyclohex-1\\,3\\,5-triene"]
}

and run the _analyze command, I indeed see the "cyclohex-1,3,5-triene" synonym.
When I run the _analyze command using the file, I see:

{
      "token" : """cyclohex-1\""",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "SYNONYM",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : """3\""",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "SYNONYM",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "5-triene",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "SYNONYM",
      "position" : 0,
      "positionLength" : 2
    },

So there's something going wrong there. For testing purposes, I can use the array in the create index call, but in the end I would like to use the file.

When I use the array in the create index command, the synonym is returned when running the _analyze command, but when I add the sample sentence from above to the index and execute the search query, the document is still not returned.

Be careful with using different tokenization at index and at search time here. The standard analyzer at index time will index a term like "cyclohex-1,3,5-triene" as three tokens:

{
      "token": "cyclohex",
      "start_offset": 20,
      "end_offset": 28,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "1,3,5",
      "start_offset": 29,
      "end_offset": 34,
      "type": "<NUM>",
      "position": 6
    },
    {
      "token": "triene",
      "start_offset": 35,
      "end_offset": 41,
      "type": "<ALPHANUM>",
      "position": 7
    }

You can check this with:

POST /test_index_compounds/_analyze
{
  "analyzer": "standard", 
  "text" : "this is a test with cyclohex-1,3,5-triene"
}

On the other hand, the whitespace tokenizer alone leaves the commas attached to the words in the search analyzer:

POST /test_index_compounds/_analyze
{
  "analyzer": "synonym", 
  "text" : "this is a test, with benzol, bezin, but also with with cyclohex-1,3,5-triene"
}

will give you tokens like "benzol," or "benzin," with the trailing commas attached, which is not ideal either. It will also keep "cyclohex-1,3,5-triene" as one token, which then won't match the three tokens the standard analyzer produced at index time. I think the tokenization strategy needs to be thought through here before tackling the synonyms.
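
If it helps, you can compare candidate tokenizers directly with the _analyze API, without creating an index first, by naming the tokenizer in the request. A quick sketch (the classic tokenizer is just one candidate to try):

POST /_analyze
{
  "tokenizer": "classic",
  "text" : "this is a test, with benzol, benzin, but also with cyclohex-1,3,5-triene"
}

Swapping "whitespace" or "standard" in for "classic" shows how each one splits the compound name and whether it strips the trailing commas.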

This might work as you expect when run from the Kibana console; in a file, I guess you would need to escape the commas differently to keep them from being interpreted as separators between synonyms:

DELETE /test_index_compounds

PUT /test_index_compounds
{
   "mappings": {
      "properties": {
        "text": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          },
          "analyzer": "standard",
          "search_analyzer": "synonym"
        }
      }
    },
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "synonym": {
            "tokenizer": "standard",
            "filter": [ "synonym" ]
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym_graph",
            "synonyms": ["benzene => benzol,polystream,benzin,carbon oil, cyclohex-\\1\\,3\\,5-triene"]
          }
        }
      }
    }
  }
}

POST /test_index_compounds/_doc 
{
  "text" : "this is a test with cyclohex-1,3,5-triene"
}



POST /test_index_compounds/_search
{
  "query": {"match": {
    "text": "benzene"
  }}
}

POST /test_index_compounds/_analyze
{
  "analyzer": "synonym", 
  "text" : "benzene"
}

Thanks for the suggestions. With the pointers here I was able to get a working solution.

In the synonym file, commas should be escaped with a single backslash, so a rule looks like this, for example:

1-aminopropan-2-ol => 2-propanol\, 1-amino-
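
For comparison, the same rule defined inline in the index settings needs a double backslash, because the JSON string consumes one level of escaping before the synonym parser sees the rule (a sketch using the rule above):

"synonyms": ["1-aminopropan-2-ol => 2-propanol\\, 1-amino-"]

That also explains why the inline array earlier in this thread worked with \\, while the file only needs \,.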

Furthermore, I switched to the classic tokenizer instead of the whitespace tokenizer, so terms directly followed by a comma no longer pose a problem.

I also added a lowercase filter to prevent uppercase/lowercase mismatches.

To top it off, I added a lemmagen filter, but that basically doesn't have anything to do with the synonyms.

My index now looks like this:

PUT /test_index_compounds
{
   "mappings":{
      "properties":{
         "sentence":{
            "type":"text",
            "analyzer":"classic_lowercase_analyser",
            "search_analyzer":"whitespace_with_synonyms_analyser"
         }
      }
   },
   "settings":{
      "index":{
         "analysis":{
            "analyzer":{
              "whitespace_with_synonyms_analyser":{
                  "tokenizer":"classic",
                  "filter":[
                     "synonym_compound_filter",
                     "lemmagen_filter_en"
                  ]
               },
               "classic_lowercase_analyser":{
                  "tokenizer":"classic",
                  "filter":[
                     "lowercase",
                     "lemmagen_filter_en"
                  ]
               }
            },
            "filter":{
              "synonym_compound_filter":{
                  "type":"synonym_graph",
                  "synonyms_path":"analysis/compound_synonyms/compound_and_effect_synonys.txt",
                  "updateable":true,
               },
               "lemmagen_filter_en": {
            "type": "lemmagen",
            "lexicon": "en"
          }
            }
         }
      }
   }
}
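
Since the filter is marked "updateable": true (which also requires it to be used in a search analyzer only, as it is here), changes to the synonyms file can be picked up without reindexing. A sketch of the reload-and-verify round trip, assuming the 1-aminopropan-2-ol rule from above is in the file:

POST /test_index_compounds/_reload_search_analyzers

GET /test_index_compounds/_analyze
{
  "analyzer": "whitespace_with_synonyms_analyser",
  "text": "1-aminopropan-2-ol"
}

The _analyze response should then show the tokens for "2-propanol, 1-amino-" in place of the original term.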