I am trying to create an index that uses a synonym_graph token filter to search for synonyms of chemical compounds.
For instance, when I search for "benzene", I also want to find sentences containing the following:
"benzol, polystream, benzin, carbon oil, cyclohex-1,3,5-triene"
To accomplish that, I've created the following index:
PUT /test_index_compounds
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        },
        "analyzer": "standard",
        "search_analyzer": "synonym"
      }
    }
  },
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "synonym": {
            "tokenizer": "whitespace",
            "filter": [ "synonym" ]
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym_graph",
            "synonyms_path": "analysis/compound_synonyms.txt"
          }
        }
      }
    }
  }
}
The compound_synonyms.txt file contains:
benzene => benzol,polystream,benzin,carbon oil, cyclohex-1,3,5-triene
This works well for sentences containing terms like benzol, polystream, carbon oil, etc., but synonyms that contain commas cause a problem. In the example above, cyclohex-1, 3, and 5-triene are treated as separate synonyms. I think the commas need to be escaped one way or another, but I can't figure out how to do it.
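To see how the rule is actually expanded at search time, the _analyze API can be run against the index's "synonym" analyzer. This is only a diagnostic sketch, assuming the index above has been created and compound_synonyms.txt has been loaded:

```
GET /test_index_compounds/_analyze
{
  "analyzer": "synonym",
  "text": "benzene"
}
```

The returned "tokens" array shows the terms (with their positions) that the query is expanded into, which makes it possible to check whether cyclohex-1,3,5-triene survives as a single token.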
I've tried:
cyclohex-1,3,5-triene
cyclohex-1\,3\,5-triene
"cyclohex-1,3,5-triene"
but none of them seems to help. When I add this document:
POST /test_index_compounds/_doc
{
  "text": "this is a test with cyclohex-1,3,5-triene"
}
it is not returned when executing this query:
GET /test_index_compounds/_search
{
  "query": {
    "match": {
      "text": {
        "query": "benzene"
      }
    }
  }
}
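The _analyze API can also show how the document text is tokenized at index time, which may matter here because the "text" field is indexed with the standard analyzer while queries go through the "synonym" analyzer. A minimal sketch, assuming the index above exists; passing "field" makes _analyze use the analyzer configured for that field:

```
GET /test_index_compounds/_analyze
{
  "field": "text",
  "text": "this is a test with cyclohex-1,3,5-triene"
}
```

Comparing these tokens with the terms that the search analyzer produces for "benzene" should show whether the two sides can match at all.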
Is there any other way to get this to work?