Equivocations and stop words

What is the recommended way to configure analysis so that a stop word filter
does not swallow equivocations, that is, words that share their spelling with a stop word but mean something else?

For example, German has the conjunction "oder" (English "or") and the noun "Oder" (a river). Lowercasing and then applying a stop word filter at index time would normally eliminate every reference to the river, so queries for the river "Oder" would yield no hits.

My current solution is an analyzer configuration that, as a first analysis step, translates equivocations into artificial tokens via a synonym filter. Stop words are still discarded, but the equivocations remain searchable.
At query time a different analyzer is used, one without a stop word filter. To find documents that reference the river "Oder", the term a user knows and would type has to be mapped to the artificial placeholder token. This is also done by a synonym filter.
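To make the two chains easier to follow, here is a minimal Python sketch that imitates them outside Elasticsearch. It is only an illustration under assumptions: the dictionaries stand in for the three files under config/analysis/, the regex split is a rough substitute for the standard tokenizer, the stemmer is left out, and the function names are hypothetical.

```python
import re

# Hypothetical stand-ins for the files under config/analysis/.
EQUIVOCATION_SYNONYMS = {"Oder": "flussoder"}  # applied before lowercasing
GERMAN_STOP = {"oder"}
SEARCH_SYNONYMS = {"oder": "flussoder"}        # applied after lowercasing


def tokenize(text):
    """Rough stand-in for the standard tokenizer: split on non-word characters."""
    return re.findall(r"\w+", text)


def index_analyze(text):
    """Index-time chain: synonym (case-sensitive) -> lowercase -> stop."""
    tokens = [EQUIVOCATION_SYNONYMS.get(t, t) for t in tokenize(text)]
    tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in GERMAN_STOP]


def search_analyze(text):
    """Search-time chain: lowercase -> synonym, with no stop filter."""
    tokens = [t.lower() for t in tokenize(text)]
    return [SEARCH_SYNONYMS.get(t, t) for t in tokens]


def matches(query, document):
    """Treat a query as a hit when any query token was indexed for the document."""
    return bool(set(search_analyze(query)) & set(index_analyze(document)))
```

With the two example documents from this post, `matches("oder", ...)` hits only the sentence about the river. The sketch also exposes a limitation of the approach: a conjunction capitalized at the start of a sentence ("Oder ...") would be rewritten to the placeholder as well.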

Is there a simpler solution for this kind of problem? Is it possible to replace tokens based on a token map? (I'd like to avoid using a script.)

PUT equivocations
{
  "settings": {
    "analysis": {
      "filter": {
        "german_stop": {
          "type": "stop",
          "stopwords_path": "analysis/german_stop.txt"
        },
        "custom_german_stemmer": {
          "type": "stemmer",
          "name": "light_german"
        },
        "equivocation_synonyms": {
          "type": "synonym",
          "synonyms_path": "analysis/equivocation_synonyms.txt"
        },
        "german_synonyms_search": {
          "type": "synonym",
          "synonyms_path": "analysis/german_synonyms_search.txt"
        }
      },
      "analyzer": {
        "custom_german": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "equivocation_synonyms",
            "lowercase",
            "german_stop",
            "custom_german_stemmer"
          ]
        },
        "custom_german_search": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "german_synonyms_search",
            "custom_german_stemmer"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_text": {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "custom_german",
          "search_analyzer": "custom_german_search"
        }
      }
    }
  }
}
POST _bulk
{ "index" : {"_id" : "1", "_type" : "my_text", "_index" : "equivocations"}}
{"content" : "Schwimmen ist in der Oder verboten"}
{ "index" : {"_id" : "2", "_type" : "my_text", "_index" : "equivocations"}}
{"content" : "Er konnte sich nicht entscheiden, ob er Currywurst oder Pizza essen sollte."}
GET equivocations/my_text/_search
{
    "query": {
        "match" : {
            "content" : {
                "query" : "oder"
            }
        }
    }
}

config/analysis/equivocation_synonyms.txt

Oder => flussoder

config/analysis/german_stop.txt

oder

config/analysis/german_synonyms_search.txt

oder => flussoder
