Elasticsearch 1.4 - Doesn't match multi-word synonyms exactly


#1

Hi! I indexed a document which contains the words "contrat à durée déterminée". I want every document containing those exact words to match when I search for the acronym "cdd".

Here are my index settings:

'analysis' => array(
    'analyzer' => array(
        'indexAnalyzer' => array(
            'type' => 'custom',
            'tokenizer' => 'nGram',
            'filter' => array('asciifolding', 'lowercase', 'synonym', 'snowball', 'elision', 'worddelimiter', 'stopwords'),
        ),
        'searchAnalyzer' => array(
            'type' => 'custom',
            'tokenizer' => 'standard',
            'filter' => array('asciifolding', 'lowercase', 'elision', 'worddelimiter', 'synonym', 'stopwords'),
        ),
        ...
    ),
    'tokenizer' => array(
        'nGram' => array(
            'type' => 'nGram',
            'min_gram' => 3,
            'max_gram' => 20,
            'token_chars' => array('letter', 'digit'),
        ),
    ),
    'filter' => array(
        'synonym' => array(
            'tokenizer' => 'keyword',
            'type' => 'synonym',
            'synonyms_path' => sfConfig::get('app_elasticsearch_path_synonym'),
            'ignore_case' => true,
        ),
       ...
    ),
), 

The synonyms file contains the line "CDD,Contrat à Durée Déterminée".
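As a rough illustration of how such a line might be read when the synonym filter uses the `keyword` tokenizer with `ignore_case`, here is a toy Python sketch. It is not Lucene's parser; the function name `parse_synonym_line` is hypothetical, and the only assumptions are that entries are split on commas, lowercased, and each kept as a single token regardless of how many words it holds.

```python
def parse_synonym_line(line, ignore_case=True):
    """Split one comma-separated synonym rule into its entries.

    Toy model of a keyword-style tokenizer: each entry stays a single
    token, however many words it contains.
    """
    entries = [entry.strip() for entry in line.split(",")]
    if ignore_case:
        # Mirrors the 'ignore_case' => true setting in this toy model.
        entries = [entry.lower() for entry in entries]
    return entries


tokens = parse_synonym_line("CDD,Contrat à Durée Déterminée")
print(tokens)
```

Under these assumptions the multi-word entry is one token containing spaces, not four separate word tokens.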

And here is part of my index mapping:

"idea": {
  "properties": {
     "initial_situation": {
        "properties": {
           "search": {
              "type": "string",
              "analyzer": "searchAnalyzer",
              "include_in_all": true
           }
           ...
        }
     },
     "proposed_solution": {
        "properties": {
           "search": {
              "type": "string",
              "analyzer": "searchAnalyzer",
              "include_in_all": true
           }
           ...
        }
     },
     ...
  }
}

A sample document:

{
  "_index": "clic",
  "_type": "idea",
  "_id": "3863",
  "_score": 0.030160192,
  "_source": {
     "id": "3863",
     "title": {
        "name": "Lorem ipsum ...",
        ...
     },
     "proposed_solution": {
        "search": "Lorem ... contrat à durée déterminée, Lorem ...",
        ...
     },
     ...
  }
}                

When I use the analyze API like this: GET /clic/_analyze?analyzer=searchAnalyzer&text=cdd

It outputs the correct synonyms:

{
   "tokens": [
      {
         "token": "cdd",
         "start_offset": 0,
         "end_offset": 3,
         "type": "SYNONYM",
         "position": 1
      },
      {
         "token": "contrat à durée déterminée",
         "start_offset": 0,
         "end_offset": 3,
         "type": "SYNONYM",
         "position": 1
      }
   ]
}

So far, this looks correct to me. In addition, when I use the validate API to explain my query like this:

GET clic/idea/_validate/query?explain
{
   "query": {
      "filtered": {
         "query": {
            "bool": {
               "should": [
                  {
                     "multi_match": {
                        "query": "cdd",
                        "type": "cross_fields",
                        "fields": [
                           "title.name^3",
                           "initial_situation.search^3",
                           "proposed_solution.search^3",
                           "expected_benefits.search^3"
                        ],
                        "operator": "and",
                        "analyzer": "searchAnalyzer"
                     }
                  }
               ]
            }
         }
      }
   }
}

It outputs:

"explanations": [
  {
    "index": "clic",
    "valid": true,
    "explanation": "filtered((
        blended(terms: [proposed_solution.search:cdd, title.name:cdd, expected_benefits.search:cdd, initial_situation.search:cdd]) 
        blended(terms: [proposed_solution.search:contrat à durée déterminée, title.name:contrat à durée déterminée, expected_benefits.search:contrat à durée déterminée, initial_situation.search:contrat à durée déterminée])
    ))
    ->cache(_type:idea)"
  }
]

From what I understand, ES searches for both "cdd" and "contrat à durée déterminée" in all the fields that I mentioned. Thus, it should find documents containing "cdd" or "contrat à durée déterminée". But that is not the case: when I POST a search with the same query, it returns 0 hits.
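One possible reading of the explanation output is that each blended clause looks up a single term, so "contrat à durée déterminée" would have to exist in the index as one term containing spaces. The toy Python sketch below models that situation. It is not Elasticsearch code: the helper names are hypothetical, the word-splitting regex is an assumption standing in for a word-based tokenizer, and the other filters in the chain (asciifolding, elision, stop words and so on) are ignored.

```python
import re
from collections import defaultdict


def index_terms(text):
    # Assumption: the field is split into lowercased single words,
    # roughly as a word-based tokenizer would do.
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    # Toy inverted index: term -> set of document ids.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in index_terms(text):
            index[term].add(doc_id)
    return index


def term_lookup(index, term):
    # Exact term lookup: the whole string must equal an indexed term.
    return index.get(term, set())


docs = {"3863": "Lorem ... contrat à durée déterminée, Lorem ..."}
index = build_index(docs)

# The two terms reported by the validate API explanation.
query_terms = ["cdd", "contrat à durée déterminée"]
hits = set()
for term in query_terms:
    hits |= term_lookup(index, term)

print(hits)                           # no document matches either term
print(term_lookup(index, "contrat"))  # a single word does match
```

In this model the document is only reachable through its individual words, so neither the acronym nor the single multi-word term finds it. Whether the real index behaves the same way depends on the parts of the analysis chain and mapping that are not shown here.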

I hope my explanation was clear. Any help would be appreciated. Thanks!

