simple_query_string and multi-term synonyms

I have a field with the following search_analyzer:

"name_search_en" : {
   "filter" : [
     "english_possessive_stemmer",
     "lowercase",
     "name_synonyms_en",
     "english_stop",
     "english_stemmer",
     "asciifolding"
   ],
   "tokenizer" : "standard"
}

name_synonyms_en is a synonym_graph filter that looks like this:

"name_synonyms_en" : {
  "type" : "synonym_graph",
   "synonyms" : [
      "beach bag => straw bag,beach bag",
      "bicycle,bike"
    ]
 }

When I run the following multi_match query, the synonyms are correctly applied:

{
  "query": {
    "multi_match": {
      "query": "beach bag",
      "auto_generate_synonyms_phrase_query": false,
      "type": "cross_fields",
      "fields": [
        "brand.en-US^1.0",
        "name.en-US^1.0"
      ]
    }
  }
}

Here is the _validate explanation output. Both beach bag and straw bag are present, as expected, in the raw query:

"explanations" : [
{
  "index" : "d7598351-311f-4844-bb91-4f26c9f538f3",
  "valid" : true,
  "explanation" : "+((((+name.en-US:straw +name.en-US:bag) (+name.en-US:beach +name.en-US:bag))) | (brand.en-US:beach brand.en-US:bag)) #DocValuesFieldExistsQuery [field=_primary_term]"
}

]

I would expect the same in the following simple_query_string

{
  "query": {
    "simple_query_string": {
      "query": "beach bag",
      "auto_generate_synonyms_phrase_query": false,
      "fields": [
        "brand.en-US^1.0",
        "name.en-US^1.0"
      ]
    }
  }
}

but the straw bag synonym is not present in the raw query

"explanations" : [
{
  "index" : "d7598351-311f-4844-bb91-4f26c9f538f3",
  "valid" : true,
  "explanation" : "+((name.en-US:beach | brand.en-US:beach)~1.0 (name.en-US:bag | brand.en-US:bag)~1.0) #DocValuesFieldExistsQuery [field=_primary_term]"
}
]

The problem seems to be related to multi-term synonyms only. If I search for bike, the bicycle synonym is correctly present in the query:

"explanations" : [
{
  "index" : "d7598351-311f-4844-bb91-4f26c9f538f3",
  "valid" : true,
  "explanation" : "+(Synonym(name.en-US:bicycl name.en-US:bike) | brand.en-US:bike)~1.0 #DocValuesFieldExistsQuery [field=_primary_term]"
}
]

Is this the expected behaviour (meaning multi-term synonyms are not supported for this query)?

The simple_query_string query parses the query string by breaking on (amongst other things) whitespace, before that string is analyzed. As a result, the analyzer sees the individual tokens "beach" and "bag", but not the synonym "beach bag".
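To make the ordering concrete, here is a toy Python model of "split first, then analyze" versus "analyze the whole string". It is only a sketch, not Elasticsearch code: the `SYNONYMS` table and the `analyze` helper are hypothetical, and the lookup only fires when the entire input equals a rule, which is cruder than a real synonym_graph filter.

```python
# Toy model of the behaviour described above; not Elasticsearch code.
# Hypothetical rule table mirroring "beach bag => straw bag,beach bag".
SYNONYMS = {("beach", "bag"): [("straw", "bag"), ("beach", "bag")]}


def analyze(text):
    """Lowercase, tokenize on whitespace, and expand a multi-term synonym."""
    tokens = tuple(text.lower().split())
    # A multi-term rule can only fire if the analyzer sees all of its
    # tokens in the same input.
    return SYNONYMS.get(tokens, [tokens])


def parse_with_whitespace_split(query):
    """Split the query string first, then analyze each chunk on its own."""
    return [analyze(chunk) for chunk in query.split()]


def parse_without_whitespace_split(query):
    """Hand the whole query string to the analyzer in one piece."""
    return [analyze(query)]


split_result = parse_with_whitespace_split("beach bag")
whole_result = parse_without_whitespace_split("beach bag")
print(split_result)  # "beach" and "bag" analyzed separately, no synonym
print(whole_result)  # one clause carrying both "straw bag" and "beach bag"
```

In the first case the analyzer never sees "beach" and "bag" side by side, so the multi-term rule has nothing to match.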

You can change this behavior with the flags parameter. This defaults to ALL, but you could change that into a subset that does not include WHITESPACE. For example:

{
  "query": {
    "simple_query_string": {
      "query": "beach bag",
      "auto_generate_synonyms_phrase_query": false,
      "flags": "OR|AND|PREFIX", 
      "fields": [
        "brand.en-US^1.0",
        "name.en-US^1.0"
      ]
    }
  }
}

Thank you!

This solves the multi-word synonym problem but introduces another issue. If I don't specify flags, the match works on the specified fields (brand.en-US and name.en-US) as expected. When I specify flags (I tried NONE and OR|AND|PREFIX as values), the match seems to work on all of the documents' fields, while the specified fields are used only for relevance (like a should clause in a bool query). Any idea? Thanks!

I'm sorry, but I don't understand the issue. What is it that you're trying to achieve?

Please don't take into account my previous reply. The issue is somewhere else.
Let me try to summarise what I want to do: I am trying to do a full text search on two fields with different synonym_graph configurations (productTypes.en-US and name.en-US).

Here is the full query:

{
  "query": {
    "simple_query_string": {
      "query": "beach bag",
      "auto_generate_synonyms_phrase_query": false,
      "minimum_should_match": "2<66%",
      "fields": [
        "name.en-US^1.0",
        "productTypes.en-US^1.0"
      ]
    }
  }
}

I dug into the raw queries a bit and I think the issue is related to minimum_should_match. The above query resolves to this:

+(((productTypes.en-US:beach | name.en-US:beach)~1.0 (productTypes.en-US:bag | name.en-US:bag)~1.0)~2) #DocValuesFieldExistsQuery [field=_primary_term]

From my understanding, the ~2 at the end of the raw query is the minimum_should_match value. When I specify "flags": "OR|AND|PREFIX" as you suggested, the query resolves to:

+((productTypes.en-US:beach productTypes.en-US:bag) | (((+name.en-US:straw +name.en-US:bag) (+name.en-US:beach +name.en-US:bag))))~1.0 #DocValuesFieldExistsQuery [field=_primary_term]

As you can see, the multi-word synonyms are correctly applied, but minimum_should_match is no longer included.

Is there something wrong with my query, or is this an issue with simple_query_string?

Thanks!

@abdon

I have created this test case to better explain the problem

The first assert (line 305) fails because the number of hits is 3 since the minimum_should_match is not applied.

Should I file an issue in the elasticsearch repo?

Thanks

I don't think this is a bug, but this is going deep into the internals, so someone else may be better suited to explain this.

My understanding is that if you're not splitting the query string on whitespace, then you're sending Elasticsearch one query clause. With one query clause, minimum_should_match does not make sense. (The docs say this about that: "minimum_should_match: The minimum number of clauses that must match for a document to be returned. See the minimum_should_match documentation for the full list of options.")
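As a rough illustration of why a single clause leaves minimum_should_match with nothing to do, here is a small Python sketch of how a spec such as "2<66%" could be resolved against a clause count. The function name `required_matches` is made up for this example, it handles only plain integers, percentages and one conditional, and it assumes percentages are rounded down; it is not the Elasticsearch implementation.

```python
def required_matches(spec, clause_count):
    """Resolve a minimum_should_match spec for a number of optional clauses.

    Sketch only: supports "2", "66%" and a single conditional like "2<66%".
    Negative values and multiple conditionals are deliberately left out.
    """
    if "<" in spec:
        threshold, rest = spec.split("<", 1)
        # At or below the threshold, every clause is required.
        if clause_count <= int(threshold):
            return clause_count
        spec = rest
    if spec.endswith("%"):
        # Assumption: the percentage result is rounded down.
        needed = clause_count * int(spec[:-1]) // 100
    else:
        needed = int(spec)
    # Never require more clauses than exist.
    return min(needed, clause_count)


for count in (1, 2, 3, 4):
    print(count, required_matches("2<66%", count))
```

With two clauses the spec asks for both, which lines up with the ~2 in the raw query. With a single clause the answer can only be 1.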

Why use the simple_query_string query at all? Are you giving your users the capability to query with AND and OR operators? If not, the match and multi_match queries may be easier to use, as the query parsing of those queries will be simpler. If you do need the simple_query_string query, maybe you can create a bool query with one simple_query_string query that splits on white space, and one that does not? That way you can make use of both functionalities.
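A minimal sketch of that last idea, written as a Python helper that builds the request body. The helper name `build_combined_query` is hypothetical, and the field names and the "2<66%" value are taken from the query earlier in this thread. Note that this is an untested assumption about how to combine the two: both clauses sit in `should`, so a document matching either one is returned, which may be looser than what you want.

```python
import json


def build_combined_query(text, fields, minimum_should_match="2<66%"):
    """Build a bool query with two simple_query_string clauses (sketch)."""
    common = {
        "query": text,
        "fields": fields,
        "auto_generate_synonyms_phrase_query": False,
    }
    # Default flags (ALL): splits on whitespace, so minimum_should_match applies.
    split_clause = {
        "simple_query_string": {**common, "minimum_should_match": minimum_should_match}
    }
    # No WHITESPACE flag: the analyzer sees the whole string, so
    # multi-term synonyms can be expanded.
    whole_clause = {
        "simple_query_string": {**common, "flags": "OR|AND|PREFIX"}
    }
    return {"query": {"bool": {"should": [split_clause, whole_clause]}}}


body = build_combined_query(
    "beach bag", ["name.en-US^1.0", "productTypes.en-US^1.0"]
)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a _search or _validate/query request to inspect how each clause is rewritten.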

I have tried to use a multi_match query but it does not work in my case because minimum_should_match is applied to each field individually (see here). cross_field would not work either because, as the docs say "can only work in term-centric mode on fields that have the same analyzer". Having different analyzers (with different synonym_graph configurations), would make the query work like a most_fields query.

A workaround for this would be to set up synonyms at index time, but I would prefer not to go down that road.

I thought that with query_string or simple_query_string I could find a suitable solution. I dug a bit more into the code and debugged the SimpleQueryParser.

This is what I found: with the WHITESPACE flag, the parser creates a boolean query with one clause per token. As an example, the simple_query_string in the unit test I created generates this bool query:

((otherbody:foo | body:foo)~1.0 (otherbody:bar | body:bar)~1.0)~2

Since boolean queries support minimum_should_match the parameter is correctly applied.

The same query WITHOUT WHITESPACE generates the following DisjunctionMaxQuery

((otherbody:foo otherbody:bar) | (body:foo body:bar))~1.0

As you can see, the original text (foo bar) gets split but the minimum_should_match is not applied because dismax does not support it.
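The difference between the two shapes can be mimicked with a toy matcher in Python. This is only a sketch with made-up documents (`DOCS`) and hypothetical helper names. It models the first shape as "count how many terms match in any field" and the second as "any term in any field is enough", which is how I read the two raw queries.

```python
# Hypothetical documents: field name -> set of indexed terms.
DOCS = {
    "both_terms": {"body": {"foo", "bar"}, "otherbody": set()},
    "only_foo": {"body": {"foo"}, "otherbody": set()},
    "only_bar": {"body": set(), "otherbody": {"bar"}},
}
TERMS = ["foo", "bar"]
FIELDS = ["otherbody", "body"]


def term_centric(doc, minimum_should_match):
    """Bool query of per-term dismax clauses, with minimum_should_match."""
    matched = sum(any(term in doc[f] for f in FIELDS) for term in TERMS)
    return matched >= minimum_should_match


def field_centric(doc):
    """Dismax of per-field bool clauses: one matching term is enough."""
    return any(any(term in doc[f] for term in TERMS) for f in FIELDS)


term_hits = [name for name, doc in DOCS.items() if term_centric(doc, 2)]
field_hits = [name for name, doc in DOCS.items() if field_centric(doc)]
print(term_hits)   # only the document containing both terms
print(field_hits)  # every document containing at least one term
```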

If dismax could be extended to support minimum_should_match it would solve this issue.

Ok. I think this is the expected behaviour. This logic is well explained in this docs paragraph.

In a multi-field query, the minimum_should_match parameter can't be applied.

This is a pity, since it makes it impossible to use multiple fields with multi-term synonyms together with the minimum_should_match parameter.

We are migrating from Solr, where dismax supports Minimum Match. It would be nice to have something like that in Elasticsearch, or at least some alternative way to get the same results. Thanks
