Match query on field with custom analyzer not working properly with operator or minimum_should_match

I have created a custom pattern analyzer for one of my fields. It creates 2 tokens most of the time, but when I use a match query with the AND operator or with minimum_should_match set to 100%, it returns records even if only one of the tokens matches.

Mapping for the index:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "test_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "test_pattern",
            "unique"
          ]
        }
      },
      "filter": {
        "test_pattern": {
          "type": "pattern_capture",
          "preserve_original": 0,
          "patterns": [
            ".*###(\\d*)###(.*###.*###.*)",
            ".*###(.*###.*###.*)"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc_type": {
      "properties": {
        "test_value": {
          "type": "text",
          "analyzer": "test_analyzer"
        }
      }
    }
  }
}

Test docs:

{
  "test_value": "abc###def###12345###jkl###mno###pqr"
}

{
  "test_value": "abc###def###12367###jkl###mno###pqr"
}

Query:

{
  "query": {
    "match": {
      "test_value": {
        "query": "abc###def###12345###jkl###mno###pqr",
        "operator": "AND"
      }
    }
  }
}

This query returns both records.
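
For completeness, the minimum_should_match variant mentioned above behaves the same way for me; a rough sketch of it:

{
  "query": {
    "match": {
      "test_value": {
        "query": "abc###def###12345###jkl###mno###pqr",
        "minimum_should_match": "100%"
      }
    }
  }
}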

I also tried to understand the explanation of the result, but I don't know why there is a Synonym in it. Can you please help me figure out where I am going wrong?

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.46029136,
    "hits": [
      {
        "_shard": "[test_stack][1]",
        "_node": "JO7WIHxLQKW9b_hc8Xm9fQ",
        "_index": "test_stack",
        "_type": "doc_type",
        "_id": "AWkPiO2DN2C8SdyE0d6K",
        "_score": 0.46029136,
        "_source": {
          "test_value": "abc###def###12345###jkl###mno###pqr"
        },
        "_explanation": {
          "value": 0.46029136,
          "description": "weight(Synonym(test_value:12345 test_value:jkl###mno###pqr) in 0) [PerFieldSimilarity], result of:",
          "details": [
            {
              "value": 0.46029136,
              "description": "score(doc=0,freq=2.0 = termFreq=2.0 ), product of:",
              "details": [
                {
                  "value": 0.2876821,
                  "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                  "details": [
                    {
                      "value": 1,
                      "description": "docFreq",
                      "details": []
                    },
                    {
                      "value": 1,
                      "description": "docCount",
                      "details": []
                    }
                  ]
                },
                {
                  "value": 1.6,
                  "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                  "details": [
                    {
                      "value": 2,
                      "description": "termFreq=2.0",
                      "details": []
                    },
                    {
                      "value": 1.2,
                      "description": "parameter k1",
                      "details": []
                    },
                    {
                      "value": 0.75,
                      "description": "parameter b",
                      "details": []
                    },
                    {
                      "value": 2,
                      "description": "avgFieldLength",
                      "details": []
                    },
                    {
                      "value": 1,
                      "description": "fieldLength",
                      "details": []
                    }
                  ]
                }
              ]
            }
          ]
        }
      },
      {
        "_shard": "[test_stack][4]",
        "_node": "JO7WIHxLQKW9b_hc8Xm9fQ",
        "_index": "test_stack",
        "_type": "doc_type",
        "_id": "AWkPiQfJN2C8SdyE0d6L",
        "_score": 0.36165747,
        "_source": {
          "test_value": "abc###def###12378###jkl###mno###pqr"
        },
        "_explanation": {
          "value": 0.3616575,
          "description": "weight(Synonym(test_value:12345 test_value:jkl###mno###pqr) in 0) [PerFieldSimilarity], result of:",
          "details": [
            {
              "value": 0.3616575,
              "description": "score(doc=0,freq=1.0 = termFreq=1.0 ), product of:",
              "details": [
                {
                  "value": 0.2876821,
                  "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                  "details": [
                    {
                      "value": 1,
                      "description": "docFreq",
                      "details": []
                    },
                    {
                      "value": 1,
                      "description": "docCount",
                      "details": []
                    }
                  ]
                },
                {
                  "value": 1.2571429,
                  "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                  "details": [
                    {
                      "value": 1,
                      "description": "termFreq=1.0",
                      "details": []
                    },
                    {
                      "value": 1.2,
                      "description": "parameter k1",
                      "details": []
                    },
                    {
                      "value": 0.75,
                      "description": "parameter b",
                      "details": []
                    },
                    {
                      "value": 2,
                      "description": "avgFieldLength",
                      "details": []
                    },
                    {
                      "value": 1,
                      "description": "fieldLength",
                      "details": []
                    }
                  ]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}

Did you try the _analyze API to understand how your text is transformed at index time and search time?

Yes I did that.

{
  "text": "abc###def###12345###jkl###mno###pqr",
  "analyzer": "test_analyzer"
}

Output:

{
  "tokens": [
    {
      "token": "12345",
      "start_offset": 0,
      "end_offset": 35,
      "type": "word",
      "position": 0
    },
    {
      "token": "jkl###mno###pqr",
      "start_offset": 0,
      "end_offset": 35,
      "type": "word",
      "position": 0
    }
  ]
}

Just for information, I am using Elasticsearch 5.

How is this analyzed?

{
  "text": "abc###def###12367###jkl###mno###pqr",
  "analyzer": "test_analyzer"
}

Output:

{
  "tokens": [
    {
      "token": "12367",
      "start_offset": 0,
      "end_offset": 35,
      "type": "word",
      "position": 0
    },
    {
      "token": "jkl###mno###pqr",
      "start_offset": 0,
      "end_offset": 35,
      "type": "word",
      "position": 0
    }
  ]
}

I believe that, since the second token can be found in both docs, your search matches both documents.

The operator is supposed to tell the query that all tokens must be matched. Hence, it should only match documents that contain all of the generated tokens.

I am seeing something strange: the query is being converted to a SynonymQuery in Lucene. I found this out by looking at the profiling output. Please take a look at this: https://pastebin.com/QGMjbCtf
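
For reference, the profiled search I ran looks roughly like this (a sketch; only "profile": true is added to the query shown earlier):

GET test_stack/_search
{
  "profile": true,
  "query": {
    "match": {
      "test_value": {
        "query": "abc###def###12345###jkl###mno###pqr",
        "operator": "AND"
      }
    }
  }
}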

I agree that it looks strange. Can you try with "and" instead of "AND"?

Tried. Not working.

OK, let me test it and come back.

I agree that it looks like a bug, as this works well:

DELETE test
POST test/_doc
{
  "foo": "12345 jkl"
}
POST test/_doc
{
  "foo": "12367 jkl"
}
GET test/_search
{
  "query": {
    "match": {
      "foo": {
        "query": "12367 jkl",
        "operator": "and"
      }
    }
  }
}

I believe this is because of the token positions in your example. Both tokens are seen as coming from one single position... I believe this is caused by the keyword tokenizer.
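
For comparison, analyzing the working example with the standard analyzer should put the two tokens at different positions (0 and 1) rather than both at position 0, which is why the and operator behaves as expected there. A quick way to check (a sketch, reusing the test index from the snippet above):

GET test/_analyze
{
  "analyzer": "standard",
  "text": "12345 jkl"
}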

@jpountz What is your opinion on this? Bug or expected behavior?

The reproduction script is:

DELETE test
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "test_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "test_pattern",
            "unique"
          ]
        }
      },
      "filter": {
        "test_pattern": {
          "type": "pattern_capture",
          "preserve_original": false,
          "patterns": [
            ".*###(\\d*)###(.*###.*###.*)",
            ".*###(.*###.*###.*)"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "test_value": {
          "type": "text",
          "analyzer": "test_analyzer"
        }
      }
    }
  }
}
POST test/_doc
{
  "test_value": "abc###def###12345###jkl###mno###pqr"
}
POST test/_doc
{
  "test_value": "abc###def###12367###jkl###mno###pqr"
}
GET test/_search
{
  "query": {
    "match": {
      "test_value": {
        "query": "abc###def###12345###jkl###mno###pqr",
        "operator": "and"
      }
    }
  }
}

This query matches both documents because the test_analyzer analyzer produces multiple tokens at the same position. This is because the keyword tokenizer is used here (the pattern_capture token filter then generates its tokens at that same position).

I don't know why the pattern_capture token filter is used here in combination with the keyword tokenizer, but I think you need to use something like the simple_pattern_split tokenizer (and then split on \\#\\#\\#, as sketched below) instead of the keyword tokenizer. That way the tokens are generated at different positions and your query will match only one document.
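
A minimal sketch of what that could look like (index, analyzer and tokenizer names here are just placeholders):

PUT test_split
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "hash_split": {
          "type": "simple_pattern_split",
          "pattern": "###"
        }
      },
      "analyzer": {
        "split_analyzer": {
          "type": "custom",
          "tokenizer": "hash_split",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "test_value": {
          "type": "text",
          "analyzer": "split_analyzer"
        }
      }
    }
  }
}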

If, during query parsing, there are multiple tokens at the same position, then internally Elasticsearch uses Lucene's SynonymQuery. That is what happened here: the keyword tokenizer emits abc###def###12345###jkl###mno###pqr as a single token, and pattern_capture then chops that up into multiple tokens at the same position. Multiple tokens at the same position is essentially what synonyms are.
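
If you want to see that rewrite without running a full profiled search, the validate API with explain shows the Lucene query a request is parsed into (a sketch, reusing the index from the reproduction script):

GET test/_validate/query?explain=true
{
  "query": {
    "match": {
      "test_value": {
        "query": "abc###def###12345###jkl###mno###pqr",
        "operator": "and"
      }
    }
  }
}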


Conceptually, the tokens are different and come from different positions, so they should not be treated as synonyms. This behaviour is confusing to the general user, as it doesn't work as expected.

What are the different options to achieve this? The approach you suggested would break the value into 6 tokens, which is not what I want. I want 12345 and jkl###mno###pqr as the tokens.


@mvg Can you give an update on this?

Sorry, this slipped off my radar.

No, those tokens are all at position 0. Check the _analyze API response that you shared previously in this thread.

I need to understand your use case better in order to provide advice on how to proceed here.
So do documents always contain a field value like abc###def###12367###jkl###mno###pqr?
What is the structure of this string? Is it always 6 alphanumeric substrings separated by ###?
How do your end users search these strings? Is the entire string used as the query, or do they specify one or more of these alphanumeric substrings?

I think either a different tokenizer needs to be used (one that splits the text on some regex), or the test_value field needs to be analyzed in multiple ways using multi-fields; a rough sketch of the latter is below.
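
A sketch of the multi-fields option (field and analyzer names are placeholders; split_analyzer would be something like the pattern-based analyzer sketched earlier, defined in the index settings):

PUT test_multi
{
  "mappings": {
    "_doc": {
      "properties": {
        "test_value": {
          "type": "text",
          "analyzer": "test_analyzer",
          "fields": {
            "parts": {
              "type": "text",
              "analyzer": "split_analyzer"
            }
          }
        }
      }
    }
  }
}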
