Match query on field with custom analyzer not working properly with operator or minimum_should_match

I have created a custom pattern analyzer for one of my fields. It creates 2 tokens most of the time, but when I use a match query with the AND operator or with minimum_should_match set to 100%, it returns records even if only one of the tokens matches.

Mapping for the index:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "test_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "test_pattern",
            "unique"
          ]
        }
      },
      "filter": {
        "test_pattern": {
          "type": "pattern_capture",
          "preserve_original": 0,
          "patterns": [
            ".*###(\\d*)###(.*###.*###.*)",
            ".*###(.*###.*###.*)"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc_type": {
      "properties": {
        "test_value": {
          "type": "text",
          "analyzer": "test_analyzer"
        }
      }
    }
  }
}

Test docs:

{
  "test_value": "abc###def###12345###jkl###mno###pqr"
}

{
  "test_value": "abc###def###12367###jkl###mno###pqr"
}

Query:

{
  "query": {
    "match": {
      "test_value": {
        "query": "abc###def###12345###jkl###mno###pqr",
        "operator": "AND"
      }
    }
  }
}

This query returns both records.
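
For completeness, the minimum_should_match variant mentioned above behaves the same way for me; a rough sketch of it:

{
  "query": {
    "match": {
      "test_value": {
        "query": "abc###def###12345###jkl###mno###pqr",
        "minimum_should_match": "100%"
      }
    }
  }
}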

I also tried to understand the explanation of the result, but I don't know why there is a Synonym in it. Can you please help me figure out where I am going wrong?

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.46029136,
    "hits": [
      {
        "_shard": "[test_stack][1]",
        "_node": "JO7WIHxLQKW9b_hc8Xm9fQ",
        "_index": "test_stack",
        "_type": "doc_type",
        "_id": "AWkPiO2DN2C8SdyE0d6K",
        "_score": 0.46029136,
        "_source": {
          "test_value": "abc###def###12345###jkl###mno###pqr"
        },
        "_explanation": {
          "value": 0.46029136,
          "description": "weight(Synonym(test_value:12345 test_value:jkl###mno###pqr) in 0) [PerFieldSimilarity], result of:",
          "details": [
            {
              "value": 0.46029136,
              "description": "score(doc=0,freq=2.0 = termFreq=2.0 ), product of:",
              "details": [
                {
                  "value": 0.2876821,
                  "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                  "details": [
                    {
                      "value": 1,
                      "description": "docFreq",
                      "details": []
                    },
                    {
                      "value": 1,
                      "description": "docCount",
                      "details": []
                    }
                  ]
                },
                {
                  "value": 1.6,
                  "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                  "details": [
                    {
                      "value": 2,
                      "description": "termFreq=2.0",
                      "details": []
                    },
                    {
                      "value": 1.2,
                      "description": "parameter k1",
                      "details": []
                    },
                    {
                      "value": 0.75,
                      "description": "parameter b",
                      "details": []
                    },
                    {
                      "value": 2,
                      "description": "avgFieldLength",
                      "details": []
                    },
                    {
                      "value": 1,
                      "description": "fieldLength",
                      "details": []
                    }
                  ]
                }
              ]
            }
          ]
        }
      },
      {
        "_shard": "[test_stack][4]",
        "_node": "JO7WIHxLQKW9b_hc8Xm9fQ",
        "_index": "test_stack",
        "_type": "doc_type",
        "_id": "AWkPiQfJN2C8SdyE0d6L",
        "_score": 0.36165747,
        "_source": {
          "test_value": "abc###def###12378###jkl###mno###pqr"
        },
        "_explanation": {
          "value": 0.3616575,
          "description": "weight(Synonym(test_value:12345 test_value:jkl###mno###pqr) in 0) [PerFieldSimilarity], result of:",
          "details": [
            {
              "value": 0.3616575,
              "description": "score(doc=0,freq=1.0 = termFreq=1.0 ), product of:",
              "details": [
                {
                  "value": 0.2876821,
                  "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                  "details": [
                    {
                      "value": 1,
                      "description": "docFreq",
                      "details": []
                    },
                    {
                      "value": 1,
                      "description": "docCount",
                      "details": []
                    }
                  ]
                },
                {
                  "value": 1.2571429,
                  "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                  "details": [
                    {
                      "value": 1,
                      "description": "termFreq=1.0",
                      "details": []
                    },
                    {
                      "value": 1.2,
                      "description": "parameter k1",
                      "details": []
                    },
                    {
                      "value": 0.75,
                      "description": "parameter b",
                      "details": []
                    },
                    {
                      "value": 2,
                      "description": "avgFieldLength",
                      "details": []
                    },
                    {
                      "value": 1,
                      "description": "fieldLength",
                      "details": []
                    }
                  ]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}

Did you try the _analyze API to understand how your text is transformed at index time and search time?

Yes I did that.

{
  "text": "abc###def###12345###jkl###mno###pqr",
  "analyzer": "test_analyzer"
}

Output:

{
  "tokens": [
    {
      "token": "12345",
      "start_offset": 0,
      "end_offset": 35,
      "type": "word",
      "position": 0
    },
    {
      "token": "jkl###mno###pqr",
      "start_offset": 0,
      "end_offset": 35,
      "type": "word",
      "position": 0
    }
  ]
}

Just for information, I am using Elasticsearch 5.

How is this analyzed?

{
  "text": "abc###def###12367###jkl###mno###pqr",
  "analyzer": "test_analyzer"
}

Output:

{
  "tokens": [
    {
      "token": "12367",
      "start_offset": 0,
      "end_offset": 35,
      "type": "word",
      "position": 0
    },
    {
      "token": "jkl###mno###pqr",
      "start_offset": 0,
      "end_offset": 35,
      "type": "word",
      "position": 0
    }
  ]
}

I believe that, since the second token can be found in both docs, your search matches both documents.

The operator is supposed to tell the query that all tokens must be matched. Hence, it should only match documents that contain all of the generated tokens.

I am seeing something strange: the query is being converted to a SynonymQuery in Lucene. I found this out by looking at the profiling output. Please take a look at this: https://pastebin.com/QGMjbCtf
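
For reference, the profiled search I ran looks roughly like this (a sketch; only "profile": true is added to the query shown earlier):

GET test_stack/_search
{
  "profile": true,
  "query": {
    "match": {
      "test_value": {
        "query": "abc###def###12345###jkl###mno###pqr",
        "operator": "AND"
      }
    }
  }
}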

I agree that it looks strange. Can you try with "and" instead of "AND"?

Tried. Not working.

OK, let me test it and come back.

I agree that it looks like a bug, as this works well:

DELETE test
POST test/_doc
{
  "foo": "12345 jkl"
}
POST test/_doc
{
  "foo": "12367 jkl"
}
GET test/_search
{
  "query": {
    "match": {
      "foo": {
        "query": "12367 jkl",
        "operator": "and"
      }
    }
  }
}

I believe this is because of the token positions in your example. Both tokens are seen as coming from one single position... I believe this is caused by the keyword tokenizer.
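
For comparison, analyzing the working example with the standard analyzer should put the two tokens at different positions (0 and 1) rather than both at position 0, which is why the and operator behaves as expected there. A quick way to check (a sketch, reusing the test index from the snippet above):

GET test/_analyze
{
  "analyzer": "standard",
  "text": "12345 jkl"
}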

@jpountz What is your opinion on this? Bug or expected behavior?

The reproduction script is:

DELETE test
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "test_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "test_pattern",
            "unique"
          ]
        }
      },
      "filter": {
        "test_pattern": {
          "type": "pattern_capture",
          "preserve_original": false,
          "patterns": [
            ".*###(\\d*)###(.*###.*###.*)",
            ".*###(.*###.*###.*)"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "test_value": {
          "type": "text",
          "analyzer": "test_analyzer"
        }
      }
    }
  }
}
POST test/_doc
{
  "test_value": "abc###def###12345###jkl###mno###pqr"
}
POST test/_doc
{
  "test_value": "abc###def###12367###jkl###mno###pqr"
}
GET test/_search
{
  "query": {
    "match": {
      "test_value": {
        "query": "abc###def###12345###jkl###mno###pqr",
        "operator": "and"
      }
    }
  }
}

This query matches both documents because the test_analyzer analyzer produces multiple tokens at the same position. This is because the keyword tokenizer is used here (the pattern_capture token filter then generates its tokens at that same position).

I don't know why the pattern_capture token filter is used here in combination with the keyword tokenizer, but I think you need to use something like the simple_pattern_split tokenizer (and then split on \\#\\#\\#, as sketched below) instead of the keyword tokenizer. That way the tokens are generated at different positions and your query will match only one document.
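
A minimal sketch of what that could look like (index, analyzer and tokenizer names here are just placeholders):

PUT test_split
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "hash_split": {
          "type": "simple_pattern_split",
          "pattern": "###"
        }
      },
      "analyzer": {
        "split_analyzer": {
          "type": "custom",
          "tokenizer": "hash_split",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "test_value": {
          "type": "text",
          "analyzer": "split_analyzer"
        }
      }
    }
  }
}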

If, during query parsing, there are multiple tokens at the same position, then internally Elasticsearch uses Lucene's SynonymQuery. That is what happened here: the keyword tokenizer emits abc###def###12345###jkl###mno###pqr as a single token, and pattern_capture then chops that up into multiple tokens at the same position. Multiple tokens at the same position is essentially what synonyms are.
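
If you want to see that rewrite without running a full profiled search, the validate API with explain shows the Lucene query a request is parsed into (a sketch, reusing the index from the reproduction script):

GET test/_validate/query?explain=true
{
  "query": {
    "match": {
      "test_value": {
        "query": "abc###def###12345###jkl###mno###pqr",
        "operator": "and"
      }
    }
  }
}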


Conceptually, the tokens are different and come from different positions, so they should not be treated as synonyms. This behaviour is confusing to the general user, as it doesn't work as expected.

What are the different options to achieve this? The approach you suggested would break the value into 6 tokens, which is not what I want. I want 12345 and jkl###mno###pqr as the tokens.


@mvg Can you give an update on this?

Sorry, this slipped off my radar.

No, those tokens are all at position 0. Check the _analyze API response that you shared previously in this thread.

I need to understand your use case better in order to provide advice on how to proceed here.
So do documents always contain a field value like abc###def###12367###jkl###mno###pqr?
What is the structure of this string? Is it always 6 alphanumeric substrings separated by ###?
How do your end users search these strings? Is the entire string used as the query, or do they specify one or more of these alphanumeric substrings?

I think either a different tokenizer needs to be used (one that splits the text on some regex), or the test_value field needs to be analyzed in multiple ways using multi-fields; a rough sketch of the latter is below.
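
A sketch of the multi-fields option (field and analyzer names are placeholders; split_analyzer would be something like the pattern-based analyzer sketched earlier, defined in the index settings):

PUT test_multi
{
  "mappings": {
    "_doc": {
      "properties": {
        "test_value": {
          "type": "text",
          "analyzer": "test_analyzer",
          "fields": {
            "parts": {
              "type": "text",
              "analyzer": "split_analyzer"
            }
          }
        }
      }
    }
  }
}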
