Incorrect match pattern with bool/must/filter on filenames

Hi, I am trying to create a simple query on file extensions, but somehow the search does not find all the results.

This query returns 8 hits.

    {
      "from": 0,
      "size": 10,
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "Datatype": {
                  "type": "phrase",
                  "query": "the type"
                }
              }
            }
          ]
        }
      }
    }

Response (abridged to the `_source` of each hit):

    {
      "hits": [
        { "_source": { "Datatype": "the type", "FileName": "a-b.xyz" } },
        { "_source": { "Datatype": "the type", "FileName": "b-c.xyz" } },
        { "_source": { "Datatype": "the type", "FileName": "d-x.xyz" } },
        { "_source": { "Datatype": "the type", "FileName": "aa-aa.xyz" } },
        { "_source": { "Datatype": "the type", "FileName": "ddfsdf-ddf.xyz" } },
        { "_source": { "Datatype": "the type", "FileName": "1234-sdd.xyz" } },
        { "_source": { "Datatype": "the type", "FileName": "31502-sdsd.xyz" } },
        { "_source": { "Datatype": "the type", "FileName": "16104-ss.xyz" } }
      ]
    }

If I filter on only the FileName, I retrieve only 3 hits:

{
  "from": 0,
  "size": 10,
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "FileName": {
              "query": "xyz"
            }
          }
        }
      ]
    }
  }
}

(I tried with must and should instead of filter, too.)

I only get 3 of the 8 hits above, while I expect all 8. I am having a hard time troubleshooting this.
Any guidance on what could be going wrong?

The `.` does not seem to be a special character. My FileName field is indexed as text, and the values are usually of the form 123-abc.xyz.

Thanks,
Jon

Are you using the default analyzer on the FileName field? The standard
tokenizer should decompose that field with 'xyz' as one of the terms. The
perplexing part is that you have a match some of the time, but not all of
the time. Seems like it would be an all-or-nothing type of scenario.

I would use the Analyze API to see exactly how the field is being analyzed
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html

Use the notation where you specify the field, rather than specifying the
analyzers and filters explicitly.
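
For example, something along these lines (assuming your index is called `myindex`; adjust the name and sample text):

```json
GET myindex/_analyze
{
  "field": "FileName",
  "text": "24022-ABC.xyz"
}
```

This runs the text through whatever analyzer is configured on the FileName field in the mapping, so you see the exact terms that end up in the index.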

Alternatively, you can use the Explain API to see why certain documents are
being matched:
https://www.elastic.co/guide/en/elasticsearch/reference/5.4/search-explain.html
Once you see why a certain document was matched, you can duplicate the
logic.
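
A sketch of an explain request (the index, type, and document id here are placeholders):

```json
GET myindex/ResultRow/some_doc_id/_explain
{
  "query": {
    "match": {
      "FileName": "xyz"
    }
  }
}
```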

Interesting! Thanks.

Here are two filenames and their analyzer response:
24022-ABC.xyz

{
  "tokens": [
    {
      "token": "24022",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<NUM>",
      "position": 0
    },
    {
      "token": "abc.xyz",
      "start_offset": 6,
      "end_offset": 13,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

and 22210-ABC1.xyz

{
  "tokens": [
    {
      "token": "22210",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<NUM>",
      "position": 0
    },
    {
      "token": "abc1",
      "start_offset": 6,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "xyz",
      "start_offset": 11,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}

The difference is that in the second filename there is a numeric character just before the `.`, so `xyz` becomes its own token, and the documents that match are exactly the ones tokenized that way.

I tried the filter part alone against a single index row with the Explain API and obtained the following, though I am not sure how to interpret it:

{
  "_index" : "index",
  "_type" : "ResultRow",
  "_id" : "bc593da7_994f_4b28_8075_057b2fc9cd84",
  "matched" : false,
  "explanation" : {
    "value" : 0.0,
    "description" : "Failure to meet condition(s) of required/prohibited clause(s)",
    "details" : [
      {
        "value" : 0.0,
        "description" : "no match on required clause ((ConstantScore(FileName:xyz))^0.0)",
        "details" : [
          {
            "value" : 0.0,
            "description" : "ConstantScore(FileName:xyz) doesn't match id 4697498",
            "details" : [ ]
          }
        ]
      },
      {
        "value" : 0.0,
        "description" : "match on required clause, product of:",
        "details" : [
          {
            "value" : 0.0,
            "description" : "# clause",
            "details" : [ ]
          },
          {
            "value" : 1.0,
            "description" : "_type:ResultRow, product of:",
            "details" : [
              {
                "value" : 1.0,
                "description" : "boost",
                "details" : [ ]
              },
              {
                "value" : 1.0,
                "description" : "queryNorm",
                "details" : [ ]
              }
            ]
          }
        ]
      }
    ]
  }
}

It seems the pattern analyzer splits things in a more appropriate way: `/_analyze?text=22210-ABC.xyz&analyzer=pattern` gives me 3 tokens (22210, abc, and xyz), but it still does not work when I query:

    "filter": [
      {
        "match": {
          "FileName": {
            "analyzer": "pattern",
            "query": "xyz"
          }
        }
      }
    ]

Apparently the standard tokenizer does not split "ABC.xyz" into separate
tokens. The default pattern analyzer [1] should get you closer to your
goal. If not, you can customize the pattern or use a custom analyzer with a
pattern tokenizer.

[1]
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-analyzer.html
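
A customized pattern analyzer in the index settings might look something like this (a sketch; the index and analyzer names are placeholders, and this pattern, which splits on any run of non-word characters, is also the default):

```json
PUT myindex
{
  "settings": {
    "analysis": {
      "analyzer": {
        "filename_analyzer": {
          "type": "pattern",
          "pattern": "\\W+",
          "lowercase": true
        }
      }
    }
  }
}
```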

Yes, I noticed that, but somehow I am not able to apply the pattern analyzer in my query (see the text block above). It seems to just ignore the analyzer field in the JSON.

Unfortunately, you would need to reindex your content since the correct
tokens are not in your index. As long as you use a match query or one of
its variants, the correct analyzer will be used. The pattern analyzer needs
to be defined in the mapping for that field before content is indexed.
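
For example (a sketch; the index name is a placeholder, and you would reindex your content after creating this):

```json
PUT myindex
{
  "mappings": {
    "ResultRow": {
      "properties": {
        "FileName": {
          "type": "text",
          "analyzer": "pattern"
        }
      }
    }
  }
}
```

After reindexing, a plain match query on FileName for "xyz" uses the same pattern analyzer at search time, so the query tokens line up with the indexed tokens.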

Ivan

For those who are interested: I ended up indexing the extension in a separate field.
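
Roughly like this (a sketch; `FileExtension` is the name I chose, populated at index time by splitting the filename on the last `.`):

```json
"FileExtension": {
  "type": "keyword"
}
```

which I then query with a simple term query:

```json
{
  "query": {
    "term": {
      "FileExtension": "xyz"
    }
  }
}
```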

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.