Incorrect match pattern with bool/must/filter on filenames

Hi, I am trying to create a simple query on file extensions, but somehow the search does not find all the results.

This query returns 8 hits.

    {
      "from": 0,
      "size": 10,
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "Datatype": {
                  "type": "phrase",
                  "query": "the type"
                }
              }
            }
          ]
        }
      }
    }

Response (abridged to the `_source` of each hit):

    {
      "hits": [
        { "_source": { "Datatype": "the type", "FileName": "a-b.xyz" } },
        { "_source": { "Datatype": "the type", "FileName": "b-c.xyz" } },
        { "_source": { "Datatype": "the type", "FileName": "d-x.xyz" } },
        { "_source": { "Datatype": "the type", "FileName": "aa-aa.xyz" } },
        { "_source": { "Datatype": "the type", "FileName": "ddfsdf-ddf.xyz" } },
        { "_source": { "Datatype": "the type", "FileName": "1234-sdd.xyz" } },
        { "_source": { "Datatype": "the type", "FileName": "31502-sdsd.xyz" } },
        { "_source": { "Datatype": "the type", "FileName": "16104-ss.xyz" } }
      ]
    }

If I filter on only the FileName, I retrieve only 3 hits:

{
  "from": 0,
  "size": 10,
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "FileName": {
              "query": "xyz"
            }
          }
        }
      ]
    }
  }
}

(I tried with must and should instead of filter, too.)

I only get 3 of the 8 hits above, while I expect all 8. I am having a hard time troubleshooting this.
Any guidance on what could be going wrong?

The `.` does not seem to be a special character. My FileName field is indexed as text, and the values are usually of the form 123-abc.xyz.

Thanks,
Jon

Are you using the default analyzer on the FileName field? The standard
tokenizer should decompose that field with 'xyz' as one of the terms. The
perplexing part is that you have a match some of the time, but not all of
the time. Seems like it would be an all-or-nothing type of scenario.

I would use the Analyze API to see exactly how the field is being analyzed
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html

Use the notation where you specify the field, rather than specifying the
analyzers and filters explicitly.
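
For example, something along these lines (assuming your index is called `myindex`; adjust the name and sample text):

```json
GET myindex/_analyze
{
  "field": "FileName",
  "text": "24022-ABC.xyz"
}
```

This runs the text through whatever analyzer is configured on the FileName field in the mapping, so you see the exact terms that end up in the index.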

Alternatively, you can use the Explain API to see why certain documents are
being matched:
https://www.elastic.co/guide/en/elasticsearch/reference/5.4/search-explain.html
Once you see why a certain document was matched, you can duplicate the
logic.
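
A sketch of an explain request (the index, type, and document id here are placeholders):

```json
GET myindex/ResultRow/some_doc_id/_explain
{
  "query": {
    "match": {
      "FileName": "xyz"
    }
  }
}
```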

Interesting! Thanks.

Here are two filenames and their analyzer response:
24022-ABC.xyz

{
  "tokens": [
    {
      "token": "24022",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<NUM>",
      "position": 0
    },
    {
      "token": "abc.xyz",
      "start_offset": 6,
      "end_offset": 13,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

and 22210-ABC1.xyz

{
  "tokens": [
    {
      "token": "22210",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<NUM>",
      "position": 0
    },
    {
      "token": "abc1",
      "start_offset": 6,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "xyz",
      "start_offset": 11,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}

The difference is that in the second filename there is a numeric character just before the `.`, so `xyz` becomes its own token, and the documents that match are exactly the ones tokenized that way.

I tried the filter part alone against a single index row with the Explain API and obtained the following, though I am not sure how to interpret it:

{
  "_index" : "index",
  "_type" : "ResultRow",
  "_id" : "bc593da7_994f_4b28_8075_057b2fc9cd84",
  "matched" : false,
  "explanation" : {
    "value" : 0.0,
    "description" : "Failure to meet condition(s) of required/prohibited clause(s)",
    "details" : [
      {
        "value" : 0.0,
        "description" : "no match on required clause ((ConstantScore(FileName:xyz))^0.0)",
        "details" : [
          {
            "value" : 0.0,
            "description" : "ConstantScore(FileName:xyz) doesn't match id 4697498",
            "details" : [ ]
          }
        ]
      },
      {
        "value" : 0.0,
        "description" : "match on required clause, product of:",
        "details" : [
          {
            "value" : 0.0,
            "description" : "# clause",
            "details" : [ ]
          },
          {
            "value" : 1.0,
            "description" : "_type:ResultRow, product of:",
            "details" : [
              {
                "value" : 1.0,
                "description" : "boost",
                "details" : [ ]
              },
              {
                "value" : 1.0,
                "description" : "queryNorm",
                "details" : [ ]
              }
            ]
          }
        ]
      }
    ]
  }
}

It seems the pattern analyzer splits things in a more appropriate way: `/_analyze?text=22210-ABC.xyz&analyzer=pattern` gives me 3 tokens (22210, abc, and xyz), but it still does not work when I query:

    "filter": [
      {
        "match": {
          "FileName": {
            "analyzer": "pattern",
            "query": "xyz"
          }
        }
      }
    ]

Apparently the standard tokenizer does not split "ABC.xyz" into separate
tokens. The default pattern analyzer [1] should get you closer to your
goal. If not, you can customize the pattern or use a custom analyzer with a
pattern tokenizer.

[1]
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-analyzer.html
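
A customized pattern analyzer in the index settings might look something like this (a sketch; the index and analyzer names are placeholders, and this pattern, which splits on any run of non-word characters, is also the default):

```json
PUT myindex
{
  "settings": {
    "analysis": {
      "analyzer": {
        "filename_analyzer": {
          "type": "pattern",
          "pattern": "\\W+",
          "lowercase": true
        }
      }
    }
  }
}
```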

Yes, I noticed that, but somehow I am not able to apply the pattern analyzer in my query (see the text block above). It seems to just ignore the analyzer field in the JSON.

Unfortunately, you would need to reindex your content since the correct
tokens are not in your index. As long as you use a match query or one of
its variants, the correct analyzer will be used. The pattern analyzer needs
to be defined in the mapping for that field before content is indexed.
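
For example (a sketch; the index name is a placeholder, and you would reindex your content after creating this):

```json
PUT myindex
{
  "mappings": {
    "ResultRow": {
      "properties": {
        "FileName": {
          "type": "text",
          "analyzer": "pattern"
        }
      }
    }
  }
}
```

After reindexing, a plain match query on FileName for "xyz" uses the same pattern analyzer at search time, so the query tokens line up with the indexed tokens.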

Ivan

For those who are interested: I ended up indexing the extension in a separate field.
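
Roughly like this (a sketch; `FileExtension` is the name I chose, populated at index time by splitting the filename on the last `.`):

```json
"FileExtension": {
  "type": "keyword"
}
```

which I then query with a simple term query:

```json
{
  "query": {
    "term": {
      "FileExtension": "xyz"
    }
  }
}
```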

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.