Invalid Search results returned


(David Dale) #1

When I search for the term "no", only a single result is returned. However, when I search for a more specific term like "normal", it returns 3 results, all containing "normal". Why would the more general search yield fewer results? Why would a query for "no" not return every result containing the word "normal"?

Returns Only a Single Result

   {  
      "from":0,
      "size":25,
      "query":{  
         "multi_match":{  
            "query":"no",
            "fields":[  
               "_all"
            ],
            "type":"phrase_prefix",
            "lenient":true
         }
      },
      "post_filter":{  
         "bool":{  
            "must":[  
               {  
                  "range":{  
                     "created":{  
                        "from":"2018-01-01T06:00:00.000Z",
                        "to":"2018-07-30T19:35:06.646Z",
                        "include_lower":true,
                        "include_upper":true
                     }
                  }
               },
               {  
                  "terms":{  
                     "payer.id":[  
                        15350
                     ]
                  }
               },
               {  
                  "terms":{  
                     "status":[  
                        "created",
                        "errored",
                        "publish_complete",
                        "reconcile_in_progress",
                        "reconcile_approved",
                        "reconcile_complete",
                        "reconcile_failed",
                        "reconcile_cancelled",
                        "reconcile_declined",
                        "duplicate"
                     ]
                  }
               }
            ]
         }
      },
      "sort":[  
         {  
            "created":{  
               "order":"desc"
            }
         }
      ]
   }

Yields 3 Records

{  
   "from":0,
   "size":25,
   "query":{  
      "multi_match":{  
         "query":"normal",
         "fields":[  
            "_all"
         ],
         "type":"phrase_prefix",
         "lenient":true
      }
   },
   "post_filter":{  
      "bool":{  
         "must":[  
            {  
               "range":{  
                  "created":{  
                     "from":"2018-01-01T06:00:00.000Z",
                     "to":"2018-07-30T19:35:06.646Z",
                     "include_lower":true,
                     "include_upper":true
                  }
               }
            },
            {  
               "terms":{  
                  "payer.id":[  
                     15350
                  ]
               }
            },
            {  
               "terms":{  
                  "status":[  
                     "created",
                     "errored",
                     "publish_complete",
                     "reconcile_in_progress",
                     "reconcile_approved",
                     "reconcile_complete",
                     "reconcile_failed",
                     "reconcile_cancelled",
                     "reconcile_declined",
                     "duplicate"
                  ]
               }
            }
         ]
      }
   },
   "sort":[  
      {  
         "created":{  
            "order":"desc"
         }
      }
   ]
}

(David Pilato) #2

Probably because you indexed "normal" with the default analyzer, which stores it as the single token "normal".
"no" is not "normal", so it does not match.

Maybe use a different analyzer that produces subtokens like no, nor, norm, norma and normal? An edge ngram based analyzer, for example...
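
To make the suggestion above concrete, here is a minimal Python sketch of the subtokens an edge ngram filter would emit for one token. It only mimics the idea and is not Elasticsearch's implementation; the function name edge_ngrams and the min_gram/max_gram values are assumptions chosen for illustration.

```python
def edge_ngrams(token, min_gram=2, max_gram=15):
    """Return the prefixes of `token` whose length is between
    min_gram and max_gram (hypothetical helper, for illustration only)."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


print(edge_ngrams("normal"))
# ['no', 'nor', 'norm', 'norma', 'normal']
```

With prefixes like these stored in the index, a search for "no" has an indexed term to match, which is not the case when only the whole word "normal" is stored.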

If you don't succeed, could you provide a full recreation script as described in About the Elasticsearch category. It will help to better understand what you are doing. Please, try to keep the example as simple as possible.

A full reproduction script will help readers to understand, reproduce and if needed fix your problem. It will also most likely help to get a faster answer.


(David Dale) #3

Thank you for the response. I am looking into how we can adjust the analyzer that we use. Once that has been completed we will reindex and re-run the test. I will put the details about the analyzer work here as well.


(David Dale) #4

Just a quick update on this. I have put code in place to generate the schema with ngram analyzers on the payeeName and supplierCode fields.

{  
   "mappings":{
      "Payment":{
         "properties":{  
            "payeeName":{  
               "type":"string",
               "search_analyzer":"ngram_analyzer",
               "analyzer":"ngram_analyzer"
            },
            "supplierCode":{  
               "type":"string",
               "search_analyzer":"ngram_analyzer",
               "analyzer":"ngram_analyzer"
            },
            "id":{  
               "type":"long"
            }
         }
      }
   },
   "settings":{
      "analysis":{  
         "analyzer":{  
            "ngram_analyzer":{  
               "tokenizer":"standard",
               "type":"custom",
               "filter":[  
                  "lowercase",
                  "ngram_filter"
               ]
            }
         },
         "filter":{  
            "ngram_filter":{  
               "type":"nGram",
               "min_gram":2,
               "max_gram":15
            }
         }
      }
   }
}

I have removed and recreated the index with the definition above and reindexed the payments, but partial searches for payeeName are still failing to return valid results. Is there a way to peek at which ngrams are being created?

Using this allows me to see how the analyzer is going to work.

curl 'localhost:9200/com.csi.model.payments/_analyze?pretty=1&analyzer=ngram_analyzer' -d 'FC Schalke 04'

And that yields

{
  "tokens" : [ {
    "token" : "fc",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "sc",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "sch",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "scha",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "schal",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "schalk",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "schalke",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "ch",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "cha",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "chal",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "chalk",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "chalke",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "ha",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "hal",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "halk",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "halke",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "al",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "alk",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "alke",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "lk",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "lke",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "ke",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "04",
    "start_offset" : 11,
    "end_offset" : 13,
    "type" : "word",
    "position" : 2
  } ]
}
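
The token list above is what a plain (non-edge) ngram filter with min_gram 2 and max_gram 15 would produce for each lowercased word. The rough Python sketch below reproduces it under simplifying assumptions: the helper names ngrams and analyze are hypothetical, and a whitespace split stands in for the standard tokenizer, which is only an approximation.

```python
def ngrams(token, min_gram=2, max_gram=15):
    """Return every substring of `token` with length in [min_gram, max_gram],
    ordered by start offset and then by length."""
    grams = []
    for start in range(len(token)):
        for length in range(min_gram, max_gram + 1):
            end = start + length
            if end > len(token):
                break  # longer grams from this offset would run past the token
            grams.append(token[start:end])
    return grams


def analyze(text, min_gram=2, max_gram=15):
    """Lowercase, split on whitespace (a stand-in for the standard tokenizer),
    then expand each word into its n-grams."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(ngrams(word, min_gram, max_gram))
    return tokens


print(analyze("FC Schalke 04"))
```

For "FC Schalke 04" this prints the same 23 tokens in the same order as the _analyze output, starting with "fc" and ending with "04". In this sketch a word shorter than min_gram produces no tokens at all.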

Should I be using a tokenizer instead of a filter?

Here is the query.

curl "localhost:9200/com.csi.model.payments/Payment/_search?pretty=true" -d '{
  "from" : 0,
  "size" : 25,
  "query" : {
    "multi_match" : {
      "query" : "auto",
      "fields" : [ "_all" ],
      "type" : "phrase_prefix",
      "lenient" : true
    }
  },
  "post_filter" : {
    "bool" : {
      "must" : [ {
        "range" : {
          "created" : {
            "from" : "2018-01-01T05:00:00.000Z",
            "to" : "2018-08-28T17:53:10.866Z",
            "include_lower" : true,
            "include_upper" : true
          }
        }
      }, {
        "terms" : {
          "payer.id" : [ 11750 ]
        }
      }, {
        "terms" : {
          "status" : [ "created", "errored", "publish_complete", "reconcile_in_progress", "reconcile_approved", "reconcile_complete", "reconcile_failed", "reconcile_cancelled", "reconcile_declined", "duplicate" ]
        }
      } ]
    }
  },
  "sort" : [ {
    "created" : {
      "order" : "desc"
    }
  } ]
}'

The payeeName is "Automated 1534436722180 1534436722180 Vendor 1"

