Boolean query always returns boosted score

stefAnger · June 17, 2021, 4:10pm

Hi,
I am trying to replace a MySQL "product search" with Elasticsearch.

The MySQL search queries a single table that contains both the information needed to present the search results and the source data on which the search is performed.
This table is copied to an Elasticsearch index (via Logstash). The primary key of the "search table" is the document id in the index, and all columns of a row form the body of the document.

The MySQL search is relatively complex, so the Elasticsearch implementation has to cover the following functional aspects:

The "[search word]" has to be matched with wildcards, i.e. it has to behave like the SQL LIKE operator with "%[search word]%".
The search has to be performed on a list of fields, each of which can optionally be boosted.
There is a filter list: the search result must not contain any document whose field value matches any value in this list. Example field name: source_table_pk.
There is a term field: the search result must only contain documents that match the defined value for this field. Example field name: language.
The search result must be collapsed on a field, meaning there can be only one document per value of the specified field. Example field name: source_table_pk.
The result has to be filtered by a minimum score: it must not contain any document whose score is below the minimum.
The result is ordered by score (which I think is the default behaviour).

My current implementation

GET /search/_search?pretty
{
  "query": {
    "bool": {
      "should": {
        "query_string": {
          "query": "*cola*",
          "fields": [
            "keywords^2.0",
            "supplier^1.7",
            "headline^1.5",
            "claim^1.2",
            "content^0.6"
          ],
          "fuzziness": 0
        }
      },
      "filter": [
        {
          "terms": {
            "source_table_pk": [
              12345,
              45678
            ]
          }
        },
        {
          "term": {
            "language": "en"
          }
        }
      ]
    }
  },
  "from": 0,
  "size": 17,
  "collapse": {
    "field": "source_table_pk"
  },
  "sort": [
    {
      "ranking": {
        "order": "desc"
      }
    },
    "_score"
  ]
}
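
Checked against the requirements listed above, the request could be rearranged as in the following sketch. It rests on two assumptions that are not confirmed anywhere in this thread: that the id list is meant as an exclusion list (a `terms` clause under `filter` keeps only the listed ids, whereas `must_not` removes them), and that documents without a text match should be dropped (a `should` clause next to a `filter` clause is optional by default, so it only affects the score). Field names, boosts and values are taken unchanged from the request above.

```
# Sketch only: same fields, boosts and values as above, clauses rearranged.
GET /search/_search?pretty
{
  "query": {
    "bool": {
      "must": {
        "query_string": {
          "query": "*cola*",
          "fields": [
            "keywords^2.0",
            "supplier^1.7",
            "headline^1.5",
            "claim^1.2",
            "content^0.6"
          ],
          "fuzziness": 0
        }
      },
      "must_not": {
        "terms": {
          "source_table_pk": [12345, 45678]
        }
      },
      "filter": {
        "term": {
          "language": "en"
        }
      }
    }
  },
  "from": 0,
  "size": 17,
  "collapse": {
    "field": "source_table_pk"
  },
  "sort": [
    { "ranking": { "order": "desc" } },
    "_score"
  ]
}
```

The `sort` block is kept as in the original request, so results are still ordered by `ranking` first and by `_score` second.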

Specific
To implement the score filter, I use "min_score" and execute a count query first, because "min_score" leads to an exception on a boolean query.

Questions

Are there general mistakes in my approach? Is there a better approach than my query?
The "*[search term]*" syntax in combination with boosted fields leads to odd scoring values. It seems the first hit is simply scored as the boost value multiplied by one. I need a more precise scoring, where every occurrence of the search word contributes to the score. Strangely, I can achieve that by removing the asterisks (wildcards).

styks90 · July 2, 2021, 8:46am

Have you tried adding wildcard specific flags as described in the documentation, e.g.:

allow_leading_wildcard

(Optional, Boolean) If true, the wildcard characters * and ? are allowed as the first character of the query string. Defaults to true.

analyze_wildcard

(Optional, Boolean) If true, the query attempts to analyze wildcard terms in the query string. Defaults to false.
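
A minimal sketch of how these flags could be set on the `query_string` clause from the original request. Only the two flags quoted above come from the documentation excerpt. The `rewrite` setting is an additional assumption: wildcard terms are rewritten with a constant score by default, which would explain scores equal to the field boost, and `scoring_boolean` asks for per-term relevance scores instead. Whether it gives the desired ranking on this index is untested.

```
# Sketch only: "rewrite" is an assumption, not something suggested in this thread.
GET /search/_search?pretty
{
  "query": {
    "query_string": {
      "query": "*cola*",
      "fields": [
        "keywords^2.0",
        "supplier^1.7",
        "headline^1.5",
        "claim^1.2",
        "content^0.6"
      ],
      "allow_leading_wildcard": true,
      "analyze_wildcard": true,
      "rewrite": "scoring_boolean"
    }
  }
}
```

Note that `scoring_boolean` expands the wildcard into one clause per matching term, so a broad pattern such as `*cola*` may exceed the configured clause limit on a large index.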

stefAnger · July 16, 2021, 9:19am

Thank you very much for your reply.
Sadly, the wildcard flags didn't solve the problem.
I did some research, and I think I will try a custom ngram-based analyser to solve this issue.
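
A rough sketch of what such an ngram-based setup could look like. Everything specific here is an assumption for illustration: the index name `search_ngram`, the names `trigram_tokenizer` and `trigram_analyzer`, the gram size of 3, and the choice to map only two of the fields. With fixed-size grams, a plain `multi_match` query on "cola" is analyzed into "col" and "ola", so substring matches are found without wildcards and are scored by the normal relevance model.

```
# Sketch only: index, tokenizer and analyzer names are hypothetical.
PUT /search_ngram
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "trigram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "trigram_analyzer": {
          "type": "custom",
          "tokenizer": "trigram_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "keywords": { "type": "text", "analyzer": "trigram_analyzer" },
      "headline": { "type": "text", "analyzer": "trigram_analyzer" }
    }
  }
}

# Query without wildcards; "operator": "and" requires every gram of the search word to match.
GET /search_ngram/_search
{
  "query": {
    "multi_match": {
      "query": "cola",
      "fields": ["keywords^2.0", "headline^1.5"],
      "operator": "and"
    }
  }
}
```

Search words shorter than the gram size produce no tokens with this tokenizer, so a smaller `min_gram` (together with a suitable `index.max_ngram_diff`) may be needed for short terms, at the cost of a larger index.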

system · August 13, 2021, 9:20am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.