Exact match of keywords & score weight keywords


(Bas) #1

Two small questions:

  1. The first document matches on "java-8", but I only want to match on "java". The same goes for the second document with "docker".
    How can I filter on the exact keywords? Should I change the mapping, or is it possible to fix this inside the search query?

  2. It looks like the score is also affected by the number of (non-matching) tags. Is it possible to exclude the non-matching tags from the score?
    For me it is fine if ['php','java','x'] and ['php','java','x','y','z'] have exactly the same score.

Request

GET localhost:9200/profile/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "tags": "docker"
          }
        },
        {
          "term": {
            "tags": "php"
          }
        },
        {
          "term": {
            "tags": "java"
          }
        }
      ]
    }
  }
}

Response

[
  {
    "_score": 10.028147,
    "_source": {
      "name": "Harald",
      "tags": [
        "docker",
        "java-8",
        "website"
      ]
    }
  },
  {
    "_score": 9.958822,
    "_source": {
      "name": "Alex",
      "tags": [
        "java",
        "spring",
        "yaml",
        "spring-boot",
        "website",
        "docker",
        "docker-compose",
        "docker-machine"
      ]
    }
  },
  {
    "_score": 9.919757,
    "_source": {
      "name": "Fleming",
      "tags": [
        "java",
        "docker",
        "dockerfile",
        "website"
      ]
    }
  },
  {
    "_score": 9.911845,
    "_source": {
      "name": "Galley",
      "tags": [
        "php",
        "html",
        "docker",
        "website"
      ]
    }
  }
]

(Abdon Pijpelink) #2

What you're seeing is probably caused by the tags field being mapped as a text field. Values in text fields go through a process called text analysis, which splits strings into separate terms on whitespace and on punctuation such as the hyphen (-). With the default analyzer, the tag "java-8" is therefore indexed as the terms "java" and "8", which is why a query for "java" matches it.
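To make the effect of analysis concrete, here is a minimal Python sketch. It does not call Elasticsearch: approx_standard_tokens is a hypothetical helper that only approximates the default analyzer (lowercase, split on anything that is not a letter or digit), and term_matches is a hypothetical stand-in for how a term query looks up a value among the indexed terms.

```python
import re


def approx_standard_tokens(text):
    """Rough approximation of the default analyzer (an assumption, not
    Elasticsearch code): lowercase, then split on non-alphanumerics."""
    return [tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok]


def term_matches(query_term, tags, analyzed):
    """Return True if query_term appears verbatim among the indexed terms.

    analyzed=True mimics a text field (each tag is split into tokens),
    analyzed=False mimics a keyword field (each tag is indexed as-is).
    """
    if analyzed:
        indexed = {tok for tag in tags for tok in approx_standard_tokens(tag)}
    else:
        indexed = set(tags)
    return query_term in indexed


harald = ["docker", "java-8", "website"]  # tags of the first hit in the thread

print(approx_standard_tokens("java-8"))               # ['java', '8']
print(term_matches("java", harald, analyzed=True))    # True  (text field)
print(term_matches("java", harald, analyzed=False))   # False (keyword field)
```

Under these assumptions, "java" is found in Harald's document only when the tags are analyzed, which mirrors the unwanted match in the original response.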

If you want to search for exact values, you will need to use a keyword field instead. If you have gone with the default mapping in Elasticsearch, there should already be a field called tags.keyword that will give you exactly what you need. You would query that keyword field like this:

GET localhost:9200/profile/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "tags.keyword": "docker"
          }
        },
        {
          "term": {
            "tags.keyword": "php"
          }
        },
        {
          "term": {
            "tags.keyword": "java"
          }
        }
      ]
    }
  }
}

With regard to scoring: in this case you probably do not want the default BM25 scoring mechanism. If you wrap each of your term queries in a constant_score query, the score will depend purely on the number of matching tags:

GET profile/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "constant_score": {
            "filter": {
              "term": {
                "tags.keyword": "docker"
              }
            }
          }
        },
        {
          "constant_score": {
            "filter": {
              "term": {
                "tags.keyword": "php"
              }
            }
          }
        },
        {
          "constant_score": {
            "filter": {
              "term": {
                "tags.keyword": "java"
              }
            }
          }
        }
      ]
    }
  }
}
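The arithmetic behind that query can be sketched in a few lines of Python. This is an illustration only, not Elasticsearch code: constant_score_total is a hypothetical helper, and it assumes that each matching constant_score clause contributes the default boost of 1.0 and that the bool query adds up the scores of its matching should clauses.

```python
def constant_score_total(query_tags, doc_tags, boost=1.0):
    """Sum a fixed boost for every queried tag the document contains.

    Non-matching tags in the document are never looked at, so they
    cannot influence the result.
    """
    doc = set(doc_tags)
    return sum(boost for tag in query_tags if tag in doc)


query = ["docker", "php", "java"]

# The two tag lists from the question: same matches, different lengths.
short_doc = ["php", "java", "x"]
long_doc = ["php", "java", "x", "y", "z"]

print(constant_score_total(query, short_doc))  # 2.0
print(constant_score_total(query, long_doc))   # 2.0

# Harald's tags from the response, treated as exact keyword values.
print(constant_score_total(query, ["docker", "java-8", "website"]))  # 1.0
```

Under these assumptions both lists from the question get the same score, and Harald's document only scores for "docker" once "java-8" no longer counts as a match for "java".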

(Bas) #3

Thanks a lot for this clear explanation Abdon!

I'm going to change the mapping for the tags field to use keyword instead of text, and I'll also use constant_score in the search query. Thanks!