URL similarity scoring

I want to create a search that scores results by the similarity of URLs.

So searching for
https://www.sample.com/abc/def/index.html

in a field containing the values listed below would return them in something like this order:

  1. https://www.sample.com/abc/def/index.html (exact match) score:1
  2. https://www.test.com/abc/def/index.html (different host, path matches) score:.9
  3. https://www.sample.com/abc/xxx/index.html (one part different) score:.8
  4. https://www.sample.com/def/abc/index.html (two parts in wrong order) score:.7
  5. https://www.sample.com/xyz/pdq/def/abc/index.html (several changes but some matching parts) score:.5
  6. https://www.test.com/abc/ (different host, only one path part matches) score: .2
  7. https://test.org/welcome.html (nothing matches) score:0

I explicitly want URLs to score well if the host is different but the rest of the URL is similar. For example, the same application deployed on different hosts should match.

What type of field and search would do this?

Hi @Moverton, there are several things to consider, but as a starting point you should look at creating custom analyzers and storing your data (the URLs) in different formats for different purposes.

Text fields in Elasticsearch are always transformed at index time by an analysis chain that runs Character Filters -> Tokenizer -> Token Filters (see the Elasticsearch text analysis documentation). This process produces the tokens that are stored. When we run a search, the same process is applied to the query input, producing tokens that Elasticsearch compares against the tokens of each indexed document to find matches and compute a score.

By default, text fields use the standard analyzer, and there is a very useful API called Analyze that helps us build an intuition for which tokens are produced. For instance, the following:

GET _analyze
{
  "analyzer": "standard",
  "text": "https://www.sample.com/abc/def/index.html"
}

will show us the terms (tokens) produced for the given text: "https", "www.sample.com", "abc", "def", "index.html". When you search for https://www.sample.com/abc/def/index.html, exactly the same tokens are produced, so the document that contains an exact match gets the highest score. However, you have the specific requirement that a URL with a different host but the same path should still score highly. For that, it's a good idea to add a custom analyzer that removes the protocol and the hostname and emits a single token, the path (/abc/def/index.html). An exact match on the path then raises the document's score regardless of the hostname.
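
As a quick way to try that idea before creating any index, the Analyze API also accepts an inline tokenizer and filter chain. This is only a sketch: the pattern is an assumption that strips an optional http/https scheme plus the host, and it does not handle ports or credentials in the URL.

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "lowercase",
    {
      "type": "pattern_replace",
      "pattern": """^(https?:\/\/)?[^:\/\s]+""",
      "replacement": "",
      "all": false
    }
  ],
  "text": "https://www.test.com/abc/def/index.html"
}

If the filter behaves as intended, the response should contain a single token, /abc/def/index.html, which is the same token the sample.com version of the URL would produce.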

Since your question was very precise about the requirements, I have put together an example that you can use as a starting point. It currently produces the following order, which is pretty close to what you want:

  1. https://www.sample.com/abc/def/index.html (exact match)
  2. https://www.test.com/abc/def/index.html (different host, path matches)
  3. https://www.sample.com/def/abc/index.html (two parts in wrong order)
  4. https://www.sample.com/xyz/pdq/def/abc/index.html (several changes but some matching parts)
  5. https://www.sample.com/abc/xxx/index.html (one part different)
  6. https://www.test.com/abc/ (different host, only one path part matches)
  7. https://test.org/welcome.html (nothing matches) (actually it's still matching because of the protocol)

Copy and paste the code below into Kibana Dev Tools and run the commands in sequence.
It shows how to define a custom analyzer, how to test it with the Analyze API, how to use multi-fields, and how to use multi_match to match the same input against multiple fields.


# Start clean. DELETE returns a 404 on the first run, which can be ignored.
DELETE url-match
PUT url-match
{
  "mappings": {
    "properties": {
      "url": {
        "type": "text",
        "fields": {
          "path": {
            "type": "text",
            "analyzer": "my_url_matcher",
            "similarity": "boolean"
          }
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "filter": {
        "clear_hostname": {
          "type": "pattern_replace",
          "pattern": """((https?):\/)?\/?([^:\/\s]+)""",
          "replacement": "",
          "all": false
        }
      },
      "analyzer": {
        "my_url_matcher": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "clear_hostname"
          ]
        }
      }
    }
  }
}

POST url-match/_bulk
{"index":{}}
{"url":"https://www.sample.com/abc/def/index.html","description":"exact"}
{"index":{}}
{"url":"https://www.test.com/abc/","description":"different host, only one path part matches"}
{"index":{}}
{"url":"https://www.sample.com/abc/ghi/index.html","description":"one part different"}
{"index":{}}
{"url":"https://www.sample.com/xyz/pdq/def/abc/index.html","description":"several changes but some matching parts"}
{"index":{}}
{"url":"https://www.sample.com/def/abc/index.html","description":"two parts in wrong order"}
{"index":{}}
{"url":"https://www.test.com/abc/def/index.html","description":"different host, path matches"}
{"index":{}}
{"url":"https://test.org/welcome.html","description":"nothing matches"}



GET url-match/_search?explain=false
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "https://www.sample.com/abc/def/index.html",
            "fields": [
              "url",
              "url.path^2"
            ],
            "type": "best_fields"
          }
        }
      ]
    }
  }
}


GET url-match/_analyze
{
  "analyzer": "my_url_matcher",
  "text": "https://www.sample.com/abc/def/index.html"
}
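
On the last result in the list above: it still matches only because the standard analyzer emits an https token for every URL. One possible way around that, sketched here under the assumption that the protocol should never contribute to the score, is to give the url field a custom analyzer with a stop filter. The index name url-match-noproto and the names drop_protocol and my_url_terms are made up for this illustration.

PUT url-match-noproto
{
  "settings": {
    "analysis": {
      "filter": {
        "drop_protocol": {
          "type": "stop",
          "stopwords": ["http", "https"]
        }
      },
      "analyzer": {
        "my_url_terms": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "drop_protocol"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "url": {
        "type": "text",
        "analyzer": "my_url_terms"
      }
    }
  }
}

GET url-match-noproto/_analyze
{
  "analyzer": "my_url_terms",
  "text": "https://test.org/welcome.html"
}

The second request is there to check that the protocol token is gone. With the protocol removed, a document that shares neither host nor path parts with the query should no longer be returned at all.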

Excellent. We will work on this strategy and give some feedback. We do want the host to add some score but not as much as the path similarity. We could probably break the host into another field and have that scored independently.
Thanks.
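
A minimal sketch of that host idea, for anyone following along. It assumes a second subfield, url.host, that keeps only the hostname, and it uses most_fields so that the scores of the individual fields are added together instead of only the best one being used. The index name url-match-host, the keep_hostname filter, the my_host_matcher analyzer and the boost values are all hypothetical and would need tuning against real data.

PUT url-match-host
{
  "settings": {
    "analysis": {
      "filter": {
        "clear_hostname": {
          "type": "pattern_replace",
          "pattern": """((https?):\/)?\/?([^:\/\s]+)""",
          "replacement": "",
          "all": false
        },
        "keep_hostname": {
          "type": "pattern_replace",
          "pattern": """^(?:https?:\/\/)?([^:\/\s]+).*$""",
          "replacement": "$1",
          "all": false
        }
      },
      "analyzer": {
        "my_url_matcher": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "clear_hostname"]
        },
        "my_host_matcher": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "keep_hostname"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "url": {
        "type": "text",
        "fields": {
          "path": {
            "type": "text",
            "analyzer": "my_url_matcher",
            "similarity": "boolean"
          },
          "host": {
            "type": "text",
            "analyzer": "my_host_matcher",
            "similarity": "boolean"
          }
        }
      }
    }
  }
}

GET url-match-host/_search
{
  "query": {
    "multi_match": {
      "query": "https://www.sample.com/abc/def/index.html",
      "fields": [
        "url",
        "url.path^2",
        "url.host^0.5"
      ],
      "type": "most_fields"
    }
  }
}

The documents would have to be indexed into the new index first, for example with the same _bulk request as above. Because both subfields use the boolean similarity, a match on each contributes a constant equal to its boost, so the ratio between the two boosts controls how much the host counts relative to the path.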