Index and Search URL

Hello,

I've searched this forum and Google extensively, but I haven't found a proper way to index URLs that also allows partial matching.
My use case: I have documents that contain fully qualified URLs (hostname + path + query string), and I want them to be searchable by either exact or partial match.
For now I've used a keyword mapping, because I don't want the URL to be split into multiple tokens.

# Mapping
{
  "mappings": {
    "properties": {
      "url": {
        "type": "keyword"
      }
    }
  }
}
# Document example
{
  "_source": {
    "url": "https://discuss.elastic.co/t/indexing-and-searching-urls-with-dashes/6977?foo=bar"
  }
}

Let's say I have this URL: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-wildcard-query.html?pid=1234.
I want to find it when I type any of the following:

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-wildcard-query.html?pid=1234
  • www.elastic.co
  • elasticsearch/reference/current
  • pid
  • [anything really ...]

The problem is that the ngrams would be too big to generate (I think); a wildcard query works, but I suppose it's slow.
Is there a good solution for this?
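
To put a rough number on the ngram concern, here is a minimal Python sketch that counts how many character n-grams a single URL value would produce. The function name `ngram_count` and the gram sizes used are illustrative assumptions, not Elasticsearch settings taken from this thread; the count includes duplicates and says nothing about the final index size.

```python
def ngram_count(text: str, min_gram: int, max_gram: int) -> int:
    """Count the character n-grams (duplicates included) for one value.

    For each gram size k there are len(text) - k + 1 grams,
    or none at all when the text is shorter than k.
    """
    n = len(text)
    return sum(max(0, n - k + 1) for k in range(min_gram, max_gram + 1))


url = (
    "https://www.elastic.co/guide/en/elasticsearch/reference/current/"
    "query-dsl-wildcard-query.html?pid=1234"
)

# Compare a single gram size (hypothetical trigrams) with a wide range.
print("length:", len(url))
print("3-grams only:", ngram_count(url, 3, 3))
print("3..20-grams:", ngram_count(url, 3, 20))
```

With a single gram size the count grows linearly with the URL length; it is the wide min/max range that multiplies the number of tokens per document.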

Thanks,

Maybe using a path tokenizer would help?

Hey,

I thought about that, but I think it would only work for querying or aggregating on a specific part of the path (not sure).
I don't think it would work when searching for something like elastic.co/foo?bar

The wildcard query returns results in a few seconds for 8 million rows, and in a few hundred milliseconds once cached. The machine is not very big, so I guess I could throw more CPU/memory at it, but I'm sad that there is no good way to do this, especially considering that a lot of people use ELK to store web server logs.

At some point I think it's just simpler to pre-parse this into different fields. Each component of the URL probably has a different priority in your ranking. So I would deconstruct the URL at index and query time and search each component separately, as different fields.
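
As a rough illustration of that pre-parsing step, here is a minimal Python sketch using the standard library's `urllib.parse`. The function name `url_to_fields` and the field names (`url_host`, `url_path_segments`, and so on) are made up for this example and are not an Elasticsearch convention; each field would still need its own mapping.

```python
from urllib.parse import parse_qsl, urlsplit


def url_to_fields(url: str) -> dict:
    """Split a fully qualified URL into separate fields for indexing.

    Field names are hypothetical; adapt them to your own mapping.
    """
    parts = urlsplit(url)
    return {
        "url": url,  # keep the original value for exact matches
        "url_scheme": parts.scheme,
        "url_host": parts.hostname or "",
        "url_path": parts.path,
        # Non-empty path segments, in order.
        "url_path_segments": [s for s in parts.path.split("/") if s],
        "url_query": parts.query,
        # Parameter names only, including parameters with no value.
        "url_query_keys": [
            key for key, _ in parse_qsl(parts.query, keep_blank_values=True)
        ],
    }


doc = url_to_fields(
    "https://www.elastic.co/guide/en/elasticsearch/reference/current/"
    "query-dsl-wildcard-query.html?pid=1234"
)
print(doc["url_host"], doc["url_path_segments"], doc["url_query_keys"])
```

This only covers the index-time side for fully qualified URLs. Query-time input such as elastic.co/foo?bar has no scheme, so `urlsplit` would treat the whole thing as a path; deciding which field a partial input belongs to needs its own handling.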

