Filtering for wildcard domains

Hi there,

I need to write a query that finds documents where one field (mapped as text, with a keyword sub-field also available) matches a list of wildcard domains. I know this is sub-optimal, but it's what I have to work with. Since I can't filter on exact terms, I don't see any option other than a long query string with wildcards. Something like this:

      "query_string": {
        "query": "*third-level.second.jp OR *anotherone.it",
        "analyze_wildcard": true,
        "default_field": "host.keyword"
      }

Any suggestions or help would be much appreciated!
Thanks!

Searching for ends of strings is expensive because it means we can't efficiently accelerate lookups using the index (which stores terms alphabetically based on the start of the string). We end up scanning all index entries.
A way to counter that is to store a version of the strings which is reversed. It's a little clunky but here's an example mapping/doc/query that should be more efficient:

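# Start from a clean index (a 404 here just means it did not exist yet)
DELETE test

# Create the index: "domain" is a keyword field with a "reversed" sub-field
# whose analyzer emits the whole value as a single token, reversed
PUT test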
{
  "settings": {
    "analysis": {
      "analyzer": {
        "reverser": {
          "tokenizer": "keyword",
          "filter": [
            "reverse"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "domain": {
        "type": "keyword",
        "fields": {
          "reversed": {
            "type": "text",
            "analyzer": "reverser"
          }
        }
      }
    }
  }
}
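
# Check the analyzer: "www.foo.com" should come back as the single token "moc.oof.www"
POST test/_analyze
{
  "field": "domain.reversed",
  "text": ["www.foo.com"]
}

# Index a sample document; refresh=true makes it searchable immediately
POST test/_doc/1?refresh=true
{
  "domain": "www.foo.com"
}

# The query text is reversed by the same analyzer, so this is a prefix match on "moc.oof"
POST test/_search
{
  "query": {
    "match_phrase_prefix": {"domain.reversed": "foo.com"}
  }
}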
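
To see why the reversed field helps, here is a small Python sketch of the idea. It is only an illustration, not how Lucene is implemented: a sorted list stands in for the term dictionary, and the function name suffix_matches is made up for this example. Once the strings are reversed, "ends with foo.com" becomes "starts with moc.oof", and a sorted structure can jump straight to that range instead of scanning every entry.

```python
from bisect import bisect_left


def suffix_matches(hosts, suffix):
    """Return hosts ending in `suffix`, found via a prefix scan over reversed terms."""
    # Stand-in for the index's sorted term dictionary, built from reversed strings.
    reversed_terms = sorted(host[::-1] for host in hosts)
    prefix = suffix[::-1]

    # Binary search to the first term that could start with the prefix.
    start = bisect_left(reversed_terms, prefix)

    matches = []
    for term in reversed_terms[start:]:
        if not term.startswith(prefix):
            break  # terms are sorted, so no later term can match
        matches.append(term[::-1])  # un-reverse for display
    return matches


hosts = ["www.foo.com", "mail.foo.com", "foo.org", "bar.com"]
print(suffix_matches(hosts, "foo.com"))
```

Note that a plain suffix such as foo.com would also match a host like notfoo.com; if that matters, include the leading dot in the suffix and handle the bare domain separately.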

For multiple clauses you would use a bool query with multiple of the match_phrase_prefix clauses inside the should clause.
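
As a rough sketch of that bool query, assuming you build the request body in Python before sending it with whatever client you use: the helper name build_suffix_query is hypothetical, and the field name domain.reversed comes from the example mapping above.

```python
import json


def build_suffix_query(domains, field="domain.reversed"):
    """Build a search body with one match_phrase_prefix clause per domain suffix."""
    should = [{"match_phrase_prefix": {field: domain}} for domain in domains]
    return {
        "query": {
            "bool": {
                "should": should,
                # A document must match at least one of the suffixes.
                "minimum_should_match": 1,
            }
        }
    }


body = build_suffix_query(["third-level.second.jp", "anotherone.it"])
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted as the body of a POST test/_search request.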

Another approach is to use the new wildcard field type but, like most things, this has trade-offs. These were discussed here, where the original question concerned exactly the same problem (searching for the ends of domain names).


Thank you Mark for this!
That's easily implementable. Is there any limit for clauses? I have a handful of edge cases where I need to match hundreds of domains in the target index.

Edit: Could you please tell me why you used "type": "binary" in the mapping?

Best Regards,
YvorL

Edit: Could you please tell me why you used "type": "binary" in the mapping?

Oops. Not sure how that happened. That was supposed to be keyword, but it is irrelevant for the purposes of the example because we only query the "reversed" subfield. You'd probably want to do any aggregation on the containing keyword field though.

That's easily implementable. Is there any limit for clauses?

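Yep, by default a query is limited to 1024 clauses (the indices.query.bool.max_clause_count setting). With more domains than that, you'd need to break your search into multiple requests.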
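
One way to split the work is sketched below in Python. The helper names are hypothetical, and the sketch assumes each match_phrase_prefix counts as a single clause against the limit; the effective limit may vary with version and settings, so the batch size is a parameter rather than hard-coded.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_batched_queries(domains, max_clauses=1024, field="domain.reversed"):
    """Return one search body per batch, each with at most `max_clauses` clauses."""
    bodies = []
    for batch in chunked(domains, max_clauses):
        should = [{"match_phrase_prefix": {field: domain}} for domain in batch]
        bodies.append({
            "query": {"bool": {"should": should, "minimum_should_match": 1}}
        })
    return bodies


# Made-up domain list, purely to show the batch sizes.
domains = [f"site{i}.example.com" for i in range(2500)]
bodies = build_batched_queries(domains)
print([len(b["query"]["bool"]["should"]) for b in bodies])
```

Each body is sent as its own search request. Because one document can match suffixes that land in different batches, merge the hits by _id when combining the results.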

