I have a multilingual, free-text dataset that contains some domain names within the text. I am finding that searching for just the hostname portion of a domain does not yield results, because the standard analyzer does not split tokens on periods. (I think this is explained here: http://unicode.org/reports/tr29/#Word_Boundaries)
I imagine a lot of work has been baked into the standard tokenizer so that it yields the best results on multilingual data, so I don't want to switch tokenizers for this one use case.
One thought I had was to add a mapping char filter that maps periods to spaces, but I can't say I know all of the side effects of this, and I'm wondering if there is a more elegant solution.
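For reference, the char filter setup I have in mind looks roughly like this (a sketch only; the analyzer and filter names are placeholders I made up, and the small `simulate` function just approximates the effect in plain Python rather than calling Elasticsearch):

```python
# Sketch of index settings: a mapping char filter that turns periods
# into spaces, feeding the standard tokenizer. Names like
# "text_with_split_domains" are placeholders, not anything standard.
settings = {
    "analysis": {
        "char_filter": {
            "period_to_space": {
                "type": "mapping",
                # \u0020 is the escape for a space character
                "mappings": [". => \\u0020"],
            }
        },
        "analyzer": {
            "text_with_split_domains": {
                "type": "custom",
                "char_filter": ["period_to_space"],
                "tokenizer": "standard",
                "filter": ["lowercase"],
            }
        },
    }
}

def simulate(text):
    """Rough approximation of the analyzer above: replace periods with
    spaces before tokenizing, so a domain splits into two tokens."""
    return text.replace(".", " ").lower().split()

print(simulate("hostname.com"))  # ['hostname', 'com']
```

One side effect I can already see is that this would also split things like decimal numbers (e.g. 3.14) and dotted abbreviations, which is exactly the kind of consequence I'm unsure about.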
Is there any other way to tweak the tokenizer so that it will split hostname.com into two tokens?