I have a multilingual, free-text dataset that contains some domain names within the text. I am finding that searching for just the hostname portion of a domain does not yield results, because the standard analyzer does not split tokens on periods. (I think this is explained here: http://unicode.org/reports/tr29/#Word_Boundaries)
I imagine a lot of work has been baked into the standard tokenizer so that it yields the best results on multilingual data, so I don't want to switch tokenizers for this one use case.
One thought I had was to add a mapping char filter that maps periods to spaces, but I can't say I know all of the side effects of this, and I'm wondering if there is a more elegant solution.
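For reference, the char filter setup I have in mind looks roughly like this (a sketch only; the analyzer and filter names are placeholders I made up, and the small `simulate` function just approximates the effect in plain Python rather than calling Elasticsearch):

```python
# Sketch of index settings: a mapping char filter that turns periods
# into spaces, feeding the standard tokenizer. Names like
# "text_with_split_domains" are placeholders, not anything standard.
settings = {
    "analysis": {
        "char_filter": {
            "period_to_space": {
                "type": "mapping",
                # \u0020 is the escape for a space character
                "mappings": [". => \\u0020"],
            }
        },
        "analyzer": {
            "text_with_split_domains": {
                "type": "custom",
                "char_filter": ["period_to_space"],
                "tokenizer": "standard",
                "filter": ["lowercase"],
            }
        },
    }
}

def simulate(text):
    """Rough approximation of the analyzer above: replace periods with
    spaces before tokenizing, so a domain splits into two tokens."""
    return text.replace(".", " ").lower().split()

print(simulate("hostname.com"))  # ['hostname', 'com']
```

One side effect I can already see is that this would also split things like decimal numbers (e.g. 3.14) and dotted abbreviations, which is exactly the kind of consequence I'm unsure about.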
Is there any other way to tweak the tokenizer so that it will split hostname.com into two tokens?