Configuring icu_tokenizer to keep hashtag in token


(Robert Fišer) #1

Hi,
we're using icu_tokenizer to analyze text that may be in many languages. The problem is that the text contains hashtags like #dog, #cat, etc., and icu_tokenizer removes the '#' character from the tokens. So we're not able to find documents that contain exactly '#cat'.
Is there a simple way to make calling _analyze with text:'#cat' produce 2 tokens: ['#cat', 'cat']?
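
For reference, a request along these lines should reproduce what I'm describing. This is only a minimal sketch: it assumes the analysis-icu plugin is installed and uses the tokenizer on its own, with no custom analyzer or filters.

```
POST _analyze
{
  "tokenizer": "icu_tokenizer",
  "text": "#cat"
}
```

As described above, the token that comes back is 'cat' without the '#', whereas what we'd like is both '#cat' and 'cat'.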
Robert


(Robert Fišer) #2

Any idea?

