Creating an analyzer that works with # and @ in the following manner

If I index a tweet body with the following text:

"Hey @joesmith are you watching #gameofthrones?"

I would like the tokenizer to create the following tokens from @joesmith and #gameofthrones:

@joesmith
joesmith
#gameofthrones
gameofthrones

If someone searches for gameofthrones, it should match the above tweet text. If they search for #gameofthrones, it should also match. The same behavior should apply when searching for joesmith or @joesmith.

Is this possible? Which tokenizer / analyzer should I be looking at using?

Hi,

The simplest way to do this is to use a whitespace analyzer to get @joesmith, and a second analyzer whose tokenizer is defined with "discard_punctuation" : "true" to get joesmith.
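To make the idea concrete, here is a rough Python sketch of what the two token streams would look like for the example tweet. This is only a simulation, not Elasticsearch code: the function names (whitespace_tokens, stripped_tokens, searchable_terms) are made up for illustration, and the sketch also lowercases terms and trims trailing punctuation such as "?", which a plain whitespace split would not do.

```python
import string


def whitespace_tokens(text):
    """Split on whitespace only, so leading '@' and '#' are kept."""
    return text.split()


def stripped_tokens(text):
    """Split on whitespace, then drop leading/trailing punctuation."""
    tokens = (tok.strip(string.punctuation) for tok in text.split())
    return [tok for tok in tokens if tok]


def searchable_terms(text):
    """Union of both token streams, as if the text were indexed twice."""
    terms = set()
    for tok in whitespace_tokens(text):
        # Assumption of this sketch: trim trailing punctuation ('?', '!')
        # but keep the leading '@' or '#'.
        trimmed = tok.rstrip(string.punctuation)
        if trimmed:
            terms.add(trimmed.lower())
    for tok in stripped_tokens(text):
        terms.add(tok.lower())
    return terms


tweet = "Hey @joesmith are you watching #gameofthrones?"
terms = searchable_terms(tweet)

# Each of these queries should find the tweet.
for query in ["joesmith", "@joesmith", "gameofthrones", "#gameofthrones"]:
    print(query, query.lower() in terms)
```

In a real index the two streams would typically live in two differently analyzed fields (or sub-fields) of the same text, with the search querying both.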

There are more resources here:
