Tokenizing a hashtag

(Winder) #1

I want to create a pattern_capture filter to tokenize a hashtag. For instance, #My_Hashtag should produce the tokens #My_Hashtag and My_Hashtag.

There is a project from Twitter with the regular expression I need, but it is huge (see below). Would running such an enormous regular expression while processing hundreds of tweets per second be a performance concern?

For reference my filter looks like this:

"hashtag_filter": {
    "type" : "pattern_capture",
    "preserve_original" : true,
    "patterns" : ["giant regular expression here"]
}

(Many characters removed; the full regex is 10-12k characters.)

(Daniel Mitterdorfer) #2

Hi @winder,

That's hard to tell. I'd say a regex is overkill for this, but I suggest you simply create a benchmark; then you'll know. If it turns out to be too slow, I'd write a small plugin with a dedicated tokenizer that just iterates over the characters.
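
To make the "iterate over the characters" idea and the benchmark concrete, here is a rough sketch in Python. It is not an Elasticsearch plugin (a real one would be written in Java against Lucene's analysis classes); it only illustrates the scanning logic and a way to time it. The names hashtag_tokens, regex_tokens and SIMPLE_PATTERN are made up for this example, and the rule "letters, digits, or underscore" is an assumption that is much narrower than what Twitter's full regex accepts.

```python
import re
import timeit
from typing import List


def is_hashtag_char(ch: str) -> bool:
    # Assumption: a hashtag body is letters, digits, or underscore.
    # Twitter's real rules are broader, which is why its regex is so large.
    return ch.isalnum() or ch == "_"


def hashtag_tokens(text: str) -> List[str]:
    """Scan once over the characters; emit each hashtag with and without '#'."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        if text[i] == "#":
            j = i + 1
            # Advance j to the end of the hashtag body.
            while j < n and is_hashtag_char(text[j]):
                j += 1
            if j > i + 1:  # at least one character follows '#'
                tokens.append(text[i:j])      # original form, e.g. "#My_Hashtag"
                tokens.append(text[i + 1:j])  # form without '#', e.g. "My_Hashtag"
            i = j
        else:
            i += 1
    return tokens


# A deliberately small regex for comparison, NOT the Twitter expression.
SIMPLE_PATTERN = re.compile(r"#(\w+)")


def regex_tokens(text: str) -> List[str]:
    tokens = []
    for match in SIMPLE_PATTERN.finditer(text):
        tokens.append(match.group(0))
        tokens.append(match.group(1))
    return tokens


if __name__ == "__main__":
    sample = "Loving the new release #My_Hashtag #elasticsearch today"
    print(hashtag_tokens(sample))
    # Time both approaches on the same input; absolute numbers depend on the machine.
    for fn in (hashtag_tokens, regex_tokens):
        seconds = timeit.timeit(lambda: fn(sample), number=100_000)
        print(f"{fn.__name__}: {seconds:.3f}s for 100,000 calls")
```

For the input #My_Hashtag, hashtag_tokens returns #My_Hashtag and My_Hashtag, which matches the behaviour asked for above. To judge the real filter you would swap the full Twitter expression into the comparison and run it against representative tweets, ideally inside Elasticsearch itself, since Java's regex engine will behave differently from Python's.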

