Tokenizing a hashtag


(Winder) #1

I want to create a pattern_capture filter to tokenize a hashtag. For instance #My_Hashtag needs to tokenize to #My_Hashtag and My_Hashtag.

There is a project from Twitter with the regular expression I need, but it is huge (see below). Would running a regular expression this large while processing hundreds of tweets per second be a performance concern?

For reference my filter looks like this:

"hashtag_filter": {
    "type" : "pattern_capture",
    "preserve_original" : true,
    "patterns" : ["giant regular expression here"]
}

(^|[^&\p{L}\p{M}\u037f\u0528-\u052f\u08a0-\u08b2\u08e4-\u08ff\u0978\u0980\u0c00\u0c34\u0c81\u0d01\u0ede\u0edf\u10c7\u10cd\u10fd-\u10ff\u16f1-\u16f8\u17b4\u17b5\u191d\u191e\u1ab0-\u1abe\u1bab-\u1bad\u1bba-\u1bbf\u1cf3-\u1cf6\u1cf8\u1cf9\u1de7-\u1df5\u2cf2\u2cf3\u2d27\u2d2d\u2d66\u2d67\u9fcc\ua674-\ua67b\ua698-
...
many characters removed, total regex is 10-12k characters
...
\ua9f9\ud804\udcf0-\ud804\udcf9\ud804\udd36-\ud804\udd3f\ud804\uddd0-\ud804\uddd9\ud804\udef0-\ud804\udef9\ud805\udcd0-\ud805\udcd9\ud805\ude50-\ud805\ude59\ud805\udec0-\ud805\udec9\ud806\udce0-\ud806\udce9\ud81a\ude60-\ud81a\ude69\ud81a\udf50-\ud81a\udf59_\u200c\u200d\ua67e\u05be\u05f3\u05f4\uff5e\u301c\u309b\u309c\u30a0\u30fb\u3003\u0f0b\u0f0c\u00b7]*)


(Daniel Mitterdorfer) #2

Hi @winder,

That's hard to tell. I'd say a regex is overkill for this, but I suggest you just create a benchmark; then you'll know. If it turns out to be too slow, I'd create a small plugin with a dedicated tokenizer that just iterates over the characters.

Daniel
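
To make the "iterate over the characters" idea concrete, here is a minimal sketch in Python. It is not the Twitter regex and not an Elasticsearch plugin (a real plugin would be written in Java against the Lucene analysis API). The helper names `is_hashtag_char` and `hashtag_tokens` are hypothetical, and the character rule (letters, combining marks, decimal digits, underscore) is a simplifying assumption that ignores the special cases in the full pattern. The timing loop at the end only shows the shape of a benchmark; numbers from Python say nothing about Java regex performance.

```python
import timeit
import unicodedata


def is_hashtag_char(ch: str) -> bool:
    # Assumption: a hashtag body consists of letters (L*), combining
    # marks (M*), decimal digits (Nd), or an underscore. The full Twitter
    # pattern has more special cases than this.
    cat = unicodedata.category(ch)
    return ch == "_" or cat[0] in ("L", "M") or cat == "Nd"


def hashtag_tokens(text: str) -> list:
    """Emit each hashtag twice: with and without the leading '#'."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        # A '#' only starts a hashtag at the beginning of the text or
        # after a character that could not itself be part of a hashtag.
        if text[i] == "#" and (i == 0 or not is_hashtag_char(text[i - 1])):
            j = i + 1
            while j < n and is_hashtag_char(text[j]):
                j += 1
            if j > i + 1:  # at least one body character followed the '#'
                tokens.append(text[i:j])      # original, e.g. "#My_Hashtag"
                tokens.append(text[i + 1:j])  # stripped, e.g. "My_Hashtag"
            i = max(j, i + 1)
        else:
            i += 1
    return tokens


if __name__ == "__main__":
    sample = "Trying out #My_Hashtag and #another1 in one tweet"
    print(hashtag_tokens(sample))
    # Rough timing only; replace the sample with realistic tweets.
    seconds = timeit.timeit(lambda: hashtag_tokens(sample), number=10_000)
    print(f"10,000 runs took {seconds:.3f} s")
```

The same single pass over the input is what a dedicated tokenizer would do; a comparable benchmark for the `pattern_capture` approach would need to run against the actual analyzer with the full regular expression.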

