My application has been tokenizing Twitter-style hashtags with a word_delimiter filter for a long time. For a hashtag like '#SomeHashtag' we expect users might search for either '#SomeHashtag' or 'SomeHashtag'.
For '#SomeHashtag' our analyzer produces the following tokens:
#SomeHashtag, Some, SomeHashtag, Hashtag
For '#Some_Hashtag' (underscores in hashtags are common in some other languages), our analyzer drops the underscore from every generated token; only the original token keeps it:
#Some_Hashtag, Some, SomeHashtag, Hashtag
Here is our analyzer:
"analysis": {
"analyzer": {
"tweet_test": {
"type": "custom",
"char_filter": ["html_strip", "quotes"],
"tokenizer": "standard_custom",
"filter": [ "custom_text_word_delimiter_query"]
}
},
"filter": {
"custom_text_word_delimiter_query": {
"type": "word_delimiter",
"generate_word_parts": "0",
"generate_number_parts": "0",
"catenate_words": "1",
"catenate_numbers": "1",
"catenate_all": "0",
"split_on_case_change": "0",
"split_on_numerics": "0",
"preserve_original": "0",
"type_table": [
"# => ALPHA",
"@ => ALPHA",
"& => ALPHA",
"- => ALPHA",
". => ALPHA",
"/ => ALPHA",
"_ => ALPHA"
]
}
}
}
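To sanity-check the behavior above, the analyzer can be exercised directly with the _analyze API. This is only a sketch: it assumes these settings (including the "quotes" char_filter and "standard_custom" tokenizer, whose definitions aren't shown here) are installed on an index called tweets, and an Elasticsearch version that accepts a JSON body for _analyze:

# "tweets" is a hypothetical index name; "tweet_test" is the analyzer defined above.
curl -s -XGET 'localhost:9200/tweets/_analyze?pretty' -H 'Content-Type: application/json' -d '
{
  "analyzer": "tweet_test",
  "text": "#Some_Hashtag"
}'
# The returned tokens should match the list above: #Some_Hashtag, Some, SomeHashtag, Hashtag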
One solution I'm considering is an additional pattern_capture filter (the actual regular expression Twitter uses is MUCH longer than this):
"hashtag_filter": {
"type" : "pattern_capture",
"preserve_original" : 1,
"patterns" : ["#([^\\s]*)"]
}
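On newer Elasticsearch versions this proposed filter can be tried out ad hoc, without touching the index settings, by defining it inline in an _analyze request. A rough sketch (the whitespace tokenizer here is just a stand-in for our custom tokenizer, which isn't shown):

curl -s -XGET 'localhost:9200/_analyze?pretty' -H 'Content-Type: application/json' -d '
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "pattern_capture",
      "preserve_original": true,
      "patterns": ["#([^\\s]*)"]
    }
  ],
  "text": "#Some_Hashtag"
}'
# With preserve_original set, this should emit both #Some_Hashtag and Some_Hashtag,
# so a search for 'Some_Hashtag' (underscore intact) would have a token to match.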
This application may index several hundred messages per second, so my questions are:
Should I be concerned about the performance of a regular expression match?
Is there an alternative approach that might be less expensive than a regular expression match?
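For what it's worth, one rough way I could check the cost myself is a crude timing loop against _analyze with and without the extra filter (hypothetical index/analyzer names; the absolute numbers are dominated by HTTP and analyze-API overhead, so only the relative difference between the two runs would mean anything):

# Time 1000 analyze calls; repeat against an analyzer that includes hashtag_filter and compare.
time for i in $(seq 1 1000); do
  curl -s -XGET 'localhost:9200/tweets/_analyze' -H 'Content-Type: application/json' \
    -d '{"analyzer": "tweet_test", "text": "#Some_Hashtag and some other #Text_Here"}' > /dev/null
done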