Tokenizing a Hashtag containing underscores


(Winder) #1

My application has been tokenizing Twitter-style hashtags with a word_delimiter filter for a long time. For a hashtag '#SomeHashtag', we expect that users might search for either '#SomeHashtag' or 'SomeHashtag'.

For '#SomeHashtag' our analyzer produces the following tokens:
#SomeHashtag, Some, SomeHashtag, Hashtag

For '#Some_Hashtag', a form that is common in some languages, our analyzer removes the underscore from every token except the original, so no 'Some_Hashtag' token is produced:
#Some_Hashtag, Some, SomeHashtag, Hashtag

Here is our analyzer:

"analysis": {
  "analyzer": {
    "tweet_test": {
      "type": "custom",
      "char_filter": ["html_strip", "quotes"],
      "tokenizer": "standard_custom",
      "filter": [ "custom_text_word_delimiter_query"]
    }
  },
  "filter": {
    "custom_text_word_delimiter_query": {
      "type": "word_delimiter",
      "generate_word_parts": "0",
      "generate_number_parts": "0",
      "catenate_words": "1",
      "catenate_numbers": "1",
      "catenate_all": "0",
      "split_on_case_change": "0",
      "split_on_numerics": "0",
      "preserve_original": "0",
      "type_table": [
          "# => ALPHA",
          "@ => ALPHA",
          "& => ALPHA",
          "- => ALPHA",
          ". => ALPHA",
          "/ => ALPHA",
          "_ => ALPHA"
      ]
    }
  }
}

One solution I'm considering is an additional pattern_capture filter (the actual regular expression Twitter uses is MUCH longer than this):

"hashtag_filter": {
    "type" : "pattern_capture",
    "preserve_original" : 1,
    "patterns" : ["#([^\\s]*)"]
}
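
To make the intent of that filter concrete, here is a rough Python approximation of what I understand pattern_capture to do with this pattern: keep the original token and also emit the text captured after the '#'. The function name `pattern_capture` and the de-duplication step are my own simplifications for illustration. This is not the Lucene implementation, and it ignores token positions and offsets.

```python
import re

# Same simplified pattern as in the filter definition above.
HASHTAG = re.compile(r"#([^\s]*)")


def pattern_capture(tokens, patterns=(HASHTAG,), preserve_original=True):
    """Approximate a capture-group token filter (illustrative only).

    For each incoming token, optionally keep the token itself, then emit
    every non-empty capture group found by each pattern.
    """
    out = []
    for token in tokens:
        emitted = [token] if preserve_original else []
        for pattern in patterns:
            for match in pattern.finditer(token):
                for group in match.groups():
                    # Skip empty captures and anything already emitted.
                    if group and group not in emitted:
                        emitted.append(group)
        out.extend(emitted)
    return out


if __name__ == "__main__":
    print(pattern_capture(["#Some_Hashtag", "plain"]))
```

With this behavior, '#Some_Hashtag' would yield both '#Some_Hashtag' and 'Some_Hashtag', which are the two forms users are expected to search for.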

This application may index several hundred messages per second, so my questions are:

Should I be concerned about the performance of a regular expression match?
Is there an alternative approach that might be less expensive than a regular expression match?
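
To illustrate the kind of cheaper alternative I have in mind for the second question, here is a minimal Python sketch that replaces the regular expression with a plain prefix check. The names `expand_hashtag` and `expand_all` are hypothetical, and how this logic would be plugged into Elasticsearch (for example as a custom token filter) is left open. It only shows the token-level behavior I would want.

```python
def expand_hashtag(token):
    """Return the token plus, for hashtags, the text without the leading '#'."""
    # A bare '#' has nothing after it, so it is passed through unchanged.
    if len(token) > 1 and token.startswith("#"):
        return [token, token[1:]]
    return [token]


def expand_all(tokens):
    """Apply expand_hashtag to every token, preserving order."""
    out = []
    for token in tokens:
        out.extend(expand_hashtag(token))
    return out


if __name__ == "__main__":
    print(expand_all(["#Some_Hashtag", "hello"]))
```

This does a constant-time prefix test plus one substring copy per hashtag token, so it avoids running a regex matcher on every token. It assumes the tokenizer has already isolated the hashtag as a single token.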

