Stop standard tokenizer from splitting on punctuation

Hi

We use the "standard" tokenizer in custom analyzer definitions. By default, the standard tokenizer splits words on hyphens and ampersands, so for example "i-mac" is tokenized to "i" and "mac".

Is there any way to configure the behaviour of the standard tokenizer so that it stops splitting words on all punctuation except the comma (","), while still doing all the other tokenizing it does?

Or maybe we could define a custom tokenizer to achieve this (a possible sketch follows the examples below). For example, the query string "n-12" should produce this single token:

{
  "tokens" : [
    {
      "token" : "n-12",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
} 
instead of these two tokens:
{
    "tokens" : [
      {
        "token" : "n",
        "start_offset" : 0,
        "end_offset" : 1,
        "type" : "<ALPHANUM>",
        "position" : 0
      },
      {
        "token" : "12",
        "start_offset" : 0,
        "end_offset" : 2,
        "type" : "<ALPHANUM>",
        "position" : 0
      }
    ]
}
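
One approach that might work is a custom analyzer built on the char_group tokenizer, configured to split only on whitespace and commas. This is a sketch, not a tested solution; the index, analyzer, and tokenizer names below are placeholders:

PUT my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "keep_hyphens_analyzer": {
          "type": "custom",
          "tokenizer": "split_on_comma_and_space",
          "filter": [ "lowercase" ]
        }
      },
      "tokenizer": {
        "split_on_comma_and_space": {
          "type": "char_group",
          "tokenize_on_chars": [ "whitespace", "," ]
        }
      }
    }
  }
}

Note that char_group splits only on the listed characters, so it would not reproduce the rest of the standard tokenizer's Unicode segmentation rules, and the resulting token types come back as "word" rather than "<ALPHANUM>".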

Regards,
Nipun
