The standard tokenizer follows Unicode Standard Annex #29 and doesn't really have any settings besides version and max_token_length. I am not sure what your use case is, but one possible solution that comes to mind would be to replace hyphens and ampersands with characters that don't cause words to be split, or just remove them. For example, you can replace hyphens with underscores and remove ampersands. Another interesting replacement character is ".".
You can achieve it using the Mapping Char Filter:
http://www.elasticsearch.org/guide/reference/index-modules/analysis/mapping-charfilter.html
$ curl -XPUT http://localhost:9200/test -d '{
    "settings" : {
        "index" : {
            "number_of_shards" : 5,
            "number_of_replicas" : 0,
            "analysis" : {
                "char_filter" : {
                    "my_filter" : {
                        "type" : "mapping",
                        "mappings" : ["'\''=>", "-=>_"]
                    }
                },
                "analyzer" : {
                    "my_standard" : {
                        "type" : "custom",
                        "tokenizer" : "standard",
                        "char_filter" : ["my_filter"],
                        "filter" : ["standard", "lowercase", "stop"]
                    }
                }
            }
        }
    }
}'
{"ok":true,"acknowledged":true}
$ curl "localhost:9200/test/_analyze?analyzer=my_standard&pretty=true" -d
"Robin's i-mac is nice."
{
    "tokens" : [ {
        "token" : "robins",
        "start_offset" : 0,
        "end_offset" : 7,
        "type" : "<ALPHANUM>",
        "position" : 1
    }, {
        "token" : "i_mac",
        "start_offset" : 8,
        "end_offset" : 13,
        "type" : "<ALPHANUM>",
        "position" : 2
    }, {
        "token" : "nice",
        "start_offset" : 17,
        "end_offset" : 21,
        "type" : "<ALPHANUM>",
        "position" : 4
    } ]
}
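If you also want to keep ampersands from splitting words, you can extend the same char filter. A minimal sketch (the exact mapping values are up to you; mapping "&" to "." is just the idea mentioned above, since "." between letters doesn't trigger a word break under UAX #29):

    "char_filter" : {
        "my_filter" : {
            "type" : "mapping",
            "mappings" : ["'=>", "-=>_", "&=>."]
        }
    }

(Inside the single-quoted curl body the apostrophe needs the same '\'' escaping as in the full example above.) With that mapping, something like "black&decker" should come out as the single token "black.decker" instead of being split into "black" and "decker".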
On Thursday, August 9, 2012 7:09:12 AM UTC-4, Robin Hughes wrote:
Hi
We use the "standard" tokenizer in custom analyzer definitions. By default
the standard tokenizer splits words on hyphens and ampersands, so for
example "i-mac" is tokenized to "i" and "mac"
Is there any way to configure the behaviour of the standard tokenizer to
stop it splitting words on hyphens and ampersands, while still doing all
the normal tokenizing it does on other punctuation?
We could define our own "Pattern" tokenizer using a regex but it seems
like any custom regex we write is unlikely to deal with the myriad Unicode
characters the standard tokenizer handles already.
Regards,
Robin