Standard tokenizer documentation doesn't match behavior


(George Sakkis) #1

Am I reading it wrong or has something changed recently? From
http://www.elasticsearch.org/guide/reference/index-modules/analysis/standard-tokenizer.html:

"[A standard tokenizer] also splits words at hyphens, unless there’s a
number in the token, in which case the whole token is interpreted as a
product number and is not split."

curl -XGET 'http://localhost:9200/twitter/_analyze/?pretty=true&analyzer=standard' -d '123-456-7890'
{
  "tokens" : [ {
    "token" : "123",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "",
    "position" : 1
  }, {
    "token" : "456",
    "start_offset" : 4,
    "end_offset" : 7,
    "type" : "",
    "position" : 2
  }, {
    "token" : "7890",
    "start_offset" : 8,
    "end_offset" : 12,
    "type" : "",
    "position" : 3
  } ]
}

"It recognizes email addresses and internet hostnames as one token."

curl -XGET 'http://localhost:9200/twitter/_analyze/?pretty=true&analyzer=standard' -d 'somebody@example.com'
{
  "tokens" : [ {
    "token" : "somebody",
    "start_offset" : 0,
    "end_offset" : 8,
    "type" : "",
    "position" : 1
  }, {
    "token" : "example.com",
    "start_offset" : 9,
    "end_offset" : 20,
    "type" : "",
    "position" : 2
  } ]
}


(Shay Banon) #2

You are right, the documentation describes the old behavior of the
tokenizer (though you can retain the email behavior by using the
uax_url_email tokenizer). I will fix the documentation shortly.
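
For anyone who wants to try the email-preserving tokenizer mentioned above, here is a minimal sketch using the same request style as the commands in this thread. The index name (tokenizer_test) and analyzer name (email_aware) are made up for illustration, and the lowercase filter is an optional extra. It assumes a node on localhost:9200 and a version that ships the uax_url_email tokenizer; newer versions expect a different request format for _analyze.

```shell
# Assumed values: change these for your own cluster.
ES_URL="${ES_URL:-http://localhost:9200}"
INDEX="tokenizer_test"   # hypothetical index name

# Index settings defining a custom analyzer (hypothetical name: email_aware)
# built on the uax_url_email tokenizer instead of the standard tokenizer.
SETTINGS='{
  "settings": {
    "analysis": {
      "analyzer": {
        "email_aware": {
          "type": "custom",
          "tokenizer": "uax_url_email",
          "filter": ["lowercase"]
        }
      }
    }
  }
}'

# Create the index with the analyzer above.
create_index() {
  curl -XPUT "$ES_URL/$INDEX" -d "$SETTINGS"
}

# Run the given text through the custom analyzer and print the tokens.
analyze() {
  curl -XGET "$ES_URL/$INDEX/_analyze?pretty=true&analyzer=email_aware" -d "$1"
}
```

Run create_index once, then analyze 'somebody@example.com'. If the tokenizer behaves as described above, the address should come back as one token rather than being split at the @ sign.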

On Thu, Jan 19, 2012 at 3:05 PM, George Sakkis <george.sakkis@gmail.com> wrote:


