Am I reading it wrong or has something changed recently? From
http://www.elasticsearch.org/guide/reference/index-modules/analysis/standard-tokenizer.html:
"[A standard tokenizer] also splits words at hyphens, unless there’s a
number in the token, in which case the whole token is interpreted as a
product number and is not split."
curl -XGET 'http://localhost:9200/twitter/_analyze?pretty=true&analyzer=standard' -d '123-456-7890'
{
  "tokens" : [ {
    "token" : "123",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "",
    "position" : 1
  }, {
    "token" : "456",
    "start_offset" : 4,
    "end_offset" : 7,
    "type" : "",
    "position" : 2
  }, {
    "token" : "7890",
    "start_offset" : 8,
    "end_offset" : 12,
    "type" : "",
    "position" : 3
  } ]
}
"It recognizes email addresses and internet hostnames as one token."
curl -XGET 'http://localhost:9200/twitter/_analyze?pretty=true&analyzer=standard' -d 'somebody@example.com'
{
  "tokens" : [ {
    "token" : "somebody",
    "start_offset" : 0,
    "end_offset" : 8,
    "type" : "",
    "position" : 1
  }, {
    "token" : "example.com",
    "start_offset" : 9,
    "end_offset" : 20,
    "type" : "",
    "position" : 2
  } ]
}
kimchy (Shay Banon), January 19, 2012, 6:36pm
You are right, the documentation describes the old behavior of the tokenizer
(though you can retain the email behavior by using the uax_url_email
tokenizer). I will fix the docs shortly.
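To make the suggestion above concrete, here is a minimal sketch of an index whose analyzer uses the uax_url_email tokenizer, written in the same curl style as the question. The index name `emailtest`, the analyzer name `email_aware`, and the helper functions are hypothetical and only for illustration; the node address is assumed to be the same localhost:9200 used above. It follows the request style of this thread, so newer Elasticsearch versions may expect a JSON request body and a Content-Type header instead.

```shell
#!/bin/sh
# Sketch only: assumes a node reachable at localhost:9200.
# "emailtest" and "email_aware" are hypothetical names, not from the thread.
ES_URL="${ES_URL:-http://localhost:9200}"
INDEX="emailtest"

# Index settings defining a custom analyzer on top of the
# uax_url_email tokenizer, followed by a lowercase filter.
SETTINGS='{
  "settings": {
    "analysis": {
      "analyzer": {
        "email_aware": {
          "type": "custom",
          "tokenizer": "uax_url_email",
          "filter": ["lowercase"]
        }
      }
    }
  }
}'

# Create the index with the settings above.
create_index() {
  curl -XPUT "$ES_URL/$INDEX" -d "$SETTINGS"
}

# Run the custom analyzer on the text passed as the first argument.
analyze_text() {
  curl -XGET "$ES_URL/$INDEX/_analyze?pretty=true&analyzer=email_aware" -d "$1"
}

# Usage (requires a running node):
#   create_index
#   analyze_text 'somebody@example.com'
```

If the tokenizer behaves as described in the reply, the address should come back as one token rather than being split at the `@`. Nothing here changes how '123-456-7890' is tokenized; the reply only mentions the email case.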
(In reply to George Sakkis george.sakkis@gmail.com, Thu, Jan 19, 2012 at 3:05 PM.)