The tokenizer "uax_url_email" doesn't work

Recently I have been studying Elasticsearch analyzers, so I tried out the "uax_url_email" tokenizer and the "pattern_capture" token filter.

According to this page https://www.elastic.co/guide/en/elasticsearch/reference/2.1/analysis-pattern-capture-tokenfilter.html, the analyzer below, given the input john-smith_123@foo-bar.com, should produce the following tokens: john-smith_123@foo-bar.com, john-smith_123, john, smith, 123, foo-bar.com, foo, bar, com.

curl -XPUT 10.35.22.80:9200/test_email/ -d '
{
  "settings" : {
    "analysis" : {
      "filter" : {
        "email" : {
          "type" : "pattern_capture",
          "preserve_original" : true,
          "patterns" : [
            "([^@]+)",
            "(\\p{L}+)",
            "(\\d+)",
            "@(.+)"
          ]
        }
      },
      "analyzer" : {
        "email" : {
          "tokenizer" : "uax_url_email",
          "filter" : [ "email", "lowercase", "unique" ]
        }
      }
    }
  }
}'

But when I try the above settings on my cluster, the output is not the same as what the ES page describes; I only get the output below. I tried both 1.7.3 and 2.1.0, and both produce the same result. Does anybody know what I should do?
cloud@g-dev-work:~> curl 10.35.22.80:9200/test_email/_analyze?pretty -d "john-smith_123@foo-bar.com"
{
  "tokens" : [ {
    "token" : "john",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "<ALPHANUM>",
    "position" : 0
  }, {
    "token" : "smith_123",
    "start_offset" : 5,
    "end_offset" : 14,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "foo",
    "start_offset" : 15,
    "end_offset" : 18,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "bar.com",
    "start_offset" : 19,
    "end_offset" : 26,
    "type" : "<ALPHANUM>",
    "position" : 3
  } ]
}

You need to tell the _analyze endpoint which analyzer to use. Without the analyzer parameter it falls back to the index's default analyzer (the standard analyzer here), which is exactly the word-split output you are seeing:

curl "localhost:9200/test_email/_analyze?analyzer=email&pretty" -d "john-smith_123@foo-bar.com"

That works. Thanks, I still have a lot to learn. :smile: