Elasticsearch version: 7.8.0
When emails are analyzed with the uax_url_email tokenizer and a pattern_capture filter, match_phrase searches on parts of an address return unexpected results.
Two indexed values:
hello-test hello-world@gmail.com
hello-test-hello-world@gmail.com
Searching for hello-test returns only "hello-test hello-world@gmail.com"; it should return both docs.
Searching for hello-world returns nothing; it should return both docs.
Steps to reproduce:
Create the index with its settings and mapping:
PUT email
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "email_analyzer": {
          "tokenizer": "uax_url_email",
          "filter": [
            "lowercase",
            "email",
            "remove_duplicates"
          ]
        },
        "search_email": {
          "tokenizer": "uax_url_email",
          "filter": [
            "lowercase"
          ]
        }
      },
      "filter": {
        "email": {
          "type": "pattern_capture",
          "preserve_original": "true",
          "patterns": [
            "([^@]+)",
            "(\\p{LD}+)",
            "@(.+)",
            "(@)"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "email": {
        "type": "text",
        "analyzer": "email_analyzer"
      }
    }
  }
}
Index two docs:
PUT email/_doc/1
{
  "email": "hello-test hello-world@gmail.com"
}
PUT email/_doc/2
{
  "email": "hello-test-hello-world@gmail.com"
}
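As a possible diagnostic (not part of the original report), the _analyze API can list the tokens and positions that email_analyzer produces for the second doc. The thing to look for is whether the sub-tokens captured by the email filter (for example hello and test) all carry the same position value. match_phrase needs the query terms at consecutive positions, so tokens stacked on one position would not match as a phrase.

```console
GET email/_analyze
{
  "analyzer": "email_analyzer",
  "text": "hello-test-hello-world@gmail.com"
}
```

Running the same request with the text of the first doc gives a side-by-side comparison of the position values.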
Search for hello-test:
GET email/_search
{
  "query": {
    "match_phrase": {
      "email": "hello-test"
    }
  }
}
Result: Only 1 doc is returned: "hello-test hello-world@gmail.com".
Expected: Both docs are returned.
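The mapping above sets no search_analyzer on the email field, so the query text should also pass through email_analyzer (the search_email analyzer is defined but not referenced). A quick way to check how hello-test is split at query time is to analyze it against the field. This is only a sketch for comparison with the indexed tokens:

```console
GET email/_analyze
{
  "field": "email",
  "text": "hello-test"
}
```

If this shows hello and test at two consecutive positions, the phrase query is asking for those two terms in sequence, which a doc can only satisfy if it has them at adjacent positions too.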
Search for hello-world:
GET email/_search
{
  "query": {
    "match_phrase": {
      "email": "hello-world"
    }
  }
}
Result: No docs are returned.
Expected: Both docs are returned.
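To help narrow down whether the terms are missing from the index or only their positions differ, one option is to run the same text through a match query with the and operator, which requires every term but ignores order and position. This is a hedged cross-check, not a proposed fix:

```console
GET email/_search
{
  "query": {
    "match": {
      "email": {
        "query": "hello-world",
        "operator": "and"
      }
    }
  }
}
```

If this returns both docs while match_phrase returns none, the terms are indexed and the difference most likely comes from token positions.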