I have created a test index with the following settings and mappings:
{
  "settings": {
    "analysis": {
      "filter": {
        "emailcustom": {
          "type": "pattern_capture",
          "preserve_original": true,
          "patterns": [
            "([^@]+)",
            "(\\p{L}+)",
            "(\\d+)",
            "@(.+)"
          ]
        }
      },
      "analyzer": {
        "my_email_analyzer": {
          "tokenizer": "uax_url_email",
          "filter": [ "lowercase", "asciifolding", "emailcustom" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "descwithemail": {
        "type": "keyword",
        "fields": {
          "text": {
            "type": "text",
            "analyzer": "my_email_analyzer"
          }
        }
      }
    }
  }
}
When I run the analyzer against a sample sentence:
GET /myindex/_analyze
{
  "analyzer": "my_email_analyzer",
  "text": "My email address is john@domain.com"
}
I get duplicate tokens in the output, namely 'domain.com' and 'john'.
I know I could add the 'unique' filter to remove them, but what is the impact of using it versus leaving it out?
I also do not understand where the duplicate tokens come from in the first place.
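For reference, the variant I am considering looks roughly like the sketch below. It only changes the analyzer definition by appending the built-in `unique` token filter to the end of the chain; the `emailcustom` filter and the mappings stay as shown above. I have not verified its effect on scoring or highlighting, and I have left the filter's `only_on_same_position` option at its default, so treat this as an untested assumption rather than a known fix.

```json
{
  "analyzer": {
    "my_email_analyzer": {
      "tokenizer": "uax_url_email",
      "filter": [ "lowercase", "asciifolding", "emailcustom", "unique" ]
    }
  }
}
```

This fragment would replace the "analyzer" object inside "settings.analysis" in the index definition.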