Duplicate tokens with the Elasticsearch uax_url_email tokenizer

I have created a test index with the following settings and mappings:

    {
      "settings": {
        "analysis": {
          "filter": {
            "emailcustom": {
              "type": "pattern_capture",
              "preserve_original": true,
              "patterns": [
                "([^@]+)",
                "(\\p{L}+)",
                "(\\d+)",
                "@(.+)"
              ]
            }
          },
          "analyzer": {
            "my_email_analyzer": {
              "tokenizer": "uax_url_email",
              "filter": [ "lowercase", "asciifolding", "emailcustom" ]
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "descwithemail": {
            "type": "keyword",
            "fields": {
              "text": {
                "type": "text",
                "analyzer": "my_email_analyzer"
              }
            }
          }
        }
      }
    }

When I run the following analyze request:

    GET /myindex/_analyze
    {
      "analyzer": "my_email_analyzer",
      "text": "My email address is john@domain.com"
    }

I get duplicate tokens, namely 'domain.com' and 'john'.

I know I should use the 'unique' filter, but what is the impact of using it versus not using it?

I also do not understand where the duplicate tokens come from.
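
My best guess so far, sketched in Python rather than anything Elasticsearch actually runs: if pattern_capture emits one token for every capture group of every pattern and does not deduplicate across patterns, then 'john' would be captured by both "([^@]+)" and "(\\p{L}+)", and 'domain.com' by both "([^@]+)" and "@(.+)". The pattern_capture and unique functions below are hypothetical stand-ins, not Elasticsearch APIs, and [^\W\d_]+ is only an approximation of \p{L}+ because Python's re module does not support \p{L}.

```python
import re
from collections import Counter

# Same four patterns as the "emailcustom" filter. Assumption: [^\W\d_]+
# (letters only) stands in for \p{L}+, which Python's re does not support.
PATTERNS = [r"([^@]+)", r"([^\W\d_]+)", r"(\d+)", r"@(.+)"]


def pattern_capture(token, patterns=PATTERNS, preserve_original=True):
    """Hypothetical approximation of pattern_capture applied to one token."""
    out = [token] if preserve_original else []
    for pat in patterns:
        for m in re.finditer(pat, token):
            for g in range(1, m.re.groups + 1):
                start, end = m.span(g)
                if start == end:
                    continue  # skip empty or non-participating groups
                if preserve_original and start == 0 and end == len(token):
                    continue  # capture equals the original token already kept
                out.append(m.group(g))
    return out


def unique(tokens):
    """Hypothetical stand-in for the 'unique' filter: keep first occurrences."""
    return list(dict.fromkeys(tokens))


# uax_url_email keeps the address as a single token, so the filter sees it whole.
tokens = pattern_capture("john@domain.com")
duplicates = sorted(t for t, n in Counter(tokens).items() if n > 1)
print(duplicates)       # -> ['domain.com', 'john']
print(unique(tokens))   # -> ['john@domain.com', 'john', 'domain.com', 'domain', 'com']
```

If that reading is right, adding 'unique' after 'emailcustom' in the analyzer's filter list should collapse the repeats. Without it, I assume the repeated terms are indexed more than once and so count extra toward term frequency for the field, which could nudge relevance scores slightly.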
