I am using the uax_url_email tokenizer together with the filters lowercase, email_filter and unique, where email_filter is defined as:
email_filter:
  "type" => "pattern_capture",
  "preserve_original" => true,
  "patterns" => [
    "([^@]+)",
    "(\p{L}+)",
    "(\d+)",
    "@(.+)"
  ]
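For completeness, here is a sketch of the index settings this corresponds to (the index and analyzer names my_index / my_email_analyzer are placeholders I picked; note the extra backslash escaping that JSON requires in the patterns):

```json
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_email_analyzer": {
          "type": "custom",
          "tokenizer": "uax_url_email",
          "filter": ["lowercase", "email_filter", "unique"]
        }
      },
      "filter": {
        "email_filter": {
          "type": "pattern_capture",
          "preserve_original": true,
          "patterns": [
            "([^@]+)",
            "(\\p{L}+)",
            "(\\d+)",
            "@(.+)"
          ]
        }
      }
    }
  }
}
```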
Now if we try to analyse the text s@test.club, it gets tokenised like this:
{
  "tokens": [
    {
      "token": "s@test.cl",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<EMAIL>",
      "position": 0
    },
    {
      "token": "s",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<EMAIL>",
      "position": 0
    },
    {
      "token": "s",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<EMAIL>",
      "position": 0
    },
    {
      "token": "test.cl",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<EMAIL>",
      "position": 0
    },
    {
      "token": "test",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<EMAIL>",
      "position": 0
    },
    {
      "token": "cl",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<EMAIL>",
      "position": 0
    },
    {
      "token": "ub",
      "start_offset": 9,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
```
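The output above comes from the _analyze API; assuming the analyzer is registered as my_email_analyzer on an index my_index (both names are placeholders), the request looks like:

```json
POST my_index/_analyze
{
  "analyzer": "my_email_analyzer",
  "text": "s@test.club"
}
```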
Problem:
Why is the original token (s@test.club) not preserved when the text is tokenised?
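To illustrate what the capture groups themselves extract, here is a small Python sketch run against the first emitted token, s@test.cl (Python's re module does not support \p{L}, so [A-Za-z]+ is used as a rough stand-in for this ASCII example):

```python
import re

token = "s@test.cl"  # first token emitted by uax_url_email for "s@test.club"

# Rough equivalents of the pattern_capture patterns above.
print(re.findall(r"([^@]+)", token))      # parts around "@" → ['s', 'test.cl']
print(re.findall(r"([A-Za-z]+)", token))  # letter runs     → ['s', 'test', 'cl']
print(re.findall(r"(\d+)", token))        # digit runs      → []
print(re.findall(r"@(.+)", token))        # domain part     → ['test.cl']
```

This matches the derived tokens in the output (s, test.cl, test, cl); only the original whole email is missing.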