I am using the uax_url_email tokenizer together with the filters lowercase, email_filter and unique, where email_filter is defined as:
email_filter:
  "type" => "pattern_capture",
  "preserve_original" => true,
  "patterns" => [
    "([^@]+)",
    "(\p{L}+)",
    "(\d+)",
    "@(.+)"
  ]
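For completeness, here is a sketch of the index settings this corresponds to (the index and analyzer names my_index / my_email_analyzer are placeholders I picked; note the extra backslash escaping that JSON requires in the patterns):

```json
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_email_analyzer": {
          "type": "custom",
          "tokenizer": "uax_url_email",
          "filter": ["lowercase", "email_filter", "unique"]
        }
      },
      "filter": {
        "email_filter": {
          "type": "pattern_capture",
          "preserve_original": true,
          "patterns": [
            "([^@]+)",
            "(\\p{L}+)",
            "(\\d+)",
            "@(.+)"
          ]
        }
      }
    }
  }
}
```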
Now if we try to analyse the text s@test.club, it gets tokenised like this:
{
  "tokens": [
    {
      "token": "s@test.cl",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<EMAIL>",
      "position": 0
    },
    {
      "token": "s",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<EMAIL>",
      "position": 0
    },
    {
      "token": "s",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<EMAIL>",
      "position": 0
    },
    {
      "token": "test.cl",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<EMAIL>",
      "position": 0
    },
    {
      "token": "test",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<EMAIL>",
      "position": 0
    },
    {
      "token": "cl",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<EMAIL>",
      "position": 0
    },
    {
      "token": "ub",
      "start_offset": 9,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
```
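The output above comes from the _analyze API; assuming the analyzer is registered as my_email_analyzer on an index my_index (both names are placeholders), the request looks like:

```json
POST my_index/_analyze
{
  "analyzer": "my_email_analyzer",
  "text": "s@test.club"
}
```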
Problem:
Why is the original token (s@test.club) not preserved when the text is tokenised?
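To illustrate what the capture groups themselves extract, here is a small Python sketch run against the first emitted token, s@test.cl (Python's re module does not support \p{L}, so [A-Za-z]+ is used as a rough stand-in for this ASCII example):

```python
import re

token = "s@test.cl"  # first token emitted by uax_url_email for "s@test.club"

# Rough equivalents of the pattern_capture patterns above.
print(re.findall(r"([^@]+)", token))      # parts around "@" → ['s', 'test.cl']
print(re.findall(r"([A-Za-z]+)", token))  # letter runs     → ['s', 'test', 'cl']
print(re.findall(r"(\d+)", token))        # digit runs      → []
print(re.findall(r"@(.+)", token))        # domain part     → ['test.cl']
```

This matches the derived tokens in the output (s, test.cl, test, cl); only the original whole email is missing.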