Pattern_capture filter also emits tokens that do not match the pattern

Raj_2 · August 14, 2014, 12:13pm

I have a case where I have to extract the domain part from emails found in a text. I used the uax_url_email tokenizer so that each email comes out as a single token, and I have a pattern_capture filter that emits the string captured by the "@(.+)" pattern. But uax_url_email also returns the words that are not emails, and the pattern_capture filter does not filter them out. Any suggestions?

        "custom_analyzer": {
            "tokenizer": "uax_url_email",
            "filter": [
                "email_domain_filter"
            ]
        }

        "filter": {
            "email_domain_filter": {
                "type": "pattern_capture",
                "preserve_original": false,
                "patterns": [
                    "@(.+)"
                ]
            }
        }

Input string: "my email id is xyz@gmail.com"
Output tokens: my, email, id, is, gmail.com

But I need only gmail.com.
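
A possible explanation, based on how the underlying Lucene filter is documented rather than on anything confirmed in this thread: pattern_capture passes a token through unchanged when none of its patterns match, so preserve_original: false only affects tokens that do match. A minimal sketch of a workaround follows. It assumes an Elasticsearch version that ships the keep_types token filter, and that uax_url_email tags email tokens with the type <EMAIL>. Under those assumptions, every non-email token is dropped before the capture step. The filter name email_only is hypothetical.

        "analysis": {
            "analyzer": {
                "custom_analyzer": {
                    "tokenizer": "uax_url_email",
                    "filter": ["email_only", "email_domain_filter"]
                }
            },
            "filter": {
                "email_only": {
                    "type": "keep_types",
                    "types": ["<EMAIL>"]
                },
                "email_domain_filter": {
                    "type": "pattern_capture",
                    "preserve_original": false,
                    "patterns": ["@(.+)"]
                }
            }
        }

If those assumptions hold, only xyz@gmail.com should reach email_domain_filter for the input string above, leaving gmail.com as the single output token. This has not been checked against a running cluster.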


haight6716 · February 2, 2016, 10:00pm

I have a similar problem using pattern capture. In my case, it "feels" like preserve_original: false isn't working. I'm getting tokens that don't match my patterns. My document includes the text ":::" which is being picked up as a token, but my filter allows only [a-z0-9].

        "analysis": {
            "filter": {
                "ipnwords": {
                    "type": "pattern_capture",
                    "preserve_original": false,
                    "patterns": [
                        "^([a-z0-9]+)$"
                    ]
                }
            },
            "analyzer": {
                "words": {
                    "tokenizer": "ipwords",
                    "filter": ["lowercase", "ipnwords", "unique"]
                }
            },
            "tokenizer": {
                "ipwords": {
                    "type": "pattern",
                    "pattern": "[a-zA-Z0-9:\\.-]+",
                    "group": 0
                }
            }
        }

You can see my tokenizer is picking up colons, but my filter should be dropping tokens that contain a colon. Ultimately, I'll include patterns that allow colons in specific situations, so I can't just drop the colon from the tokenizer.
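
A sketch of one possible workaround, assuming (as the Lucene documentation for the underlying filter describes) that pattern_capture keeps the original token whenever no pattern matches: blank out the unwanted tokens with a pattern_replace filter, then remove the resulting empty tokens with a length filter. The filter names drop_non_alnum and drop_empty are hypothetical, and the regex would need adjusting once colons are allowed in specific situations.

        "filter": {
            "drop_non_alnum": {
                "type": "pattern_replace",
                "pattern": "^.*[^a-z0-9].*$",
                "replacement": ""
            },
            "drop_empty": {
                "type": "length",
                "min": 1
            }
        },
        "analyzer": {
            "words": {
                "tokenizer": "ipwords",
                "filter": ["lowercase", "drop_non_alnum", "drop_empty", "unique"]
            }
        }

In this sketch a token such as ":::" matches the anchored pattern in full and is replaced with an empty string, which the length filter then discards, while a token such as "abc" does not match and passes through unchanged.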
