Pattern_capture filter also emits tokens that do not match the pattern


(Raj-2) #1
I have a case where I need to extract the domain part from email addresses
found in a text. I use the uax_url_email tokenizer to keep each email address
as a single token, and a pattern_capture filter that emits the string captured
by the pattern "@(.+)". But uax_url_email also returns words that are not
email addresses, and the pattern_capture filter does not filter those out.
Any suggestions?

"custom_analyzer": {
    "tokenizer": "uax_url_email",
    "filter": [
        "email_domain_filter"
    ]
}
"filter": {
    "email_domain_filter": {
        "type": "pattern_capture",
        "preserve_original": false,
        "patterns": [
            "@(.+)"
        ]
    }
}

Input string: "my email id is xyz@gmail.com"
Output tokens: my, email, id, is, gmail.com

But I need only gmail.com
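
Note: pattern_capture is not a filtering step. As far as I know, a token that matches none of the patterns is passed through unchanged, and preserve_original only controls whether a matching token is emitted alongside its captures. One possible workaround, sketched below, is to drop non-email tokens before the capture step. It assumes the Elasticsearch version in use ships the keep_types token filter and relies on uax_url_email tagging email tokens with the type <EMAIL>. The filter name email_only is hypothetical, and the surrounding "settings"/"analysis" wrapper is only there to make the fragment complete.

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "email_only": {
          "type": "keep_types",
          "types": ["<EMAIL>"]
        },
        "email_domain_filter": {
          "type": "pattern_capture",
          "preserve_original": false,
          "patterns": ["@(.+)"]
        }
      },
      "analyzer": {
        "custom_analyzer": {
          "type": "custom",
          "tokenizer": "uax_url_email",
          "filter": ["email_only", "email_domain_filter"]
        }
      }
    }
  }
}
```

With this ordering, only tokens typed <EMAIL> should reach email_domain_filter, so for the input above the expected result would be the single token gmail.com. Running the analyzer through the _analyze API is a quick way to confirm this on a given version.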



(Julian Haight) #2

I have a similar problem using pattern capture. In my case, it "feels" like preserve_original: false isn't working. I'm getting tokens that don't match my patterns. My document includes the text ":::" which is being picked up as a token, but my filter allows only [a-z0-9].

        "analysis": {
            "filter": {
                "ipnwords": {
                    "type": "pattern_capture",
                    "preserve_original": false,
                    "patterns": [
                        "^([a-z0-9]+)$"
                    ]
                }
            },
            "analyzer": {
                "words": {
                    "tokenizer": "ipwords",
                    "filter": ["lowercase", "ipnwords", "unique"]
                }
            },
            "tokenizer": {
                "ipwords": {
                    "type": "pattern",
                    "pattern": "[a-zA-Z0-9:\\.-]+",
                    "group": 0
                }
            }
        }

You can see my tokenizer is picking up colons, but my filter should be dropping tokens that contain a colon. Ultimately, I'll include patterns that allow colons in specific situations, so I can't just drop the colon from the tokenizer.
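
Note: this looks like the same behavior as in the first post. pattern_capture appears to pass non-matching tokens such as ":::" through untouched, whatever the value of preserve_original. A possible workaround, sketched below and not tested against this exact setup, is to blank out unwanted tokens with a pattern_replace filter and then remove the resulting empty tokens with a length filter. The filter names blank_non_words and drop_empty are hypothetical. The negative lookahead would have to be extended to cover any additional patterns (for example the ones allowing colons) that are added to ipnwords later.

```json
{
  "analysis": {
    "filter": {
      "blank_non_words": {
        "type": "pattern_replace",
        "pattern": "^(?![a-z0-9]+$).*$",
        "replacement": ""
      },
      "drop_empty": {
        "type": "length",
        "min": 1
      },
      "ipnwords": {
        "type": "pattern_capture",
        "preserve_original": false,
        "patterns": ["^([a-z0-9]+)$"]
      }
    },
    "analyzer": {
      "words": {
        "tokenizer": "ipwords",
        "filter": ["lowercase", "blank_non_words", "drop_empty", "ipnwords", "unique"]
      }
    },
    "tokenizer": {
      "ipwords": {
        "type": "pattern",
        "pattern": "[a-zA-Z0-9:\\.-]+",
        "group": 0
      }
    }
  }
}
```

The design choice here is to keep the tokenizer as it is, so colons still reach the filter chain, and to make the dropping step explicit instead of expecting pattern_capture to do it.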


(system) #3