Non-matched tokens not filtered: pattern_capture with preserve_original: false

haight6716 · February 2, 2016, 10:40pm

I have a custom pattern tokenizer that creates tokens which can include colons (:). I also have a custom pattern_capture token filter, with preserve_original set to false.

The problem is that tokens which don't match any of my patterns are being passed through the filter.

Here's a simplified test case:

            "filter": {
                "ipnwords": {
                    "type": "pattern_capture",
                    "preserve_original": false,
                    "patterns": [
                        "([a-z0-9][a-z0-9-]+[a-z0-9])"
                    ]
                }
            },
            "analyzer": {
                "words": {
                    "tokenizer": "ipwords",
                    "filter": ["lowercase", "ipnwords", "unique"]
                }
            },
            "tokenizer": {
                "ipwords": {
                    "type": "pattern",
                    "pattern": "[a-zA-Z0-9:\\.-]+",
                    "group": 0
                }
            }

When I pass a test document to /_analyze that includes the string :::, that string is returned as a token, even though it should never match the pattern in the filter.
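To check the regexes on their own, here is a minimal sketch outside Elasticsearch. It assumes Python's re module treats these two patterns the same way Java's regex engine does, and the sample string is made up for illustration. It only shows what the patterns match, not how the filter decides what to emit.

```python
import re

# Patterns copied from the settings above; the JSON string "\\." is the regex \.
TOKENIZER_PATTERN = re.compile(r"[a-zA-Z0-9:\.-]+")
CAPTURE_PATTERN = re.compile(r"([a-z0-9][a-z0-9-]+[a-z0-9])")


def tokenize(text):
    # "group": 0 means each whole regex match becomes a token.
    return [m.group(0) for m in TOKENIZER_PATTERN.finditer(text)]


def captures(token):
    # Every capture-group hit for the lowercased token (mirrors the
    # "lowercase" filter running before "ipnwords" in the analyzer).
    return [m.group(1) for m in CAPTURE_PATTERN.finditer(token.lower())]


sample = "host-01.example.com ::: fe80::1"  # hypothetical test document
for token in tokenize(sample):
    print(repr(token), "->", captures(token))
```

The tokenizer pattern does produce ::: as a token and the capture pattern finds nothing in it, so if ::: shows up in the /_analyze output it must be passing through the filter unchanged.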

This seems like a bug (preserve_original=false is ignored when no pattern matches in pattern_capture), but maybe someone can point out the error of my ways.

Thanks in advance for any help/advice. Let me know if I can provide more detail or help with testing/bugfix.

davidcai19840412 · August 24, 2016, 8:43am

I have the same problem, and it is not solved.

Blackeagle52 · March 6, 2017, 5:47pm

Okay, I had to create an account to reply to this, but I had a similar problem and maybe my solution can help you out. You helped me with the following sentence:

This made me understand what the tokenizer actually does.

But I think your problem lies in your custom tokenizer, which has the pattern "[a-zA-Z0-9:\\.-]+", while the docs say: "The default pattern is \W+, which splits text whenever it encounters non-word characters." So if your pattern is treated as the separator, you get a lot of tokens! You probably need to negate your pattern: "[^a-zA-Z0-9:\\.-]+".
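A rough way to compare the two readings of the pattern is sketched below. It assumes Python's re.findall and re.split can stand in for the pattern tokenizer's match mode ("group": 0) and split mode (no "group" set, which defaults to -1). The sample text is hypothetical.

```python
import re

text = "host-01.example.com ::: fe80::1"  # hypothetical sample input

# Match mode ("group": 0): every match of the pattern is a token.
match_tokens = re.findall(r"[a-zA-Z0-9:\.-]+", text)

# Split mode (no "group", default -1): the pattern marks the separators,
# so the character class is negated to keep the same characters in tokens.
# Empty strings from leading/trailing separators are dropped.
split_tokens = [t for t in re.split(r"[^a-zA-Z0-9:\.-]+", text) if t]

print(match_tokens)
print(split_tokens)
```

In this sketch both modes give the same tokens, so the negated pattern only matters when "group" is left at its default. Either way ::: still reaches the token filter.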

To compare with my problem and solution, see StackOverflow
