Non-matched tokens not filtered: pattern_capture with preserve_original: false


(Julian Haight) #1

I have a custom pattern tokenizer that creates tokens which can include a colon (:). I also have a custom pattern_capture token filter with preserve_original set to false.

The problem is that tokens which don't match any of my patterns are being passed through the filter.

Here's a simplified test case:

            "filter": {
                "ipnwords": {
                    "type": "pattern_capture",
                    "preserve_original": false,
                    "patterns": [
                        "([a-z0-9][a-z0-9-]+[a-z0-9])"
                    ]
                }
            },
            "analyzer": {
                "words": {
                    "tokenizer": "ipwords",
                    "filter": ["lowercase", "ipnwords", "unique"]
                }
            },
            "tokenizer": {
                "ipwords": {
                    "type": "pattern",
                    "pattern": "[a-zA-Z0-9:\\.-]+",
                    "group": 0
                }
            }

When I pass a test document to /_analyze that includes the string :::, that string is returned as a token, even though it should never match the pattern in the filter.

This seems like a bug (preserve_original: false is ignored when no pattern matches in pattern_capture), but maybe someone can point out the error of my ways.
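To make the observed behaviour concrete, here is a minimal Python sketch of how the filter appears to act. It is a model, not the real implementation: the function name `pattern_capture` and its fallback rule are assumptions based on what /_analyze returns (and on my reading of the Lucene Javadoc for PatternCaptureGroupTokenFilter, which seems to say the original token is kept when none of the patterns match).

```python
import re


def pattern_capture(tokens, patterns, preserve_original=False):
    """Rough model of a pattern_capture token filter (hypothetical helper).

    Assumption: a token with no capture-group match is emitted unchanged,
    even when preserve_original is False.
    """
    compiled = [re.compile(p) for p in patterns]
    out = []
    for token in tokens:
        captures = []
        for rx in compiled:
            # Collect every non-empty capture group from every match.
            for m in rx.finditer(token):
                captures.extend(g for g in m.groups() if g)
        # The original survives if asked for OR if nothing matched.
        if preserve_original or not captures:
            out.append(token)
        # Skip a capture that would just duplicate the preserved original.
        out.extend(c for c in captures
                   if not (preserve_original and c == token))
    return out


if __name__ == "__main__":
    patterns = [r"([a-z0-9][a-z0-9-]+[a-z0-9])"]
    print(pattern_capture([":::", "abc:def"], patterns))
```

Under that assumption, `:::` comes back untouched while `abc:def` is reduced to `abc` and `def`, which matches what I see from /_analyze.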

Thanks in advance for any help/advice. Let me know if I can provide more detail or help with testing/bugfix.


(Davidcai19840412) #2

I have the same problem, and it is still not solved.


(Peter Pul) #3

OK, I had to create an account to reply to this, but I had a similar problem and maybe my solution can help you out. You helped me with the following sentence:

This made me understand what the tokenizer actually does.

But I think your problem lies in your custom tokenizer, which has the pattern "[a-zA-Z0-9:\\.-]+". The docs say: "The default pattern is \W+, which splits text whenever it encounters non-word characters." So if your pattern is applied as a separator, you get a lot of tokens! In that case you probably need to negate it: "[^a-zA-Z0-9:\\.-]+".
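For comparison, here is a rough Python model of the two modes of the pattern tokenizer. It assumes that `group: -1` (the documented default) splits on the pattern and that `group: 0` emits each whole match as a token; `pattern_tokenize` is a hypothetical helper and the sample text is made up.

```python
import re


def pattern_tokenize(text, pattern, group=-1):
    """Rough model of the pattern tokenizer (hypothetical helper).

    group == -1: the pattern is a separator and the text is split on it.
    group >= 0:  each match (or the given capture group) becomes a token.
    Note: in split mode this sketch expects a pattern without capture
    groups, because re.split would include them in its output.
    """
    rx = re.compile(pattern)
    if group == -1:
        return [t for t in rx.split(text) if t]
    return [m.group(group) for m in rx.finditer(text) if m.group(group)]


if __name__ == "__main__":
    text = "host-1.example.com ::: 10.0.0.1"
    # Match mode, as in the configuration from the first post.
    print(pattern_tokenize(text, r"[a-zA-Z0-9:\.-]+", group=0))
    # Split mode with the negated character class.
    print(pattern_tokenize(text, r"[^a-zA-Z0-9:\.-]+"))
```

In this model both calls produce the same tokens, including `:::`, so it is worth checking which mode the `group` setting selects before negating the pattern.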

To compare with my problem and solution, see StackOverflow

