pattern_replace not working with optional regex groups

I'm trying to remove http:// and/or www. from a URL, as well as the ?... query string, so basically strip it to its minimal form. I'm using the pattern_replace char_filter (ES version 1.7) and having trouble matching an "optional" regex pattern.

So here's my filter:

"url_strip_filter": {
	"type": "pattern_replace",
	"pattern": "(?:http:\\/\\/)?(?:www\\.)?([^\\?]+)(?:\\?.*)?",
	"replacement": "$1"
}

And when I use the ES _analyze endpoint with that filter, it does not strip out the optional groups:

GET /my_index/_analyze?char_filters=url_strip_filter&tokenizer=keyword
{
    "text": "http://www.blah.com/?foo"
}

{
   "tokens": [
      {
         "token": "{\n    \"text\": \"http://www.blah.com/\n}\n",
         "start_offset": 0,
         "end_offset": 43,
         "type": "word",
         "position": 1
      }
   ]
}

But when I remove the ? that makes the first two groups optional, it works:

"url_strip_filter": {
	"type": "pattern_replace",
	"pattern": "(?:http:\\/\\/)(?:www\\.)([^\\?]+)(?:\\?.*)?",
	"replacement": "$1"
}

{
   "tokens": [
      {
         "token": "{\n    \"text\": \"blah.com/\n}\n",
         "start_offset": 0,
         "end_offset": 43,
         "type": "word",
         "position": 1
      }
   ]
}

I don't think it's a regex issue, because I tested the regex and it works as expected: https://regex101.com/r/Hd9E2V/3.
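For anyone who wants to reproduce this outside Elasticsearch, here is a rough sketch that runs both patterns through Python's `re` module. It rests on two assumptions. First, Python's engine stands in for the Java regex engine that ES uses, and the two should agree for this pattern. Second, the text being analyzed is the whole request body rather than just the URL, which the tokens above seem to suggest: they contain the `{ "text": ...` wrapper, and the `end_offset` of 43 equals the length of the full body. The names `strip_url`, `OPTIONAL`, `REQUIRED`, `body` and `url` are made up for illustration.

```python
import re

# Patterns from the post as Python raw strings (JSON's doubled backslashes undone).
OPTIONAL = r"(?:http:\/\/)?(?:www\.)?([^\?]+)(?:\?.*)?"
REQUIRED = r"(?:http:\/\/)(?:www\.)([^\?]+)(?:\?.*)?"


def strip_url(pattern: str, text: str) -> str:
    """Replace every match of `pattern` with its first capture group ($1)."""
    return re.sub(pattern, r"\1", text)


# Assumption: the analyzer sees the raw request body, not only the URL.
body = '{\n    "text": "http://www.blah.com/?foo"\n}\n'
url = "http://www.blah.com/?foo"

print(len(body))                        # 43, same as end_offset in both responses
print(repr(strip_url(OPTIONAL, body)))  # '{\n    "text": "http://www.blah.com/\n}\n'
print(repr(strip_url(REQUIRED, body)))  # '{\n    "text": "blah.com/\n}\n'
print(repr(strip_url(OPTIONAL, url)))   # 'blah.com/'
```

Under those assumptions the sketch gives the same two tokens as the responses above. With the optional groups the match can start at the opening `{`, so `([^\?]+)` captures everything up to the `?`, including the prefix. On the bare URL, which is what I tested on regex101, the same pattern returns `blah.com/`.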

Not sure what I'm missing here, but any help would be appreciated!
