Search partial URL

Hello,

I have a problem searching for partial URLs in a text field. I'm using a word_delimiter filter to split possible URLs. Here are the settings and mappings:

{  
    "settings":{  
        "analysis":{  
            "filter":{  
                "my_word_delimiter":{  
                    "type":"word_delimiter",
                    "catenate_words":false,
                    "catenate_numbers":false,
                    "split_on_numerics":false,
                    "split_on_case_change":false
                },
                "english_possessive_stemmer":{  
                    "type":"stemmer",
                    "name":"possessive_english"
                },
                "english_stop":{  
                    "type":"stop",
                    "stopwords":"_english_"
                },
                "english_plural_stemmer":{  
                    "type":"stemmer",
                    "name":"minimal_english"
                }
            },
            "analyzer":{  
                "custom_analyzer":{  
                    "tokenizer": "whitespace",
                    "filter":[
                        "my_word_delimiter",
                        "lowercase",
                        "english_possessive_stemmer",
                        "english_plural_stemmer",
                        "english_stop"
                    ]
                }
            }
        }
    },
    "mappings":{
        "test":{
            "properties":{
                "body":{
                    "type":"text",
                    "analyzer":"custom_analyzer"
                }
            }
        }
    }
}

When running a search or aggregating data this leads to unexpected results.
Let's consider the following document:

{
    "body": "www.google.co.uk hello com"
}

It generates 6 tokens: www, google, co, uk, hello, com.

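If a user searches for "google.com", ES returns the document above, because the query is analysed into the tokens google and com, and both are present in the document. Although that is technically correct, it is not what a user would expect.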
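
To illustrate why the match happens, here is a rough Python approximation of the analysis chain. It is only a sketch, not the actual Lucene implementation: it splits on whitespace, then on non-alphanumeric characters, and lowercases. The stemmers and the stop filter are left out because, going by the token list above, they leave these particular tokens unchanged. The function name `approx_analyze` is made up for this example.

```python
import re

def approx_analyze(text):
    """Rough stand-in for custom_analyzer: whitespace tokenizer,
    then a word_delimiter-like split on non-alphanumerics, then lowercase.
    Stemming and stop-word removal are intentionally omitted."""
    tokens = []
    for chunk in text.split():
        for part in re.split(r"[^A-Za-z0-9]+", chunk):
            if part:
                tokens.append(part.lower())
    return tokens

doc_tokens = approx_analyze("www.google.co.uk hello com")
query_tokens = approx_analyze("google.com")

print(doc_tokens)
print(query_tokens)
# Every query token also appears among the document tokens, so the document matches.
print(all(t in doc_tokens for t in query_tokens))
```

The "com" that makes the query match comes from the standalone word at the end of the body, not from the URL itself.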

So I was thinking to implement a filter to parse URLs.
The filter is supposed to generate the following tokens for the token "www.google.co.uk": www.google.co.uk (original), google.co.uk, google.
Then, at query time, I would use a simple analyser that doesn't tokenise the input. So if the user searches for "google" or "google.co.uk", they will get proper results.
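
A minimal Python sketch of the idea, to make the expected behaviour concrete. This is not an Elasticsearch filter, just a simulation. The rule used to derive the extra tokens is an assumption on my part: strip a leading "www." to get the host, then take the first label of the host. Real URLs would probably need something like a public suffix list. The names `url_tokens` and `index_tokens` are hypothetical.

```python
def url_tokens(token):
    """Emit the original token plus shortened variants for a URL-like token.
    Assumed rule: drop a leading 'www.', then keep the first label."""
    token = token.lower()
    if "." not in token:
        return [token]  # not URL-like, keep as-is
    out = [token]
    host = token[4:] if token.startswith("www.") else token
    if host != token:
        out.append(host)
    first_label = host.split(".", 1)[0]
    if first_label and first_label not in out:
        out.append(first_label)
    return out

def index_tokens(text):
    """Index-time analysis: whitespace split, then url_tokens on each chunk."""
    tokens = []
    for chunk in text.split():
        tokens.extend(url_tokens(chunk))
    return tokens

doc_tokens = index_tokens("www.google.co.uk hello com")
print(doc_tokens)

# Query time: the input is lowercased but not tokenised, so it must equal a whole token.
for query in ["google", "google.co.uk", "google.com"]:
    print(query, query.lower() in doc_tokens)
```

With this scheme "google" and "google.co.uk" match the document, while "google.com" does not, which is the behaviour I'm after.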

What do you think?
