Search partial URL


(Davide) #1

Hello,

I have a problem searching for partial URLs in a text field. I'm using a word_delimiter filter to split possible URLs. Here are the settings and mappings:

{  
    "settings":{  
        "analysis":{  
            "filter":{  
                "my_word_delimiter":{  
                    "type":"word_delimiter",
                    "catenate_words":false,
                    "catenate_numbers":false,
                    "split_on_numerics":false,
                    "split_on_case_change":false
                },
                "english_possessive_stemmer":{  
                    "type":"stemmer",
                    "name":"possessive_english"
                },
                "english_stop":{  
                    "type":"stop",
                    "stopwords":"_english_"
                },
                "english_plural_stemmer":{  
                    "type":"stemmer",
                    "name":"minimal_english"
                }
            },
            "analyzer":{  
                "custom_analyzer":{  
                    "tokenizer": "whitespace",
                    "filter":[
                        "my_word_delimiter",
                        "lowercase",
                        "english_possessive_stemmer",
                        "english_plural_stemmer",
                        "english_stop"
                    ]
                }
            }
        }
    },
    "mappings": {
        "test": {
            "properties": {
                "body": {
                    "type": "text",
                    "analyzer": "custom_analyzer"
                }
            }
        }
    }
}

When running a search or aggregating data this leads to unexpected results.
Let's consider the following document:

{
    "body": "www.google.co.uk hello com"
}

It generates 6 tokens: www, google, co, uk, hello, com.

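If a user searches for "google.com", ES returns the document above, because the query is analysed into the tokens google and com, and both appear in the document. Even though this is technically correct, it is not what the user expects.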
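
To make the mismatch concrete, here is a rough Python sketch of what the analyser does to the document and to the query. It is only an approximation: `approx_analyze` is a made-up helper that splits on whitespace, then on non-alphanumeric characters, and lowercases. The stemmers and the stop filter are left out because they do not change these particular tokens.

```python
import re

def approx_analyze(text):
    """Rough stand-in for custom_analyzer (hypothetical helper, not ES code):
    whitespace tokenizer, split on non-alphanumeric characters, lowercase."""
    tokens = []
    for chunk in text.split():
        # word_delimiter-like split: break each chunk on anything that is
        # not a letter or a digit, and drop empty pieces
        parts = re.split(r"[^A-Za-z0-9]+", chunk)
        tokens.extend(part.lower() for part in parts if part)
    return tokens

doc_tokens = approx_analyze("www.google.co.uk hello com")
query_tokens = approx_analyze("google.com")

print(doc_tokens)    # ['www', 'google', 'co', 'uk', 'hello', 'com']
print(query_tokens)  # ['google', 'com']

# Every query token is also a document token, so the document matches.
print(all(token in doc_tokens for token in query_tokens))  # True
```

The trailing "com" in the body is what makes the second query token match, even though it has nothing to do with the URL.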

So I was thinking of implementing a token filter to parse URLs.
For the token "www.google.co.uk", the filter is supposed to generate the following tokens: www.google.co.uk (original), google.co.uk, google.
Then, at query time, I would use a simple analyser that doesn't tokenise the input. So if the user searches for "google" or "google.co.uk", they will get the expected results.
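
A minimal Python sketch of the idea, just to check the logic before writing a real filter. The names `url_tokens`, `index_tokens` and `matches` are hypothetical, and the rule is deliberately naive: keep the original token, drop a leading "www.", then keep the first remaining label. It does not know anything about real domain suffixes.

```python
def url_tokens(token):
    """Expand one whitespace-delimited token into the proposed index tokens."""
    host = token.lower()
    out = [host]
    # Naive assumption: only a literal leading "www." is stripped.
    bare = host[4:] if host.startswith("www.") else host
    if bare and bare not in out:
        out.append(bare)
    # Naive assumption: the first label of what remains is the site name.
    first_label = bare.split(".", 1)[0]
    if first_label and first_label not in out:
        out.append(first_label)
    return out

def index_tokens(text):
    """Index side: split on whitespace, then expand every token."""
    tokens = []
    for chunk in text.split():
        tokens.extend(url_tokens(chunk))
    return tokens

def matches(query, text):
    """Query side: lowercase and split on whitespace only, no further tokenising."""
    indexed = set(index_tokens(text))
    return all(term in indexed for term in query.lower().split())

body = "www.google.co.uk hello com"
print(url_tokens("www.google.co.uk"))  # ['www.google.co.uk', 'google.co.uk', 'google']
print(matches("google", body))         # True
print(matches("google.co.uk", body))   # True
print(matches("google.com", body))     # False
```

With this scheme "google.com" stays a single query term and no longer matches, while "google" and "google.co.uk" still do.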

What do you think?

