Search partial URL

netcelli.tux · November 24, 2016, 11:10am

Hello,

I have a problem searching for partial URLs in a text field. I'm using a word_delimiter filter to split possible URLs. Here are the settings and mappings:

{  
    "settings":{  
        "analysis":{  
            "filter":{  
                "my_word_delimiter":{  
                    "type":"word_delimiter",
                    "catenate_words":false,
                    "catenate_numbers":false,
                    "split_on_numerics":false,
                    "split_on_case_change":false
                },
                "english_possessive_stemmer":{  
                    "type":"stemmer",
                    "name":"possessive_english"
                },
                "english_stop":{  
                    "type":"stop",
                    "stopwords":"_english_"
                },
                "english_plural_stemmer":{  
                    "type":"stemmer",
                    "name":"minimal_english"
                }
            },
            "analyzer":{  
                "custom_analyzer":{  
                    "tokenizer": "whitespace",
                    "filter":[  
                        "my_word_delimiter",
                        "lowercase",
                        "english_possessive_stemmer",
                        "english_plural_stemmer",
                        "english_stop"
                    ]
                }
            }
        }
    },
    "mappings":{  
        "test":{  
            "properties":{  
                "body":{  
                    "type":"text",
                    "analyzer":"custom_analyzer"
                }
            }
        }
    }
}

When running a search or aggregating data, this analyzer leads to unexpected results.
Consider the following document:

{
    "body": "www.google.co.uk hello com"
}

It generates 6 tokens: www, google, co, uk, hello, com.

If a user searches for "google.com", ES returns the document above, because the query is split into google and com and both tokens are in the index. That is technically correct, but it is not what a user expects.
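To make the problem concrete, here is a rough Python simulation of the analysis chain. It is only a sketch: `analyze` and `STOPWORDS` are hypothetical stand-ins, the stemmers are left out because they don't change these tokens, and word_delimiter is approximated by splitting each whitespace token on non-alphanumeric characters.

```python
import re

# Small stand-in for the _english_ stop word list (assumption: only a subset).
STOPWORDS = {"a", "an", "and", "of", "the", "to"}


def analyze(text):
    """Approximate custom_analyzer: whitespace tokenizer, split on
    non-alphanumeric characters, lowercase, drop stop words."""
    tokens = []
    for chunk in text.split():
        for part in re.split(r"[^0-9A-Za-z]+", chunk):
            part = part.lower()
            if part and part not in STOPWORDS:
                tokens.append(part)
    return tokens


doc_tokens = analyze("www.google.co.uk hello com")
query_tokens = analyze("google.com")

print(doc_tokens)    # ['www', 'google', 'co', 'uk', 'hello', 'com']
print(query_tokens)  # ['google', 'com']

# Every query token is present in the document, so the document matches
# even though "google.com" never appears in the text.
print(all(t in doc_tokens for t in query_tokens))  # True
```

The trailing "com" in the body is what makes the match succeed for both query tokens, not just one of them.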

So I was thinking of implementing a token filter that parses URLs.
For the token "www.google.co.uk", the filter is supposed to generate the following tokens: www.google.co.uk (original), google.co.uk, google.
Then, at query time, I would use a simple analyser that doesn't tokenise the input. So if the user searches for "google" or "google.co.uk", they will get the expected results.
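A minimal Python sketch of that idea, just to check the behaviour and not an actual Elasticsearch filter. The names `expand_url_token`, `index_tokens` and `matches` are hypothetical, and the heuristic is an assumption: strip a leading "www." and then take the first label. A real filter would probably need a public suffix list to handle hosts like mail.google.com.

```python
def expand_url_token(token):
    """Return the original token plus the shorter variants to index.

    Heuristic (assumption): drop a leading "www.", then keep the first label.
    """
    token = token.lower()
    variants = [token]
    host = token[4:] if token.startswith("www.") else token
    if host and host != token:
        variants.append(host)
    first_label = host.split(".", 1)[0]
    if first_label and first_label not in variants:
        variants.append(first_label)
    return variants


def index_tokens(text):
    """Index side: whitespace tokenizer followed by the expansion above."""
    tokens = []
    for chunk in text.split():
        tokens.extend(expand_url_token(chunk))
    return tokens


def matches(query, text):
    """Query side: the query is lowercased but not tokenised."""
    return query.strip().lower() in index_tokens(text)


doc = "www.google.co.uk hello com"
print(expand_url_token("www.google.co.uk"))
# ['www.google.co.uk', 'google.co.uk', 'google']
print(matches("google", doc))        # True
print(matches("google.co.uk", doc))  # True
print(matches("google.com", doc))    # False
```

With this scheme "google.com" no longer matches, because it is looked up as a single term and that term was never indexed.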

What do you think?

system · December 22, 2016, 11:10am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
