Has anyone tried URL normalization with Elasticsearch? Or can you share
"URL normalization <http://en.wikipedia.org/wiki/URL_normalization> (or
URL canonicalization) is the process by which URLs are modified and
standardized in a consistent manner. The goal of the normalization process
is to transform a URL into a normalized or canonical URL so it is possible
to determine if two syntactically different URLs are equivalent." from
Here is an example of the problem:
URL: http://mysite.com
Description: A webmaster may consider this their authoritative or
canonical URL for their homepage.

URL: http://www.mysite.com
Description: However, you can add 'www' to most websites and still get
the same home page.

URL: http://mysite.com/default.aspx
Description: You can also often add the specific filename of the homepage
and get the same page.

URL: http://mysite.com/default.aspx?promo=ABC
Description: Many times websites use parameters to track things like where
customers are coming from (in this case an offline promotion), or
parameters that determine how the content on the page is formatted.
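
To make the problem concrete, here is a rough Python sketch of the kind of normalization I have in mind, applied to each URL before the document is indexed. It is only an illustration under stated assumptions: normalize_url, DEFAULT_PAGES and TRACKING_PARAMS are hypothetical names of my own, the list of default filenames and tracking parameters would have to be chosen per site, and stripping 'www' is not safe for every host.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed values for illustration only; real sites need their own lists.
DEFAULT_PAGES = {"default.aspx", "index.html", "index.htm"}
TRACKING_PARAMS = {"promo"}


def normalize_url(url):
    """Return a canonical form of url under the assumptions above."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()

    # Lowercase the host and drop a leading "www." (an assumption:
    # some sites serve different content with and without it).
    # Any user:password@ prefix is discarded by using .hostname.
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if parts.port:
        host = f"{host}:{parts.port}"

    # Drop a trailing default filename such as default.aspx.
    path = parts.path
    last_segment = path.rsplit("/", 1)[-1]
    if last_segment.lower() in DEFAULT_PAGES:
        path = path[: len(path) - len(last_segment)]
    # Treat "/" and the empty path as the same homepage.
    if path == "/":
        path = ""

    # Remove tracking parameters and sort the rest so that
    # parameter order does not produce distinct URLs.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = sorted((k, v) for k, v in pairs if k.lower() not in TRACKING_PARAMS)
    query = urlencode(kept)

    # The fragment is dropped; it is never sent to the server.
    return urlunsplit((scheme, host, path, query, ""))


if __name__ == "__main__":
    for u in (
        "http://mysite.com",
        "http://www.mysite.com",
        "http://mysite.com/default.aspx",
        "http://mysite.com/default.aspx?promo=ABC",
    ):
        print(u, "->", normalize_url(u))
```

With these assumptions all four example URLs above collapse to the same string, which could then be stored in its own field next to the original URL so that equivalent pages can be matched or deduplicated.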