URL normalization with Elasticsearch


Has anyone tried URL normalization with Elasticsearch? If so, could you
share some hints?

"URL normalization (or URL canonicalization) is the process by which URLs
are modified and standardized in a consistent manner. The goal of the
normalization process is to transform a URL into a normalized or canonical
URL so it is possible to determine if two syntactically different URLs are
equivalent."
(from Wikipedia: http://en.wikipedia.org/wiki/URL_normalization)

Here is an example of the problem:


URL: http://mysite.com
Description: A webmaster may consider this their authoritative or
canonical URL for their homepage.

URL: http://www.mysite.com
Description: However, you can add 'www' to most websites and still get
the same home page.

URL: http://mysite.com/default.aspx
Description: You can also often add the specific filename of the homepage
and get the same page.

URL: http://mysite.com/default.aspx?promo=ABC
Description: Many times websites use parameters to track things like where
customers are coming from (in this case an offline promotion), or
parameters that determine how the content on the page is formatted.
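
To make the cases above concrete, here is a rough sketch of a normalizer
that could run on the client side before a URL is sent for indexing. It is
plain Python and not an Elasticsearch feature. The function name
normalize_url and the DEFAULT_DOCUMENTS and TRACKING_PARAMS sets are
assumptions made up for this illustration; which filenames and parameters
are safe to drop depends entirely on the site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical, site-specific assumptions; adjust for your own data.
DEFAULT_DOCUMENTS = {"default.aspx", "index.html", "index.htm"}
TRACKING_PARAMS = {"promo"}
DEFAULT_PORTS = {("http", 80), ("https", 443)}


def normalize_url(url):
    """Return a canonical form of `url` under the assumptions above.

    Userinfo and fragments are dropped in this sketch.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()

    # Treat "www.mysite.com" and "mysite.com" as the same host.
    if host.startswith("www."):
        host = host[4:]

    # Keep the port only when it is not the default for the scheme.
    if parts.port is not None and (scheme, parts.port) not in DEFAULT_PORTS:
        host = f"{host}:{parts.port}"

    # Drop a trailing default document, then map an empty path to "/".
    path = parts.path
    head, _, last = path.rpartition("/")
    if last.lower() in DEFAULT_DOCUMENTS:
        path = head + "/"
    if not path:
        path = "/"

    # Remove tracking parameters and sort the remaining ones.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ]
    query = urlencode(sorted(kept))

    return urlunsplit((scheme, host, path, query, ""))
```

With these assumptions, all four example URLs above map to the same string,
so the normalized value could be stored in an exact-match (not analyzed)
field next to the original URL and used to detect equivalent documents.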


Have you made any progress on this topic since you posted this problem? If
so, could you share what you found?

