HTML structural similarity

(Peyman Mohajerian) #1

My goal is to find structural similarity among the web sites that I'm crawling. Content-based similarity is easy with the 'more like this' query, but I'm interested in structural similarity: for example, inferring that the author who created site X is also behind site Y because the sites' source code is similar even though their content may not be. I have found one relevant publication:
Buttler, David. "A Short Survey of Document Similarity Algorithms," IC'04 June 21-24, Las Vegas, NV (2004).

There is perhaps more recent work as well. My question is: how do I do this in Elasticsearch? My first idea was to extract the XML DOM of each web site and compare them using the 'more like this' query, but I doubt it will work because that query does a text-based comparison.
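
To make the idea concrete, here is a rough sketch of one way the DOM could be turned into text that a term-based query might compare meaningfully: drop the content, keep only the sequence of opening tags, and emit overlapping runs of k tags ("shingles") as tokens. This is my own illustration, not something taken from the paper above. The names `TagSequenceParser`, `tag_shingles`, `jaccard` and `structure_text` are made up for this example, and k=3 is an arbitrary choice.

```python
from html.parser import HTMLParser


class TagSequenceParser(HTMLParser):
    """Collects the names of opening tags in document order, ignoring text."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Attributes and text content are deliberately dropped.
        self.tags.append(tag)


def tag_shingles(html, k=3):
    """Return the set of k-tag shingles (hypothetical helper, k is an assumption)."""
    parser = TagSequenceParser()
    parser.feed(html)
    parser.close()
    tags = parser.tags
    if len(tags) < k:
        # Too short for a full shingle: fall back to one token, or none.
        return {"_".join(tags)} if tags else set()
    return {"_".join(tags[i:i + k]) for i in range(len(tags) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets, for a quick offline check."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def structure_text(html, k=3):
    """Join shingles into one whitespace-separated string of tokens."""
    return " ".join(sorted(tag_shingles(html, k)))
```

The string from `structure_text` could then be stored in its own field, assuming a whitespace-style analyzer so that each shingle stays a single term, and that field used with 'more like this'. Whether the term statistics of such tokens rank structurally similar sites well is something I have not verified.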

Thanks in advance
