I have a bunch of sitemaps containing URLs and their last-modified times. I want to fetch each URL (get the HTML), parse the content (extract the title, links, text, etc.), and finally index it into Elasticsearch. In the future, I may also have to deal with PDF, Doc, and other kinds of content at some of the URLs.
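To make the pipeline concrete, here is a minimal standard-library sketch of what I have in mind (the document shape and field names are just my assumptions, not a fixed schema); actually fetching the pages and calling Elasticsearch, e.g. via elasticsearch-py's bulk helper, is left out:

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Standard sitemap XML namespace
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_text):
    """Return (url, lastmod) pairs from a sitemap XML string."""
    root = ET.fromstring(xml_text)
    pairs = []
    for url_el in root.findall(SITEMAP_NS + "url"):
        loc = url_el.findtext(SITEMAP_NS + "loc")
        lastmod = url_el.findtext(SITEMAP_NS + "lastmod")
        if loc:
            pairs.append((loc.strip(), lastmod))
    return pairs


class PageExtractor(HTMLParser):
    """Collect the title, links, and visible text from an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self.text_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif data.strip():
            self.text_parts.append(data.strip())


def to_es_doc(url, lastmod, html):
    """Build the document body I would index into Elasticsearch
    (field names are illustrative)."""
    extractor = PageExtractor()
    extractor.feed(html)
    return {
        "url": url,
        "last_modified": lastmod,
        "title": extractor.title,
        "links": extractor.links,
        "text": " ".join(extractor.text_parts),
    }
```

In a real setup I would swap the hand-rolled `HTMLParser` for a proper extractor and batch the documents through a bulk indexing call, which is exactly where a framework would help.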
So far I have looked at Nutch, Scrapy, and StormCrawler. I am trying to keep it simple, with room for further improvement in the future. I would like to go with a solution that is widely adopted and well supported. Does anyone have any recommendations?