I have a bunch of sitemaps containing URLs and their last-modified times. I want to fetch each URL (get the HTML), parse the content (extract the title, links, text, etc.), and finally index it into Elasticsearch. In the future, I may also have to deal with PDF, Doc, and other kinds of content at some of the URLs.
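To make the pipeline concrete, here is a minimal standard-library sketch of what I have in mind (the document shape and field names are just my assumptions, not a fixed schema); actually fetching the pages and calling Elasticsearch, e.g. via elasticsearch-py's bulk helper, is left out:

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Standard sitemap XML namespace
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_text):
    """Return (url, lastmod) pairs from a sitemap XML string."""
    root = ET.fromstring(xml_text)
    pairs = []
    for url_el in root.findall(SITEMAP_NS + "url"):
        loc = url_el.findtext(SITEMAP_NS + "loc")
        lastmod = url_el.findtext(SITEMAP_NS + "lastmod")
        if loc:
            pairs.append((loc.strip(), lastmod))
    return pairs


class PageExtractor(HTMLParser):
    """Collect the title, links, and visible text from an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self.text_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif data.strip():
            self.text_parts.append(data.strip())


def to_es_doc(url, lastmod, html):
    """Build the document body I would index into Elasticsearch
    (field names are illustrative)."""
    extractor = PageExtractor()
    extractor.feed(html)
    return {
        "url": url,
        "last_modified": lastmod,
        "title": extractor.title,
        "links": extractor.links,
        "text": " ".join(extractor.text_parts),
    }
```

In a real setup I would swap the hand-rolled `HTMLParser` for a proper extractor and batch the documents through a bulk indexing call, which is exactly where a framework would help.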
So far I have looked at Nutch, Scrapy, and StormCrawler. I am trying to keep it simple, with room for further improvement in the future. I would like to go with a solution that is widely adopted and well supported. Does anyone have any recommendations?