How to set up website crawling with Elasticsearch

(Arul Krishnamoorthy) #1

My understanding is that even the latest version of Elasticsearch (5.1.2) has no built-in functionality for crawling a website.

What are the options, and are there any recommendations among them?

(David Pilato) #2

Is it a public website with static pages or a private one built from database data?

(Arul Krishnamoorthy) #3

Thanks, David, for your response. It's a public website, and a section of the pages contains dynamic data.

(David Pilato) #4

So I guess you don't have access to the data source? The structured data, I mean?

Maybe using Nutch could help? I know there are some recipes on the web about connecting Nutch and Elasticsearch.

(Arul Krishnamoorthy) #5

Thanks again, David. I was thinking that a Nutch implementation would itself be heavyweight, given its dependency on an underlying store. Are there any lightweight options?

(David Pilato) #6

I don't know. I have never done any web crawling.
I always prefer indexing from the data source rather than from the rendered pages. But maybe that's not possible for you.
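
When only the rendered pages are available, one lightweight route might look like the following. This is a minimal sketch, not a recommendation from this thread: it uses only the Python standard library to pull the title and visible text out of an already-fetched page and to build a request body in the format of the Elasticsearch bulk API. The names `PageTextExtractor`, `page_to_document` and `bulk_payload`, the index name `website`, the type name `page`, and the field names `url`, `title` and `body` are all hypothetical choices made for illustration.

```python
import json
from html.parser import HTMLParser


class PageTextExtractor(HTMLParser):
    """Collects the <title> and the visible text of one HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._in_title = False
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return  # ignore whitespace-only data and script/style content
        if self._in_title:
            self.title += text
        else:
            self.chunks.append(text)


def page_to_document(url, html):
    """Turn one fetched page into a flat document (hypothetical field names)."""
    parser = PageTextExtractor()
    parser.feed(html)
    parser.close()
    return {"url": url, "title": parser.title, "body": " ".join(parser.chunks)}


def bulk_payload(docs, index="website", doc_type="page"):
    """Build a newline-delimited bulk request body: one action line per document.

    Assumption: "_type" is included because Elasticsearch 5.x expects it;
    later major versions deprecate and then remove mapping types.
    """
    lines = []
    for doc in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc["url"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline
```

The resulting string would be sent with an HTTP POST to the cluster's `_bulk` endpoint. Fetching the pages, following links, honouring robots.txt and rendering the dynamic sections are all left out here, and those are the parts where a full crawler such as Nutch does most of its work.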

(Arul Krishnamoorthy) #7

No problem. Thank you David.
