Elasticsearch 6.4 compatible web crawler


(Subasini Rath) #1

Hi All,
I want to use Elasticsearch 6.4 to replace GSA in my application. The new setup will need to crawl and index more than 60 websites, including their child links.

Since we are not going to use a cloud-based solution:

  1. Can I use the Nutch 2.3.1 crawler and MySQL with Elasticsearch 6.4?
  2. Which other software do I need to replace GSA with Elasticsearch?
  3. Does anybody have any steps on how to implement this?

Please respond. This is a high priority.


(David Pilato) #2

Read this and specifically the "Also be patient" part.

I personally consider that someone who has a cluster in production down is more urgent than a question about a project that does not exist yet.

Anyway, some answers:

  1. No idea. I have never used Nutch. Maybe ask on the Nutch mailing list, if there is one?
  2. It depends on what your needs are. For example, I wrote FSCrawler to crawl files on disk and parse them with Apache Tika.
  3. The question is too broad.

(Will Johnson) #3

Depending on your needs and your current GSA configuration, there really aren't OSS web crawlers out there that cover everything GSA does and everything websites produce these days. Handling things like modern JavaScript frameworks and complex authentication is incredibly difficult, and many commercial crawlers still struggle with those technologies. That being said, commercial offerings are probably your best bet if you really are in a hurry to solve the problem.

If you did want to build your own, Nutch is a decent place to start, but be prepared to spend a long time making it do everything the GSA does today.

As for other software, you could conceivably cobble everything together, including document parsers, linguistic packages, etc., but again it can take a while to make it all work together.
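To give a rough sense of what the simplest part of that glue looks like, here is a minimal sketch in Python. It is not how Nutch or GSA work internally. It only turns an already-fetched HTML page into a document and formats a batch of documents for the Elasticsearch `_bulk` API. The index name `web-pages`, the field names `url`, `content` and `links`, and the use of the URL as the document ID are all assumptions made for illustration. It does no fetching, JavaScript rendering, authentication or politeness handling, and those are the hard parts.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects visible text and outgoing links from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def page_to_doc(url, html):
    """Builds one document from a fetched page (field names are hypothetical)."""
    parser = PageParser(url)
    parser.feed(html)
    parser.close()
    return {
        "url": url,
        "content": " ".join(parser.text_parts),
        "links": parser.links,
    }


def bulk_body(docs, index="web-pages", doc_type="_doc"):
    """Formats documents as newline-delimited JSON for the _bulk API.

    Elasticsearch 6.x still expects a _type on each action line.
    Using the URL as _id (an assumption) makes re-crawls overwrite old copies.
    """
    lines = []
    for doc in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc["url"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"
```

The resulting string would be sent as the body of a `POST` to the `_bulk` endpoint (for example `http://localhost:9200/_bulk` on a default local install, which is an assumption about your setup) with the header `Content-Type: application/x-ndjson`. The `links` list is what a crawler would feed back into its queue of URLs to visit.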


(Subasini Rath) #4

As per my requirements, I am trying Nutch 2.3.1. If anybody has any documents or links about using Elasticsearch with Nutch, please share.

