Can Apache Nutch be used with Elasticsearch to index web crawl content?


(Shaunak Kashyap) #1

In general, you are free to use any web crawler to fetch URL content. Apache Nutch is certainly one of the more popular open source web crawlers available. Starting with Nutch 1.7, there is an ElasticSearchWriter class written by the Apache team for integration with Elasticsearch.

However, please keep in mind that the Nutch plugin class (ElasticSearchWriter) mentioned above is not written, owned, tested, or certified by Elasticsearch, so it is not an integration module we support.

If you run into issues with the Nutch Elasticsearch plugin, please file a ticket with the Apache Nutch team. Additional resources, such as the Nutch mailing lists, are also available.

Instead of relying on the Nutch plugin, you can write custom code that pulls data out of the default Apache Nutch storage and calls the Elasticsearch API to build the index. This gives you full control of the ingest/import pipeline, which matters because third-party plugins may break and may not be updated to work with the latest Elasticsearch versions.
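As a rough illustration of the custom-code route, here is a minimal Python sketch that turns already-exported crawl records into a request body for the Elasticsearch `_bulk` endpoint. It assumes the pages have already been pulled out of Nutch storage into plain dictionaries; the field names (`url`, `title`, `content`), the index name, the helper names `build_bulk_body` and `send_bulk`, and the `http://localhost:9200` address are all hypothetical. The action line omits `_type`, which older Elasticsearch releases require, so treat this as a starting point rather than a drop-in solution.

```python
import hashlib
import json
import urllib.request


def build_bulk_body(pages, index_name):
    """Serialize crawled pages into newline-delimited JSON for the _bulk endpoint."""
    lines = []
    for page in pages:
        # Derive a stable id from the URL so that re-crawling a page
        # overwrites the earlier copy instead of adding a duplicate.
        doc_id = hashlib.sha1(page["url"].encode("utf-8")).hexdigest()
        # Each document is sent as two lines: an action line, then the source.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(page))
    # The bulk body must be terminated by a newline.
    return "\n".join(lines) + "\n"


def send_bulk(body, es_url="http://localhost:9200"):
    """POST the bulk body to Elasticsearch (es_url is an assumed local node)."""
    request = urllib.request.Request(
        es_url + "/_bulk",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `send_bulk(build_bulk_body(pages, "crawl"))` would then index the batch. Hashing the URL for the document id is just one possible choice; any stable, unique key would do.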


(Rodrigo Nunes) #2

Yes, and it works pretty well once you get all the kinks ironed out in the configuration files. I got a news crawler indexing to ES 1.4 a while ago. I had to fight with it for a few hours, but I can dig up that code if it helps. I did not get it to work with Nutch 2.x, though there are a few tutorials out there.

