In general, you are free to use any web crawler to fetch URL content. Apache Nutch is one of the more popular open source web crawlers. Starting with Nutch 1.7, Nutch includes an ElasticSearchWriter class, written by the Apache team, for integration with Elasticsearch.
However, please keep in mind that the Nutch plugin class (ElasticSearchWriter) mentioned above is not written, owned, tested, or certified by Elasticsearch, so it is not an integration module we support.
If you run into issues with the Nutch Elasticsearch plugin, please file a ticket with the Apache Nutch team. There are also additional resources such as Nutch mailing lists available.
Instead of relying on the Nutch plugin, you can write custom code that pulls data out of the default Apache Nutch storage and invokes the Elasticsearch API to index it. This gives you full control of the ingest/import pipeline, which matters because third-party plugins may break and may not be updated to work with the latest Elasticsearch versions.
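
As a rough illustration of the custom-code approach, the sketch below turns already-exported crawl records into a request body for the Elasticsearch bulk endpoint and posts it over HTTP. It is a minimal sketch under several assumptions, not a supported integration: the record fields (url, title, content), the index name "crawl", the helper names build_bulk_body and send_bulk, and the localhost URL are all hypothetical, and the step that reads records out of Nutch storage is left to you. It also assumes a recent Elasticsearch version in which bulk actions do not need a mapping type.

```python
import json
import urllib.request


def build_bulk_body(records, index_name):
    """Serialize records into the newline-delimited JSON that _bulk expects."""
    lines = []
    for record in records:
        # Assumption: each record has a "url" field. Using it as the document
        # ID means a re-crawl of the same page overwrites the earlier copy.
        action = {"index": {"_index": index_name, "_id": record["url"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(record))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


def send_bulk(es_url, body):
    """POST a bulk body to Elasticsearch and return the parsed JSON response."""
    request = urllib.request.Request(
        es_url.rstrip("/") + "/_bulk",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical records standing in for data exported from Nutch storage.
    pages = [
        {"url": "http://example.com/", "title": "Example", "content": "Hello"},
        {"url": "http://example.com/about", "title": "About", "content": "Info"},
    ]
    body = build_bulk_body(pages, "crawl")
    print(body)
    # Uncomment to send to a cluster you control (URL is an assumption):
    # result = send_bulk("http://localhost:9200", body)
    # print("errors:", result.get("errors"))
```

Keeping the payload builder separate from the HTTP call lets you test the serialization without a running cluster and swap in an official Elasticsearch client later if you prefer. In a real pipeline you would also send records in batches and inspect the per-item results in the bulk response, since a bulk request can succeed overall while individual documents fail.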