In general, you are free to use any web crawler to fetch URL content. Apache Nutch is one of the more popular open source web crawlers. Starting with Nutch 1.7, Nutch includes an ElasticSearchWriter class, written by the Apache team, for integration with Elasticsearch.
However, please keep in mind that the Nutch plugin class (ElasticSearchWriter) mentioned above is not written, owned, tested, or certified by Elasticsearch, so it is not an integration module we support.
Instead of relying on the Nutch plugin, you can write custom code that pulls data out of the default Apache Nutch storage and calls the Elasticsearch API to index it. This gives you full control of the ingest/import pipeline, which matters because third-party plugins may break and may not be updated to work with the latest Elasticsearch versions.
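
As a rough illustration of the custom-code approach, here is a minimal Python sketch that turns already-exported pages into a request body for the Elasticsearch bulk API and posts it to the _bulk endpoint. Several things here are assumptions rather than part of Nutch or Elasticsearch: the pages are assumed to have been exported from Nutch storage into a list of dicts with "url", "title" and "content" keys (how you export them is up to you), the function names build_bulk_body and send_bulk are hypothetical, the index name and the http://localhost:9200 address are placeholders, and the action line assumes a recent Elasticsearch version that does not require a document type.

```python
import hashlib
import json
import urllib.request


def build_bulk_body(pages, index_name):
    """Build a newline-delimited JSON body for the Elasticsearch _bulk endpoint.

    `pages` is assumed to be an iterable of dicts with a "url" key and
    optional "title" and "content" keys (a hypothetical export format).
    """
    lines = []
    for page in pages:
        url = page["url"]
        # Hash the URL to get a short, stable document ID, so a re-crawled
        # page overwrites its earlier version instead of creating a duplicate.
        doc_id = hashlib.sha1(url.encode("utf-8")).hexdigest()
        # Action line: tells Elasticsearch to index the document that follows.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        # Source line: the document itself.
        lines.append(json.dumps({
            "url": url,
            "title": page.get("title", ""),
            "content": page.get("content", ""),
        }))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


def send_bulk(body, es_url="http://localhost:9200"):
    """POST a bulk body to Elasticsearch and return the parsed JSON response.

    `es_url` is a placeholder; point it at your own cluster.
    """
    request = urllib.request.Request(
        es_url + "/_bulk",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

A typical call would be send_bulk(build_bulk_body(pages, "crawl")), followed by a check of the "errors" field in the response, since the bulk API reports per-document failures there rather than failing the whole request. Because this code is yours, you can adapt the field mapping or the request format whenever you upgrade Elasticsearch.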