Indexing Nutch Results

Adam_Estrada · August 24, 2011, 5:00pm

Has anyone been able to do this? I am using Nutch to crawl the web and
now I would like to store my results in ElasticSearch rather than in
Solr.

Thoughts?

Adam

Tomislav_Poljak · August 29, 2011, 3:55pm

Hi,

2011/8/24 Adam Estrada estrada.adam@gmail.com:

Has anyone been able to do this? I am using Nutch to crawl the web and
now I would like to store my results in Elasticsearch rather than in
Solr.

Thoughts?

I don't think such an integration exists at the moment, but if you
check the Nutch-Solr integration code you can see that a Nutch-ES
integration would be very similar. Nutch integrates with Solr through
two commands (in the bin/nutch script): Solr indexing and Solr
de-duplication.
...
elif [ "$COMMAND" = "solrindex" ] ; then
CLASS=org.apache.nutch.indexer.solr.SolrIndexerJob
elif [ "$COMMAND" = "solrdedup" ] ; then
CLASS=org.apache.nutch.indexer.solr.SolrDeleteDuplicates
...

Solr indexing of Nutch crawled content is implemented in
SolrIndexerJob. If you check the code
(org.apache.nutch.indexer.solr.SolrIndexerJob) you will see it uses
SolrJ (specifically CommonsHttpSolrServer) to post the data to Solr for
indexing. So an ElasticSearchIndexerJob would need to be implemented
(similar to SolrIndexerJob, also extending IndexerJob), with the SolrJ
code replaced by ES indexing client code (for example, the index
operation of the ES Java API).
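
As a rough illustration of what the ES side of such a job would have to
produce, here is a minimal sketch that turns crawled documents into the
body of an Elasticsearch _bulk request. It deliberately uses the REST
bulk format instead of the Java client API mentioned above, so it needs
no ES dependency. The class and method names are hypothetical, the "id"
field holding the URL is an assumption, and older Elasticsearch releases
also expect a type (in the URL or as "_type" in the action line).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: class and method names are hypothetical, not part of Nutch or Elasticsearch.
class EsBulkPayloadSketch {

    // Minimal JSON string escaping for quotes, backslashes and control characters.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.toString();
    }

    // Serialize a flat field map (string values only) as one JSON object.
    static String toJson(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (!first) sb.append(',');
            sb.append('"').append(escape(e.getKey())).append("\":\"")
              .append(escape(e.getValue())).append('"');
            first = false;
        }
        return sb.append('}').toString();
    }

    // Build a _bulk body: one action line plus one source line per document,
    // each terminated by a newline. Assumes every document has an "id" field,
    // which is used as the Elasticsearch document id.
    static String bulkBody(String index, List<Map<String, String>> docs) {
        StringBuilder sb = new StringBuilder();
        for (Map<String, String> doc : docs) {
            sb.append("{\"index\":{\"_index\":\"").append(escape(index))
              .append("\",\"_id\":\"").append(escape(doc.get("id"))).append("\"}}\n");
            sb.append(toJson(doc)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", "http://example.com/");
        doc.put("title", "Example");
        // Prints the two-line bulk body for one document in a hypothetical "nutch" index.
        System.out.print(bulkBody("nutch", List.of(doc)));
    }
}
```

The resulting string would be sent to the cluster's _bulk endpoint. A
real ElasticSearchIndexerJob would more likely hand the documents to an
ES client from inside the job and let the client do the serialization.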

The other part of the Nutch-Solr integration is deduplication
(http://wiki.apache.org/nutch/bin/nutch_dedup), where duplicate
documents are removed from the index based on either the same content
(via MD5 hash) or the same URL. Here the iteration through documents
and the duplicate deletes (the job queries the Solr server and removes
duplicates), currently implemented with SolrJ, need to be replaced
with the equivalent ES Java API calls.
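
The selection step of such a dedup job does not depend on Solr or ES
at all, so it can be sketched in isolation. The following is a minimal
sketch, assuming each indexed document exposes an id, a content digest
and a fetch timestamp, and assuming a "keep the newest copy" rule. The
rule and all names (IndexedDoc, idsToDelete) are illustrative
assumptions, not the actual Nutch behaviour. Fetching the documents
from the index and issuing the deletes would still go through an ES
client.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: IndexedDoc and idsToDelete are hypothetical names, not Nutch or ES classes.
class DedupSketch {

    // The fields the dedup decision needs: document id (URL), content digest, fetch time.
    static final class IndexedDoc {
        final String id;
        final String digest;
        final long tstamp;

        IndexedDoc(String id, String digest, long tstamp) {
            this.id = id;
            this.digest = digest;
            this.tstamp = tstamp;
        }
    }

    // Return the ids of documents to delete: for each digest the newest
    // document is kept and every other document with that digest is listed.
    // On equal timestamps the document seen first is kept (an arbitrary choice).
    static List<String> idsToDelete(List<IndexedDoc> docs) {
        Map<String, IndexedDoc> keep = new HashMap<>();
        List<String> delete = new ArrayList<>();
        for (IndexedDoc doc : docs) {
            IndexedDoc current = keep.get(doc.digest);
            if (current == null) {
                keep.put(doc.digest, doc);      // first document with this digest
            } else if (doc.tstamp > current.tstamp) {
                delete.add(current.id);         // newer copy replaces the kept one
                keep.put(doc.digest, doc);
            } else {
                delete.add(doc.id);             // older or equal copy is dropped
            }
        }
        return delete;
    }
}
```

The returned ids are what the job would then pass to the index for
deletion.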

Hope this helps.

Tomislav

