Indexing Nutch Results


(Adam Estrada) #1

Has anyone been able to do this? I am using Nutch to crawl the web and
now I would like to store my results in ElasticSearch rather than in
Solr.

Thoughts?

Adam


(Tomislav Poljak) #2

Hi,

2011/8/24 Adam Estrada estrada.adam@gmail.com:

> Has anyone been able to do this? I am using Nutch to crawl the web and
> now I would like to store my results in ElasticSearch rather than in
> Solr.
>
> Thoughts?

I don't think such an integration exists at the moment, but if you
check the Nutch-Solr integration code you can see that a Nutch-ES
integration would be very similar. Nutch integrates with Solr through
two commands (in the bin/nutch script): Solr indexing and Solr
de-duplication.
    ...
    elif [ "$COMMAND" = "solrindex" ] ; then
      CLASS=org.apache.nutch.indexer.solr.SolrIndexerJob
    elif [ "$COMMAND" = "solrdedup" ] ; then
      CLASS=org.apache.nutch.indexer.solr.SolrDeleteDuplicates
    ...

Solr indexing of Nutch-crawled content is implemented in
SolrIndexerJob. If you check the code
(org.apache.nutch.indexer.solr.SolrIndexerJob), you will see it uses
SolrJ (specifically CommonsHttpSolrServer) to post the data to Solr for
indexing. So an ElasticSearchIndexerJob would need to be implemented
(similar to SolrIndexerJob, and also extending IndexerJob), with the
SolrJ code replaced by ES indexing client code (for example, the Java
API for indexing:
http://www.elasticsearch.org/guide/reference/java-api/index_.html).
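
To make the idea concrete, here is a rough sketch of the ES side of such a job. It is not Nutch code and it does not use the ES Java client; it only uses the JDK so it runs without dependencies. It shows how crawled fields could be turned into a request body for the Elasticsearch bulk REST endpoint (_bulk). The class name, the index name "nutch", the field names, and the choice of the URL as document id are all my assumptions, and older ES versions also expect a _type in the action line.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Nutch or Elasticsearch.
class BulkBodySketch {

    // Quote a string as a JSON string literal, escaping what JSON requires.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    // Serialize a flat map of string fields as a single-line JSON object.
    static String toJson(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (!first) sb.append(',');
            sb.append(quote(e.getKey())).append(':').append(quote(e.getValue()));
            first = false;
        }
        return sb.append('}').toString();
    }

    // Build a bulk body: one action line plus one source line per document.
    // Assumes every document has a "url" field, used here as the document id.
    static String bulkBody(String index, List<Map<String, String>> docs) {
        StringBuilder sb = new StringBuilder();
        for (Map<String, String> doc : docs) {
            sb.append("{\"index\":{\"_index\":").append(quote(index))
              .append(",\"_id\":").append(quote(doc.get("url"))).append("}}\n");
            sb.append(toJson(doc)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("url", "http://example.com/");
        doc.put("title", "Example page");
        // The printed text is what would be POSTed to the _bulk endpoint.
        System.out.print(bulkBody("nutch", Collections.singletonList(doc)));
    }
}
```

In a real ElasticSearchIndexerJob the ES Java client would build and send these requests for you; the point here is only the shape of the data that replaces the SolrJ documents.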

The other part of the Nutch-Solr integration is deduplication
(http://wiki.apache.org/nutch/bin/nutch_dedup), where duplicate
documents are removed from the index based on either the same content
(via MD5 hash) or the same URL. Here the iteration through documents
and the duplicate deletes (the job queries the Solr server and removes
duplicates), currently implemented with SolrJ, need to be replaced with
the ES Java APIs:
http://www.elasticsearch.org/guide/reference/java-api/search.html and
http://www.elasticsearch.org/guide/reference/java-api/delete.html
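
As a rough illustration of the content-based half of that dedup step, the sketch below groups documents by the MD5 hash of their content and returns the ids that would be deleted. It is plain JDK code, not the Nutch implementation: the Doc class, the method names, and the rule of keeping the most recently fetched copy are assumptions made for the example. In a real job the documents would come from an ES search and the returned ids would be passed to ES deletes.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Nutch or Elasticsearch.
class DedupSketch {

    // Minimal stand-in for an indexed document.
    static final class Doc {
        final String id;
        final String content;
        final long fetchTime;

        Doc(String id, String content, long fetchTime) {
            this.id = id;
            this.content = content;
            this.fetchTime = fetchTime;
        }
    }

    // MD5 of the UTF-8 bytes of the content, as 32 lowercase hex digits.
    static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] hash = md.digest(content.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, hash));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Return the ids of documents whose content duplicates another document.
    // Assumption: among duplicates, the most recently fetched copy is kept.
    static List<String> idsToDelete(List<Doc> docs) {
        Map<String, Doc> keptByDigest = new HashMap<>();
        List<String> toDelete = new ArrayList<>();
        for (Doc d : docs) {
            String digest = md5Hex(d.content);
            Doc kept = keptByDigest.get(digest);
            if (kept == null) {
                keptByDigest.put(digest, d);       // first copy seen
            } else if (d.fetchTime > kept.fetchTime) {
                toDelete.add(kept.id);             // newer copy replaces older
                keptByDigest.put(digest, d);
            } else {
                toDelete.add(d.id);                // older or equal copy is dropped
            }
        }
        return toDelete;
    }

    public static void main(String[] args) {
        List<Doc> docs = new ArrayList<>();
        docs.add(new Doc("a", "same text", 1L));
        docs.add(new Doc("b", "same text", 2L));
        docs.add(new Doc("c", "other text", 1L));
        System.out.println(idsToDelete(docs));
    }
}
```

URL-based dedup would follow the same pattern with the URL as the grouping key instead of the content hash.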

Hope this helps.

Tomislav


