Suggestions on where to start with a crawler?

Hello,

I'm working on a hosted search solution, currently using Bing's API
as a source for providing domain-targeted web search results. I'll
continue to offer this as the free part of a freemium model, but I'm
now researching how to set up my own crawler and search backend,
which will eventually give my customers more control over crawling
and how search results are displayed.

Everything I've read has me liking Elasticsearch for the search
engine. It fits the infrastructure design I have, which is completely
horizontally scalable. I'm also using MongoDB for data storage for
other components right now, so the fact that Elasticsearch is
JSON-oriented makes it a good fit as well.

I'm really new to this area of development, just dipping my toes in
so to speak, so I thought I'd ask the list.

My requirements at this point are fairly simple. I need a crawler
that can honor robots.txt files and be restricted by domain. I'm not
interested in creating another internet-wide crawler; I want to crawl
the sites my customers configure me to. I'd like to find a product/
approach that's proven and stable. My current platform is built
primarily using Python, and while I'm extremely rusty, I can probably
brush up on C if necessary. I'd like to avoid Java, unless it's
something that can possibly run within the same JVM as Elasticsearch.
I'm really not looking for the overhead of additional JVMs, and I'm
not much of a Java developer, but that's a preference, not a
requirement. I also have a strong belief in using the right tool for
the job.
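
To make the robots.txt and per-domain restrictions concrete, the
behaviour I'm after is roughly this Python sketch (the domain set and
the user-agent string are just placeholders for whatever a customer
would configure):

    import urllib.robotparser
    import urllib.request
    from urllib.parse import urlparse

    # Hypothetical customer configuration: only these domains get crawled.
    ALLOWED_DOMAINS = {"example.com"}
    USER_AGENT = "my-crawler"  # placeholder user-agent string

    def allowed_by_robots(url):
        """Check the site's robots.txt before fetching a page."""
        parts = urlparse(url)
        robots = urllib.robotparser.RobotFileParser()
        robots.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
        robots.read()
        return robots.can_fetch(USER_AGENT, url)

    def fetch(url):
        """Fetch a page only if it is in an allowed domain and robots.txt permits it."""
        if urlparse(url).netloc not in ALLOWED_DOMAINS:
            return None  # outside the domains the customer configured
        if not allowed_by_robots(url):
            return None  # disallowed by the site's robots.txt
        with urllib.request.urlopen(url) as response:
            return response.read()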

Hi,

So, elasticsearch does not do crawling itself; you will need to do
the crawling yourself and then index the data into elasticsearch.
Nutch, as a side note, is one such crawler, and I had a chat with
some of its developers who were keen on getting Nutch to index data
into elasticsearch. I am not familiar with other crawlers, but there
are probably several others; the integration point would be to get
the crawled data indexed into elasticsearch.
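
As a rough sketch of that integration point, and assuming the crawler
hands you a URL, title and body, pushing one page into elasticsearch
over its JSON/HTTP API could look something like the Python below
(the index name, type name, fields and the localhost:9200 address are
just placeholders):

    import json
    import urllib.request

    def index_page(doc_id, url, title, body):
        """PUT one crawled page into elasticsearch as a JSON document."""
        doc = json.dumps({"url": url, "title": title, "body": body}).encode("utf-8")
        request = urllib.request.Request(
            f"http://localhost:9200/crawl/page/{doc_id}",  # index/type/id are placeholders
            data=doc,
            headers={"Content-Type": "application/json"},
            method="PUT",
        )
        with urllib.request.urlopen(request) as response:
            return json.loads(response.read())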

-shay.banon

Sorry if I was unclear; I do understand the integration point. I
thought, though, that Elasticsearch does the indexing as the data is
fed into it? I think that's what you're saying. I really am looking
for suggestions about the specific crawler piece. In my own research
everyone points to Nutch, and based on your post, unless someone adds
something to change my mind, I believe I will go with that when I'm
ready to start that part of my product.

Thanks.

Although you mentioned that you weren't keen on Java, you didn't specify
which languages you would like to use.

Perl has some very good crawling libraries, which you could throw
together to create your own crawler. I don't see one that supports
robots.txt out of the box, but with three lines of code you could
change Web::Scraper to do so.

Perl also has an interface to Elasticsearch.

So if you're familiar with Perl, then combining the two parts above
would be quite simple.

Clint

Hi Clinton,

I have the same question. Thank you for your replies. Can you suggest PHP or Java based crawlers that are suited for Elasticsearch integration?

Thanks in advance.

-Nehatha

We are looking at using Nutch (a Java crawler) for our efforts. I put together the source below to write data directly from Nutch to MongoDB, in the same fashion as the SolrIndexer works in Nutch.