I'm working on a hosted search solution, currently using Bing's API as
the source for domain-targeted web search results. I'll
continue to offer this as the free part of a freemium model, but I'm
now researching how to set up my own crawler and search backend, which
will eventually give my customers more control over crawling and how
search results are displayed.
Everything I've read has me liking ElasticSearch for the search
engine. It fits the infrastructure design I have, which is completely
horizontally scalable. I'm also using MongoDB for data storage for
other components right now, so the fact that ElasticSearch is
JSON-oriented makes it a good fit as well.
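
For example, from what I've read, getting a crawled page into
ElasticSearch is just an HTTP PUT of a JSON document, much like what
I'm already storing in MongoDB. This is only my rough understanding:
the "pages" index name and the fields below are made up, and the exact
URL shape varies a bit across ElasticSearch versions:

    import json
    import urllib.request

    # Hypothetical page record; field names are mine, not a required schema.
    doc = {
        "url": "http://example.com/about",
        "title": "About Us",
        "body": "plain-text content extracted from the page...",
    }

    # ElasticSearch accepts the JSON document as-is over its REST API;
    # no up-front schema step is needed just to get started.
    req = urllib.request.Request(
        "http://localhost:9200/pages/_doc/1",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.read().decode("utf-8"))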
I'm really new to this area of development, just dipping my toes in,
so to speak, so I thought I'd ask the list.
My requirements at this point are fairly simple. I need a crawler that
honors robots.txt files and can be restricted by domain. I'm not
interested in creating another internet-wide crawler; I only want to
crawl the sites my customers configure. I'd like to find a
product/approach that's proven and stable. My current platform is
built primarily in Python, and while I'm extremely rusty, I can
probably brush up on C if necessary. I'd like to avoid Java, unless
it's something that can run within the same JVM as ElasticSearch. I'm
really not looking for the overhead of additional JVMs, and I'm not
much of a Java developer, but that's a preference, not a requirement.
I also believe strongly in the right tool for the job.
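
To make the crawler requirement concrete, here's a rough sketch of the
behavior I'm after, using only Python 3's standard library. The
user-agent string and the LinkExtractor class are just mine for
illustration; a real crawler would obviously also need politeness
delays, error handling, and proper queueing:

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen
    from urllib.robotparser import RobotFileParser

    AGENT = "my-crawler"  # hypothetical user-agent string

    class LinkExtractor(HTMLParser):
        """Collect href values from anchor tags."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start_url, max_pages=50):
        domain = urlparse(start_url).netloc

        # Honor robots.txt: fetch and parse it once for the target domain.
        robots = RobotFileParser(urljoin(start_url, "/robots.txt"))
        robots.read()

        seen, queue = {start_url}, deque([start_url])
        while queue and len(seen) <= max_pages:
            url = queue.popleft()
            if not robots.can_fetch(AGENT, url):
                continue  # disallowed by robots.txt
            html = urlopen(url).read().decode("utf-8", errors="replace")

            parser = LinkExtractor()
            parser.feed(html)
            for link in parser.links:
                link = urljoin(url, link)
                # Restrict by domain: only follow links on the configured site.
                if urlparse(link).netloc == domain and link not in seen:
                    seen.add(link)
                    queue.append(link)
            yield url, html

Basically: one configured start URL per customer site, stay on that
domain, and never fetch a path robots.txt disallows. If there's an
existing product that already does this well, I'd much rather use it
than maintain my own.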