I had similar requirements some time ago. The problem I found with the
existing crawlers was that they use blocking IO, while I needed
high-volume crawling with some sort of traffic shaping, so we ended up
implementing an NIO-based crawler on top of HttpComponents (Core /
Client) v4. There are many more, simpler NIO ones now, like:
If you need a lot of HTML manipulation, you usually start with some
sort of tidier, like NekoHTML or JTidy, so that you can then use XPath
to get the content you need. Jaxen is very good for this, and it is
integrated with Dom4J if you build those types of XML nodes. A simpler
but more limited option is an HTML unit testing tool like HtmlUnit,
which integrates all those components; it is basically a web browser
component without the frontend. In my experience HtmlUnit tends to be
difficult to customize for special cases (I don't remember exactly,
but it tends to download everything a browser would: all frames,
iframes, scripts, etc.), and I wouldn't recommend it for large-scale
crawling, only for a quick, simple job or for its intended purpose,
unit testing.
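To illustrate the XPath step, here is a minimal sketch that assumes the
page has already been run through a tidier and is well-formed XHTML. It
uses only the XPath support built into the JDK rather than Jaxen or
Dom4J, and the class name DivExtractor, its method names, and the
"content" CSS class in the usage note are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: pulls content out of tidied (well-formed) XHTML.
class DivExtractor {

    /** Parses well-formed XHTML into a DOM document. */
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("input is not well-formed", e);
        }
    }

    /** Returns the text content of the first div whose class equals cssClass. */
    static String textOfDiv(Document doc, String cssClass) {
        // Assumption: cssClass contains no quote characters.
        String expr = "string(//div[@class='" + cssClass + "'])";
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (String) xpath.evaluate(expr, doc, XPathConstants.STRING);
        } catch (Exception e) {
            throw new IllegalStateException("bad XPath: " + expr, e);
        }
    }

    /** Returns the href of every link inside divs whose class equals cssClass. */
    static List<String> linksInDiv(Document doc, String cssClass) {
        String expr = "//div[@class='" + cssClass + "']//a/@href";
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
            List<String> hrefs = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                hrefs.add(nodes.item(i).getNodeValue());
            }
            return hrefs;
        } catch (Exception e) {
            throw new IllegalStateException("bad XPath: " + expr, e);
        }
    }
}
```

Calling DivExtractor.linksInDiv(doc, "content") on a parsed page gives
only the links inside the matching div, so navigation and footer links
are ignored. With Jaxen the expressions would be the same; only the
evaluation API differs.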
If you go with a custom crawler, remember that you'll have to deal
with the links database and the scheduling of downloads yourself,
among other things, but page parsing, content extraction, and general
NIO infrastructure are well supported out there.
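To give an idea of what the links database and download scheduling
involve, here is a rough in-memory sketch, not what we actually built.
The Frontier class, its method names, and the fixed per-host delay are
all assumptions for illustration; a real crawler would persist this
state and would also need to handle robots.txt, retries, and URL
canonicalization.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical in-memory URL frontier: remembers which links were already
// seen and spaces out requests to the same host by a fixed delay.
class Frontier {
    private final Queue<URI> pending = new ArrayDeque<>();
    private final Set<URI> seen = new HashSet<>();
    private final Map<String, Long> nextAllowedAt = new HashMap<>();
    private final long perHostDelayMillis;

    Frontier(long perHostDelayMillis) {
        this.perHostDelayMillis = perHostDelayMillis;
    }

    /** Queues the URL unless it was seen before; returns true if it was queued. */
    boolean offer(String url) {
        URI uri = URI.create(url).normalize();
        if (!seen.add(uri)) {
            return false;
        }
        pending.add(uri);
        return true;
    }

    /**
     * Returns the oldest queued URL whose host may be contacted at nowMillis,
     * or null if every queued URL is still inside its host's delay window.
     */
    URI poll(long nowMillis) {
        for (Iterator<URI> it = pending.iterator(); it.hasNext();) {
            URI uri = it.next();
            String host = uri.getHost();
            long allowedAt = nextAllowedAt.getOrDefault(host, 0L);
            if (nowMillis >= allowedAt) {
                it.remove();
                nextAllowedAt.put(host, nowMillis + perHostDelayMillis);
                return uri;
            }
        }
        return null;
    }
}
```

The download loop would call poll(System.currentTimeMillis()), fetch
the returned URL, and pass every extracted link back through offer().
Taking the clock as a parameter keeps the scheduling logic easy to test.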
I hope that helps.
On Nov 11, 12:28 pm, Lukáš Vlček lukas.vl...@gmail.com wrote:
Yeah, I noticed that in your Lucene Revolution presentation. My question
would be how much do you manipulate the HTML document content? Do you
extract all data from it or do you extract just some specific portions?
Which tools do you use for this?
On Thu, Nov 11, 2010 at 1:49 PM, Otis otis.gospodne...@gmail.com wrote:
We use Droids for http://search-hadoop.com and http://search-lucene.com
Yes, the community is slim and not very active, but what's there seems
to work. It's much simpler than Nutch or Heritrix.
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
On Nov 10, 7:23 am, Lukáš Vlček lukas.vl...@gmail.com wrote:
does anybody have any experience crawling web content and indexing it
I need to crawl a relatively small number of web pages from a few internet
domains and extract some content from these web pages. Rather than massive
scale, I need very precise control over the crawler and the possibility to
implement custom code.
I need very flexible control of the content extraction and processing, at
least the following:
- extracting just a portion of crawled html page (for example only DIV
elements with specific class or id attributes)
- extracting only specific links from the content identified in the
first item
- adding metadata to each piece of extracted content
- either storing the output in the file system or processing it and
sending it to a search server (e.g. indexing the extracted content with
I think that with my requirements I will end up writing some (possibly
a lot of) custom code. That is OK, but I would like to avoid unnecessary
complexity if possible. I am not considering incremental crawling (it
is fine to reindex every time).
There are many web crawlers out there... but they are either old with
an inactive community or they tend to become complex (sure, web crawling is
There are probably two good candidates: Heritrix and Nutch. Would you
recommend using one of them for the above requirements?
There is also the Apache Droids project and it should allow for custom
but it seems to me that the project is quite inactive.
Any comments are welcome.