Crawling web sites and indexing the extracted content

Hi Otis,
Yeah, I noticed that in your Lucene Revolution presentation. My question
is: how much do you manipulate the HTML document content? Do you extract
all of the data from it, or just specific portions? Which tools do you
use for this?
Thanks,
Lukas

On Thu, Nov 11, 2010 at 1:49 PM, Otis otis.gospodnetic@gmail.com wrote:

Hi Lukáš,

We use Droids for http://search-hadoop.com and http://search-lucene.com.
Yes, the community is slim and not very active, but what's there seems
to work. It's much simpler than Nutch or Heritrix.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

On Nov 10, 7:23 am, Lukáš Vlček lukas.vl...@gmail.com wrote:

Hi,

Does anybody have any experience with crawling web content and indexing
it with ElasticSearch?

I need to crawl a relatively small number of web pages from a few
internet domains and extract some content from them. Rather than a
massive-scale crawler, I need very precise control and the ability to
plug in custom code.

I need very flexible control over the content extraction and processing,
at least the following (see the sketches after this list):

  1. extracting just a portion of a crawled HTML page (for example, only
     DIV elements with specific class or id attributes)
  2. extracting only specific links from the content identified in #1
  3. adding metadata to each extracted piece of content
  4. either storing the output to the file system, or processing it and
     sending it to a search server (i.e. indexing the extracted content
     with the search server)
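For points 1-3, an HTML parser such as jsoup keeps the custom code small.
A minimal sketch, assuming jsoup is on the classpath; the URL, the
"div.article" selector, and the printed fields are placeholders for
illustration, not part of any particular crawler:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class ExtractExample {
        public static void main(String[] args) throws Exception {
            // fetch and parse one page (placeholder URL)
            Document doc = Jsoup.connect("http://example.com/page.html").get();
            // point 1: only DIV elements with a specific class
            for (Element div : doc.select("div.article")) {
                String text = div.text();
                // point 2: only the links inside the extracted content
                for (Element link : div.select("a[href]")) {
                    System.out.println(link.attr("abs:href")); // absolute URL
                }
                // point 3: attach metadata, e.g. the source URL
                System.out.println(doc.location() + " -> " + text);
            }
        }
    }

The same selectors work on documents parsed from strings or files, so the
fetching side can come from whichever crawler ends up being used.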

I think that with my requirements I will end up writing some (possibly
not a lot of) custom code. That is fine, but I would like to avoid
unnecessary complexity if possible. I am not considering incremental
crawling (it is fine to reindex everything each time).
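As for point 4, sending each extracted document to ElasticSearch is just
an HTTP PUT of a JSON document against its index API, so no client
library is strictly required. A minimal sketch, assuming a node on the
default localhost:9200; the "pages"/"page" index and type names, the
document id, and the field names are assumptions for illustration:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class IndexExample {
        public static void main(String[] args) throws Exception {
            // one extracted page as a JSON document (fields are illustrative)
            String json = "{\"url\":\"http://example.com/page.html\","
                        + "\"body\":\"...extracted text...\"}";
            // PUT /{index}/{type}/{id} indexes (or overwrites) the document
            URL url = new URL("http://localhost:9200/pages/page/1");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("PUT");
            conn.setDoOutput(true);
            OutputStream out = conn.getOutputStream();
            out.write(json.getBytes("UTF-8"));
            out.close();
            System.out.println("HTTP " + conn.getResponseCode()); // 200 or 201 on success
            conn.disconnect();
        }
    }

Since everything is reindexed on each run anyway, a deterministic id (for
example a hash of the page URL) would make each run simply overwrite the
documents from the previous one.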

There are many web crawlers out there, but they are either old with
inactive communities, or they tend to become complex (sure, web crawling
is a complex thing...).
There are probably two good candidates: Heritrix and Nutch. Would you
recommend either of them for the above requirements?
There is also the Apache Droids project, which should allow for custom
crawlers, but it seems to me that the project is quite inactive.

Any comments are welcome.

Regards,
Lukas