Crawling web sites and indexing the extracted content

Hi,

Does anybody have any experience with crawling web content and indexing
it with ElasticSearch?

I need to crawl a relatively small number of web pages from a few
internet domains and extract some content from these pages. Rather than
massive crawler scale, I need very precise control and the possibility
to implement custom code.

I need very flexible control over the content extraction and
processing, at least the following:

  1. extracting just a portion of a crawled HTML page (for example, only
    DIV elements with specific class or id attributes)
  2. extracting only specific links from the content identified in #1
  3. adding metadata to each piece of extracted content
  4. either storing the output to the file system, or processing it and
    sending it to a search server (i.e. indexing the extracted content
    with the search server); a rough sketch of what I have in mind for
    this step follows below
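
Just to make #4 concrete, this is roughly what I imagine: an untested
sketch against the ElasticSearch REST API, where the index, type, id and
field names are made up purely for illustration:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class IndexExtractedContent {

    // Hypothetical index/type layout; ElasticSearch creates the index on
    // first use, so this is enough for a quick test.
    private static final String BASE = "http://localhost:9200/crawl/page/";

    // Sends one extracted document (already serialized as JSON) to the server.
    static void index(String id, String json) throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(BASE + id).openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        OutputStream out = conn.getOutputStream();
        out.write(json.getBytes("UTF-8"));
        out.close();
        if (conn.getResponseCode() >= 300) {
            throw new IllegalStateException(
                    "indexing failed: HTTP " + conn.getResponseCode());
        }
        conn.disconnect();
    }

    public static void main(String[] args) throws Exception {
        // Extracted content plus custom metadata (requirement #3) as JSON.
        String doc = "{\"url\":\"http://example.com/page\","
                + "\"section\":\"news\","
                + "\"body\":\"text extracted from the selected DIVs\"}";
        index("example.com-page", doc);
    }
}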

I think that with my requirements I will end up writing some (possibly
not a lot of) custom code. That is OK, but I would like to avoid
unnecessary complexity if possible. I am not considering incremental
crawling (it is fine to reindex everything every time).

There are many web crawlers out there... but they are either old with
an inactive community, or they tend to become complex (sure, web
crawling is a complex thing...). There are probably two good candidates:
Heritrix and Nutch. Would you recommend using one of them for the above
requirements? There is also the Apache Droids project, which should
allow for custom crawlers, but it seems to me that the project is quite
inactive.

Any comments are welcome.

Regards,
Lukas

Hi Lukas

I know that Java is your language of choice, but if you're looking for
flexible libraries that do what you need above, then there are several
Perl modules that will help.

See:

clint

Hi Lukáš,

We use Droids for http://search-hadoop.com and http://search-lucene.com
Yes, the community is slim and not very active, but what's there seems
to work. It's much simpler than Nutch or Heritrix.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

Hi Otis,
Yeah, I noticed that in your Lucene Revolution presentation. My question
would be: how much do you manipulate the HTML document content? Do you
extract all the data from it, or just some specific portions? Which
tools do you use for this?
Thanks,
Lukas

Hi Lukas,

I had similar requirements some time ago. The problem I found with the
crawlers out there was the blocking IO they use, but I needed
high-volume crawling with some sort of traffic shaping, so we ended up
implementing an NIO-based crawler on top of HttpComponents (Core /
Client) v4. There are many more, simpler, NIO ones now, like:
https://github.com/AsyncHttpClient/async-http-client

If you need a lot of HTML manipulation, you usually start with some
sort of tidier, like NekoHTML or JTidy, so that you can then use XPath
to get the content you need. Jaxen is very good for this and is
integrated with Dom4J if you build those types of XML nodes. A little
simpler and more limited is to use some sort of HTML unit-testing tool,
like HtmlUnit, which integrates all those components; it's basically a
web browser component without the frontend. In my experience HtmlUnit
tends to be difficult to customize for special cases (I don't remember
exactly, but it tends to download everything a browser would: all
frames, iframes, scripts, etc.), and I wouldn't recommend it for big
scale, just for a simple quick thing or for its intended purpose, unit
testing.
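
To give an idea, here is a rough, untested sketch of that tidier +
XPath route, assuming NekoHTML and Dom4J with Jaxen on the classpath;
the class name in the XPath and the expressions themselves are just
examples:

import java.io.StringReader;
import java.util.List;

import org.cyberneko.html.parsers.DOMParser;
import org.dom4j.Document;
import org.dom4j.Node;
import org.dom4j.io.DOMReader;
import org.xml.sax.InputSource;

public class DivExtractor {

    // Prints the text of DIVs with a given class, and the links inside them.
    public static void extract(String html) throws Exception {
        // NekoHTML turns messy real-world HTML into a well-formed W3C DOM.
        DOMParser parser = new DOMParser();
        parser.parse(new InputSource(new StringReader(html)));

        // Wrap it as a Dom4J document so Jaxen-backed XPath can be used.
        Document doc = new DOMReader().read(parser.getDocument());

        // NekoHTML upper-cases element names by default, hence DIV and A.
        List divs = doc.selectNodes("//DIV[@class='article-body']");
        for (Object o : divs) {
            Node div = (Node) o;
            System.out.println("content: " + div.getStringValue());
            // Only the links found inside the selected DIVs (requirement #2).
            List hrefs = div.selectNodes(".//A/@href");
            for (Object h : hrefs) {
                System.out.println("link: " + ((Node) h).getStringValue());
            }
        }
    }
}

Scoping the second XPath to the DIV nodes selected first is what limits
the link extraction to only the content you decided to keep.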

If you go with a custom crawler, remember that you'll have to deal with
the links database and the scheduling of downloads yourself, among
other things, but the page parsing, content extraction, and general NIO
infrastructure is well supported out there.
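
By "links database and scheduling" I mean, at its very simplest,
nothing more than a queue plus a visited set, roughly like this (a toy
sketch only: no politeness delays, robots.txt handling or persistence):

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class SimpleFrontier {

    public static void crawl(String seed, int maxPages) {
        Queue<String> frontier = new ArrayDeque<String>();
        Set<String> visited = new HashSet<String>();
        frontier.add(seed);

        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue; // already fetched
            }
            String html = fetch(url);                 // your HTTP client
            for (String link : extractLinks(html)) {  // e.g. the XPath code above
                if (!visited.contains(link)) {
                    frontier.add(link);
                }
            }
        }
    }

    // Stubs standing in for the real download and link-extraction steps.
    private static String fetch(String url) { return ""; }
    private static Set<String> extractLinks(String html) {
        return new HashSet<String>();
    }
}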

I hope that helps.
Regards,
Sebastian.

Sebastian,

did you have a chance to look at http://www.niocchi.com/ as well?

Lukas

Lukas,

I wasn't aware of that product; at the time I did my research I
certainly didn't find it. I've looked at it now: the documentation seems
scarce, but from what I understood it could work well for some use
cases.

Apparently you should "Subclass Worker and implement
processResource(Query)", and do your own parsing of the information
there, probably using tidying/XPath. I guess part of the parsing should
involve finding anchors ("A" tags) and refilling the URL pool so the
crawler can continue in depth.

The caveats are important (same-size in-memory buffers and no traffic
shaping), but for what you said you needed it shouldn't matter much. If
you need to crawl a handful of sites, without complex requirements, I'd
focus on the value in terms of functionality, community, documentation
and stability, and not so much on NIO. One of the problems with
blocking IO in the context of crawling (apart from the several KB per
thread) is that if a site misbehaves (for example by being very slow)
it can easily hang all your threads; with NIO you can fetch much more
in parallel, many times more than the max connections to each site, so
only that site would be affected. But again, it's a tradeoff that I
don't think is worth it in your case.

Sebastian.

Hi,

there's a new version of niocchi that removes one of the caveats: content can now be stored directly on disk using the DiskResource class. Additions were also made to the javadoc, which is accessible online.

flm