Crawling web sites and indexing the extracted content

Hi Otis,
Yeah, I noticed that in your Lucene Revolution presentation. My question
is: how much do you manipulate the HTML document content? Do you extract
all of the data from it, or just specific portions? Which tools do you
use for this?
Thanks,
Lukas

On Thu, Nov 11, 2010 at 1:49 PM, Otis otis.gospodnetic@gmail.com wrote:

Hi Lukáš,

We use Droids for http://search-hadoop.com and http://search-lucene.com.
Yes, the community is slim and not very active, but what's there seems
to work. It's much simpler than Nutch or Heritrix.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

On Nov 10, 7:23 am, Lukáš Vlček lukas.vl...@gmail.com wrote:

Hi,

Does anybody have any experience with crawling web content and indexing
it with ElasticSearch?

I need to crawl a relatively small number of web pages from a few
internet domains and extract some content from them. Rather than a
massive-scale crawler, I need very precise control and the ability to
plug in custom code.

I need very flexible control over the content extraction and processing,
at least the following (see the sketches after this list):

  1. extracting just a portion of a crawled HTML page (for example, only
     DIV elements with specific class or id attributes)
  2. extracting only specific links from the content identified in #1
  3. adding metadata to each extracted piece of content
  4. either storing the output to the file system, or processing it and
     sending it to a search server (i.e. indexing the extracted content
     with the search server)
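For points 1-3, an HTML parser such as jsoup keeps the custom code small.
A minimal sketch, assuming jsoup is on the classpath; the URL, the
"div.article" selector, and the printed fields are placeholders for
illustration, not part of any particular crawler:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class ExtractExample {
        public static void main(String[] args) throws Exception {
            // fetch and parse one page (placeholder URL)
            Document doc = Jsoup.connect("http://example.com/page.html").get();
            // point 1: only DIV elements with a specific class
            for (Element div : doc.select("div.article")) {
                String text = div.text();
                // point 2: only the links inside the extracted content
                for (Element link : div.select("a[href]")) {
                    System.out.println(link.attr("abs:href")); // absolute URL
                }
                // point 3: attach metadata, e.g. the source URL
                System.out.println(doc.location() + " -> " + text);
            }
        }
    }

The same selectors work on documents parsed from strings or files, so the
fetching side can come from whichever crawler ends up being used.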

I think that with my requirements I will end up writing some (possibly
not a lot of) custom code. That is fine, but I would like to avoid
unnecessary complexity if possible. I am not considering incremental
crawling (it is fine to reindex everything each time).
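As for point 4, sending each extracted document to ElasticSearch is just
an HTTP PUT of a JSON document against its index API, so no client
library is strictly required. A minimal sketch, assuming a node on the
default localhost:9200; the "pages"/"page" index and type names, the
document id, and the field names are assumptions for illustration:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class IndexExample {
        public static void main(String[] args) throws Exception {
            // one extracted page as a JSON document (fields are illustrative)
            String json = "{\"url\":\"http://example.com/page.html\","
                        + "\"body\":\"...extracted text...\"}";
            // PUT /{index}/{type}/{id} indexes (or overwrites) the document
            URL url = new URL("http://localhost:9200/pages/page/1");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("PUT");
            conn.setDoOutput(true);
            OutputStream out = conn.getOutputStream();
            out.write(json.getBytes("UTF-8"));
            out.close();
            System.out.println("HTTP " + conn.getResponseCode()); // 200 or 201 on success
            conn.disconnect();
        }
    }

Since everything is reindexed on each run anyway, a deterministic id (for
example a hash of the page URL) would make each run simply overwrite the
documents from the previous one.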

There are many web crawlers out there, but they are either old with
inactive communities, or they tend to become complex (sure, web crawling
is a complex thing...).
There are probably two good candidates: Heritrix and Nutch. Would you
recommend either of them for the above requirements?
There is also the Apache Droids project, which should allow for custom
crawlers, but it seems to me that the project is quite inactive.

Any comments are welcome.

Regards,
Lukas