Using a sitemap to get Elasticsearch content


(muk dal) #1

Hello,

Has anyone successfully used a sitemap (such as http://www.nytimes.com/sitemaps/sitemap_news/sitemap.xml.gz) to get content for indexing in Elasticsearch?
Apart from the HTML content pages, I'd also like to index metadata from the sitemap entries (such as news:title).

My first attempt was to modify an existing river (wikipedia or fsriver), but I have not gotten it to work yet.

Thanks,
Muk


(Ivan Brusic) #2

The way I see it, rivers are suited to data that changes constantly and
needs real-time updates: Twitter, CouchDB updates, RabbitMQ, etc.
Sitemaps are fairly static by comparison. IMHO, an external process that
reads and parses a sitemap and uses that data to index into Elasticsearch
would be a better fit.
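
For illustration, the reading-and-parsing half of such an external process might look like the sketch below. It assumes the sitemap follows the standard sitemap protocol with Google's News extension (which is where news:title comes from); the function name, the output field names, and the sample XML are made up for the example, and fetching the file over HTTP is left out.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespaces of the sitemap protocol and the Google News sitemap extension.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
}


def parse_news_sitemap(raw):
    """Return one dict per <url> entry (hypothetical field names)."""
    # Files such as sitemap.xml.gz start with the gzip magic number.
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    root = ET.fromstring(raw)
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append({
            "url": url.findtext("sm:loc", default="", namespaces=NS).strip(),
            # Stays None when the entry carries no news metadata.
            "title": url.findtext("news:news/news:title", namespaces=NS),
            "published": url.findtext(
                "news:news/news:publication_date", namespaces=NS
            ),
        })
    return entries


# Made-up sample input, not taken from any real sitemap.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>http://www.example.com/article-1.html</loc>
    <news:news>
      <news:title>Example headline</news:title>
      <news:publication_date>2012-08-08</news:publication_date>
    </news:news>
  </url>
</urlset>"""

if __name__ == "__main__":
    for entry in parse_news_sitemap(SAMPLE):
        print(entry)
```

The same function accepts the gzipped bytes directly, so the downloaded .gz file can be passed in without unpacking it first.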

That said, any Java process can be made into a river. Elasticsearch
indexes JSON documents and does not parse XML itself, so an XML parser
is needed to convert the sitemap entries first. Surely someone has
abstracted sitemap parsing into a library by now.
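
As a rough sketch of the XML-to-JSON step, already-parsed sitemap entries can be serialized into a request body for the bulk API (POST to /_bulk), which takes newline-delimited JSON: one action line followed by one document line. The index name "news", the field names, and using the page URL as the document id are assumptions for the example; older releases also expect a _type in the action line.

```python
import json


def to_bulk_body(entries, index="news"):
    """Serialize entries as newline-delimited JSON for the bulk API.

    `entries` is a list of dicts that each contain a "url" key; the
    index name and the use of the URL as _id are illustrative choices.
    """
    lines = []
    for entry in entries:
        # Action line: tells the bulk endpoint where the document goes.
        lines.append(json.dumps({"index": {"_index": index, "_id": entry["url"]}}))
        # Source line: the document itself.
        lines.append(json.dumps(entry))
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    docs = [{"url": "http://www.example.com/article-1.html",
             "title": "Example headline"}]
    print(to_bulk_body(docs), end="")
```

Using the URL as the id means re-running the process on an updated sitemap overwrites existing documents instead of duplicating them.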

Cheers,

Ivan

(In reply to muk dal's post #1 of Wed, Aug 8, 2012, 12:58 PM.)

