Using a sitemap to get elasticSearch content

muk_dal · August 8, 2012, 7:58pm

Hello,

Has anyone successfully used a sitemap (such as
http://www.nytimes.com/sitemaps/sitemap_news/sitemap.xml.gz) to get content for indexing in Elasticsearch?
Apart from the HTML content pages, I'd also like to index metadata from the sitemap entries (such as news:title).

My first attempt was to modify an existing river (wikipedia or fsriver), but I have not gotten it to work yet.

thanks,
Muk

Ivan · August 9, 2012, 6:07pm

The way I see it, rivers are suited for data that is constantly
updating and needs real-time indexing: twitter, couchdb updates,
rabbitmq, etc. Sitemaps are fairly static compared to the real-time
nature of Twitter. IMHO, an external process that reads and parses a
sitemap and indexes that data into Elasticsearch would be better.

That said, any Java process can be made into a river. Elasticsearch
doesn't accept XML directly (documents are JSON), so an XML parser is
required in front of it. Surely someone has abstracted sitemap parsing
into a library by now.
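
To make the external-process idea concrete, here is a rough Python sketch, not a tested solution. It assumes the sitemap follows the standard sitemaps.org and Google News sitemap namespaces; the function names and the "news" index name are made up for illustration. It parses the (optionally gzipped) sitemap, pulls out each URL and its news:title, and builds a request body in the bulk API's newline-delimited JSON format. Fetching the sitemap and the HTML pages, and actually POSTing to _bulk, are left out.

```python
import gzip
import json
import xml.etree.ElementTree as ET

# Namespace URIs assumed from the sitemaps.org and Google News sitemap schemas.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
}


def parse_news_sitemap(data):
    """Return one dict per <url> entry; accepts raw or gzipped XML bytes."""
    if data[:2] == b"\x1f\x8b":  # gzip magic number
        data = gzip.decompress(data)
    root = ET.fromstring(data)
    docs = []
    for url in root.findall("sm:url", NS):
        docs.append({
            "url": url.findtext("sm:loc", default="", namespaces=NS).strip(),
            "title": url.findtext(
                "news:news/news:title", default="", namespaces=NS
            ).strip(),
            "published": url.findtext(
                "news:news/news:publication_date", default="", namespaces=NS
            ).strip(),
        })
    return docs


def to_bulk_body(docs, index="news"):
    """Build a bulk request body: an action line, then the document, per entry.

    The index name is hypothetical; the URL is used as the document id so
    re-running the import overwrites entries instead of duplicating them.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["url"]}}))
        lines.append(json.dumps(doc))
    return "".join(line + "\n" for line in lines)
```

The resulting string would be sent to the cluster's _bulk endpoint by whatever HTTP client you prefer. A cron job running this every few minutes is probably enough for a news sitemap.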

Cheers,

Ivan

