The way I see it, rivers are suited to data that changes constantly and
needs real-time updates: Twitter, CouchDB updates, RabbitMQ, etc. Sitemaps
are fairly static compared to the real-time nature of Twitter. IMHO, an
external process that reads and parses the sitemap and uses that data to
index into ElasticSearch would be better. That said, any Java process can
be made into a river. ElasticSearch indexes JSON documents, not XML, so a
custom XML parser is required. Surely someone has abstracted sitemap
parsing into a library by now.
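
For what it's worth, here is a rough sketch of the external-process idea in Python, using only the standard library. It parses a (possibly gzipped) sitemap and builds a newline-delimited body in the shape the ElasticSearch bulk API accepts. The function names, the "sitemap" index name, and the document field names are all made up for illustration, and older ElasticSearch versions also expect a "_type" in the action line, so treat this as a starting point and not a tested integration.

```python
import gzip
import json
import xml.etree.ElementTree as ET

# Namespaces from the sitemaps.org protocol and Google's news sitemap extension.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
}


def parse_sitemap(data):
    """Return one dict per <url> entry; accepts raw or gzip-compressed XML bytes."""
    if data[:2] == b"\x1f\x8b":  # gzip magic number, e.g. sitemap.xml.gz
        data = gzip.decompress(data)
    root = ET.fromstring(data)
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append({
            # Field names below are hypothetical; pick whatever suits your mapping.
            "loc": url.findtext("sm:loc", default="", namespaces=NS).strip(),
            "lastmod": url.findtext("sm:lastmod", default=None, namespaces=NS),
            "news_title": url.findtext(
                "news:news/news:title", default=None, namespaces=NS
            ),
        })
    return entries


def to_bulk_body(entries, index="sitemap"):
    """Build a bulk request body: an action line, then the document, per entry."""
    lines = []
    for entry in entries:
        # Using the page URL as the document id makes re-runs overwrite, not duplicate.
        lines.append(json.dumps({"index": {"_index": index, "_id": entry["loc"]}}))
        lines.append(json.dumps(entry))
    # The bulk API requires every line, including the last, to end with a newline.
    return "".join(line + "\n" for line in lines)
```

The resulting string would be POSTed to the cluster's _bulk endpoint by whatever HTTP client you prefer. Fetching the HTML behind each loc and adding it to the document is left out here; it would slot in between the two functions.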
On Wed, Aug 8, 2012 at 12:58 PM, muk dal email@example.com wrote:
Has anyone successfully used a sitemap (such as
http://www.nytimes.com/sitemaps/sitemap_news/sitemap.xml.gz) to get content
for indexing in ElasticSearch?
Apart from the HTML content pages, I'd also like to index metadata from the
sitemap entries (such as news:title).
My first attempt was to modify an existing river (wikipedia or fsriver), but
I have not gotten it to work yet.