The way I see it, rivers are suited for data that is constantly
updating, where real-time delivery matters: Twitter, CouchDB changes,
RabbitMQ, etc... Sitemaps are fairly static compared to the real-time
nature of Twitter. IMHO, an external process that reads and parses a
sitemap and uses that data to index Elasticsearch would be a better fit.
That said, any Java process can be made into a river. Elasticsearch
doesn't ingest XML directly, so the sitemap has to be parsed and
converted to JSON first. Surely someone has abstracted sitemap parsing
into a library by now.
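For what it's worth, the external-process approach doesn't need much code. Here's a minimal stdlib-only sketch that parses a standard sitemap.org urlset and turns it into a payload for Elasticsearch's bulk API; the index name and sample URLs are illustrative, and the actual HTTP POST is left to whatever client you prefer:

```python
import json
import xml.etree.ElementTree as ET

# Standard sitemap.org namespace.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_to_bulk(xml_text, index="sitemap"):
    """Parse a sitemap urlset and emit Elasticsearch bulk-API NDJSON."""
    root = ET.fromstring(xml_text)
    lines = []
    for url in root.findall("sm:url", NS):
        doc = {"url": url.findtext("sm:loc", namespaces=NS)}
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        if lastmod:
            doc["lastmod"] = lastmod
        # Each document is an action line followed by the source line.
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

# Illustrative sitemap; in practice you'd fetch this over HTTP.
sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc><lastmod>2013-01-01</lastmod></url>
  <url><loc>http://example.com/about</loc></url>
</urlset>"""

body = sitemap_to_bulk(sample)
# POST body to http://localhost:9200/_bulk with Content-Type
# application/x-ndjson, on whatever schedule suits your sitemap.
```

Run it from cron (or a scheduled job) and you get the same effect as a river without tying the indexing lifecycle to the cluster.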