[HOWTO] Creating a streaming JSON river

Hi there,

I have created a small sample river (intended to help you write your
own implementation, not to be used as-is), which can be used to stream
JSON data into elasticsearch. This may not sound too impressive at
first, but let me list what it does:

  • It really streams data in (from any HTTP endpoint). This means you
    can import gigabytes of JSON data without going OOM, as you never
    need to hold the whole response in memory (see the first sketch
    after this list)
  • It is designed to work incrementally: it passes a timestamp to the
    endpoint, which should then only deliver the data that has changed
    since the last run - of course, you need to implement this logic in
    your endpoint as well
  • Even though it uses bulk imports, it has a max_bulk_size setting,
    which means the river fires a bulk request every n thousand
    products, making them searchable even before your data import is
    finished
  • It comes with sample tests, which might help you test your own
    river implementation
  • NOT implemented, but mentioned in the documentation: extending your
    river to read its configuration from the river settings instead of
    hardcoding it in the source (the second sketch below shows one way
    to do this)
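
To make the first three points more concrete, here is a minimal sketch
of such an import loop. It assumes Jackson's 2.x streaming API, an
older elasticsearch Java client, and an endpoint that returns a plain
JSON array of objects. All the names in it (StreamingImportSketch,
streamProducts, the "since" parameter, the "products" index,
MAX_BULK_SIZE) are made up for illustration and are not taken from the
repository:

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    import com.fasterxml.jackson.core.JsonFactory;
    import com.fasterxml.jackson.core.JsonParser;
    import com.fasterxml.jackson.core.JsonToken;
    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    import org.elasticsearch.action.bulk.BulkRequestBuilder;
    import org.elasticsearch.client.Client;

    public class StreamingImportSketch {

        // hypothetical default, use your own max_bulk_size here
        private static final int MAX_BULK_SIZE = 1000;

        public void streamProducts(Client client, String endpoint,
                long lastRun) throws Exception {
            // hand the timestamp of the last run to the endpoint, so it
            // can deliver only the data that has changed since then
            URL url = new URL(endpoint + "?since=" + lastRun);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();

            ObjectMapper mapper = new ObjectMapper();
            try (InputStream in = conn.getInputStream()) {
                JsonParser parser = new JsonFactory().createParser(in);

                // expect a top-level JSON array: [ {...}, {...}, ... ]
                if (parser.nextToken() != JsonToken.START_ARRAY) {
                    throw new IllegalStateException("expected a JSON array");
                }

                BulkRequestBuilder bulk = client.prepareBulk();
                int count = 0;

                // the parser only ever holds the current object in memory,
                // so the response can be gigabytes large without going OOM
                while (parser.nextToken() == JsonToken.START_OBJECT) {
                    JsonNode product = mapper.readTree(parser);
                    // a real river should set a product id here, so that
                    // incremental re-imports update instead of duplicate
                    bulk.add(client.prepareIndex("products", "product")
                            .setSource(product.toString()));

                    // fire a bulk request every MAX_BULK_SIZE products,
                    // making them searchable before the import is finished
                    if (++count % MAX_BULK_SIZE == 0) {
                        bulk.execute().actionGet();
                        bulk = client.prepareBulk();
                    }
                }
                if (bulk.numberOfActions() > 0) {
                    bulk.execute().actionGet();
                }
            }
        }
    }

The important bit is that the parser never holds more than the current
product in memory, while the periodic bulk execution makes
already-parsed products searchable long before the stream ends.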
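
And for the last point, here is a rough idea (again only a sketch, not
the repository's code) of how a river could read its url and
max_bulk_size from the river settings instead of hardcoding them, using
the standard elasticsearch river classes:

    import java.util.Map;

    import org.elasticsearch.client.Client;
    import org.elasticsearch.common.inject.Inject;
    import org.elasticsearch.common.xcontent.support.XContentMapValues;
    import org.elasticsearch.river.AbstractRiverComponent;
    import org.elasticsearch.river.River;
    import org.elasticsearch.river.RiverName;
    import org.elasticsearch.river.RiverSettings;

    public class ConfigurableJsonRiver extends AbstractRiverComponent
            implements River {

        private final Client client;
        private final String url;
        private final int maxBulkSize;

        @Inject
        public ConfigurableJsonRiver(RiverName riverName,
                RiverSettings settings, Client client) {
            super(riverName, settings);
            this.client = client;
            Map<String, Object> config = settings.settings();
            // the hardcoded values now only act as fallbacks when the
            // river _meta document does not specify them
            this.url = XContentMapValues.nodeStringValue(
                    config.get("url"), "http://localhost:8080/products");
            this.maxBulkSize = XContentMapValues.nodeIntegerValue(
                    config.get("max_bulk_size"), 1000);
        }

        @Override
        public void start() {
            // kick off the import thread here, using url and maxBulkSize
        }

        @Override
        public void close() {
            // stop the import thread here
        }
    }

Creating such a river with a _meta document containing "url" and
"max_bulk_size" keys would then override the hardcoded defaults.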

You can get the source and its documentation (which I tried to make
extensive) at https://github.com/spinscale/elasticsearch-river-streaming-json

Have fun, create your own river and, as usual, drop some feedback.
And yes, I know it's not always desirable to index products before the
whole response has come in, but I think in many cases this works
pretty well. Always think about your use case.

Regards, Alexander

P.S. I also created a small pull request to get the documentation for
this into the ES tutorials page
(https://github.com/elasticsearch/elasticsearch.github.com/pull/300)

--