Hi there,
I have created a small sample river (intended to help you write your
own implementation rather than to be used as-is), which streams JSON
data into Elasticsearch. That may not sound too impressive at first,
but let me explain:
- It really streams data in (from any HTTP endpoint). This means you
can import gigabytes of JSON data without going OOM, as you do not
need to hold the whole response in memory
- Its architecture works incrementally: it provides a timestamp to the
endpoint, which should then deliver only the data that has updates -
you need to implement this logic in your endpoint as well, of course
- Even though it uses bulk imports, it has a max_bulk_size setting,
which means the river fires a bulk import every n thousand products,
making them searchable even before your data import is finished
- It comes with sample tests, which might help you test your own river
implementation
- NOT implemented, but mentioned in the documentation: extending the
river to read its configuration from the river configuration instead
of it being hardcoded in the source.
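To make the streaming-plus-bulk idea above concrete, here is a minimal
sketch of that core loop in Python (the river itself is a Java plugin;
the names fetch, bulk_index and run_import here are purely illustrative,
not the plugin's actual API): documents are parsed one at a time from the
stream, and a bulk request is fired every max_bulk_size documents, so
data becomes searchable before the whole import finishes.

```python
# Hypothetical sketch of the river's core loop, not the plugin's code:
# stream JSON documents (one per line) and flush a bulk request every
# max_bulk_size docs, so memory stays bounded and data is searchable early.
import json
from typing import Callable, Iterable, Iterator, List


def stream_docs(lines: Iterable[str]) -> Iterator[dict]:
    """Parse one JSON document per line; never holds the whole response."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def run_import(lines: Iterable[str],
               max_bulk_size: int,
               bulk_index: Callable[[List[dict]], None]) -> int:
    """Buffer docs and fire a bulk import every max_bulk_size documents.

    Returns the number of bulk requests fired.
    """
    buffer: List[dict] = []
    flushes = 0
    for doc in stream_docs(lines):
        buffer.append(doc)
        if len(buffer) >= max_bulk_size:
            bulk_index(buffer)  # makes this batch searchable right away
            flushes += 1
            buffer = []
    if buffer:  # flush the remainder once the stream ends
        bulk_index(buffer)
        flushes += 1
    return flushes
```

In the real setup the request to the endpoint would also carry the
last-run timestamp, so the endpoint can return only updated documents -
that incremental part lives on your side, as noted above.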
You can get the source and its documentation (I tried to make it
extensive) at https://github.com/spinscale/elasticsearch-river-streaming-json
Have fun, create your own river and, as usual, drop some feedback.
And yes, I know it is not always desirable to import products before
the whole response has come in, but I think this works pretty well in
many cases. Always think about your use case.
Regards, Alexander
P.S. I also created a small pull request to get the documentation for
this into the ES tutorials page
(https://github.com/elasticsearch/elasticsearch.github.com/pull/300)
--