[HOWTO] Creating a streaming JSON river

Hi there,

I have created a small sample river (intended to help you write your
own implementation, not to be used as-is), which can be used to stream
JSON data into elasticsearch. This may not sound too impressive at
first, but let me list what it does:

  • It really streams data in (from any HTTP endpoint). This means you
    can import gigabytes of JSON data without going OOM, as you never
    need to hold the whole response in memory (see the first sketch
    after this list)
  • It is designed to work incrementally: it passes a timestamp to the
    endpoint, which should then only deliver the data that has changed
    since the last run - of course, you need to implement this logic in
    your endpoint as well
  • Even though it uses bulk imports, it has a max_bulk_size setting,
    which means the river fires a bulk request every n thousand
    products, making them searchable even before your data import is
    finished
  • It comes with sample tests, which might help you test your own
    river implementation
  • NOT implemented, but mentioned in the documentation: extending your
    river to read its configuration from the river settings instead of
    hardcoding it in the source (the second sketch below shows one way
    to do this)
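
To make the first three points more concrete, here is a minimal sketch
of such an import loop. It assumes Jackson's 2.x streaming API, an
older elasticsearch Java client, and an endpoint that returns a plain
JSON array of objects. All the names in it (StreamingImportSketch,
streamProducts, the "since" parameter, the "products" index,
MAX_BULK_SIZE) are made up for illustration and are not taken from the
repository:

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    import com.fasterxml.jackson.core.JsonFactory;
    import com.fasterxml.jackson.core.JsonParser;
    import com.fasterxml.jackson.core.JsonToken;
    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    import org.elasticsearch.action.bulk.BulkRequestBuilder;
    import org.elasticsearch.client.Client;

    public class StreamingImportSketch {

        // hypothetical default, use your own max_bulk_size here
        private static final int MAX_BULK_SIZE = 1000;

        public void streamProducts(Client client, String endpoint,
                long lastRun) throws Exception {
            // hand the timestamp of the last run to the endpoint, so it
            // can deliver only the data that has changed since then
            URL url = new URL(endpoint + "?since=" + lastRun);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();

            ObjectMapper mapper = new ObjectMapper();
            try (InputStream in = conn.getInputStream()) {
                JsonParser parser = new JsonFactory().createParser(in);

                // expect a top-level JSON array: [ {...}, {...}, ... ]
                if (parser.nextToken() != JsonToken.START_ARRAY) {
                    throw new IllegalStateException("expected a JSON array");
                }

                BulkRequestBuilder bulk = client.prepareBulk();
                int count = 0;

                // the parser only ever holds the current object in memory,
                // so the response can be gigabytes large without going OOM
                while (parser.nextToken() == JsonToken.START_OBJECT) {
                    JsonNode product = mapper.readTree(parser);
                    // a real river should set a product id here, so that
                    // incremental re-imports update instead of duplicate
                    bulk.add(client.prepareIndex("products", "product")
                            .setSource(product.toString()));

                    // fire a bulk request every MAX_BULK_SIZE products,
                    // making them searchable before the import is finished
                    if (++count % MAX_BULK_SIZE == 0) {
                        bulk.execute().actionGet();
                        bulk = client.prepareBulk();
                    }
                }
                if (bulk.numberOfActions() > 0) {
                    bulk.execute().actionGet();
                }
            }
        }
    }

The important bit is that the parser never holds more than the current
product in memory, while the periodic bulk execution makes
already-parsed products searchable long before the stream ends.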
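
And for the last point, here is a rough idea (again only a sketch, not
the repository's code) of how a river could read its url and
max_bulk_size from the river settings instead of hardcoding them, using
the standard elasticsearch river classes:

    import java.util.Map;

    import org.elasticsearch.client.Client;
    import org.elasticsearch.common.inject.Inject;
    import org.elasticsearch.common.xcontent.support.XContentMapValues;
    import org.elasticsearch.river.AbstractRiverComponent;
    import org.elasticsearch.river.River;
    import org.elasticsearch.river.RiverName;
    import org.elasticsearch.river.RiverSettings;

    public class ConfigurableJsonRiver extends AbstractRiverComponent
            implements River {

        private final Client client;
        private final String url;
        private final int maxBulkSize;

        @Inject
        public ConfigurableJsonRiver(RiverName riverName,
                RiverSettings settings, Client client) {
            super(riverName, settings);
            this.client = client;
            Map<String, Object> config = settings.settings();
            // the hardcoded values now only act as fallbacks when the
            // river _meta document does not specify them
            this.url = XContentMapValues.nodeStringValue(
                    config.get("url"), "http://localhost:8080/products");
            this.maxBulkSize = XContentMapValues.nodeIntegerValue(
                    config.get("max_bulk_size"), 1000);
        }

        @Override
        public void start() {
            // kick off the import thread here, using url and maxBulkSize
        }

        @Override
        public void close() {
            // stop the import thread here
        }
    }

Creating such a river with a _meta document containing "url" and
"max_bulk_size" keys would then override the hardcoded defaults.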

You can get the source and its documentation (which I tried to make
extensive) at https://github.com/spinscale/elasticsearch-river-streaming-json

Have fun, create your own river and, as usual, drop some feedback.
And yes, I know it's not always desirable to index products before the
whole response has come in, but I think in many cases this works
pretty well. Always think about your use case.

Regards, Alexander

P.S. I also created a small pull request to get the documentation for
this into the ES tutorials page
(https://github.com/elasticsearch/elasticsearch.github.com/pull/300)

--