How to construct a Spark DStream from a continuous series of RDDs?


(Kramer Li) #1

I'm reading data from Elasticsearch into Spark every 5 minutes, so there is a new RDD every 5 minutes.

I hope to construct a DStream from these RDDs, so that I can generate reports over the data from the last day, the last hour, the last 5 minutes, and so on.

To construct the DStream, I was thinking about creating my own receiver, but the official Spark documentation only covers doing so in Scala or Java, and I use Python.

So do you know of any way to do it? Since a DStream is just a series of RDDs, it should be possible to create a DStream from a continuous series of RDDs; I just don't know how. Please give some advice.

apache elasticsearch


(Costin Leau) #2

I'm afraid I can't help when it comes to Python. ES-Hadoop is based on the JVM; Spark's Python integration can leverage the InputFormat/OutputFormat, but that is not enough, in terms of efficiency, when it comes to building an RDD.
There is, however, a community Python wrapper around ES-Hadoop available on GitHub; maybe that one will address your problem.

Pardon?


(system) #3