Reading gateway data in HDFS independently of ES

Andrew_Clegg · March 6, 2012, 4:55pm

Hi,

If you use the Hadoop gateway to ship all your ES data to HDFS, is it
in a format amenable to running map-reduce jobs over, independently of
ES?

For example, it would be really useful to be able to do Pig queries
over the raw JSON document contents. Wonderdog
(https://github.com/infochimps/wonderdog) lets you do this via the ES cluster as a scan
query, but that will put load on ES. If the data's already being
written to the Hadoop cluster via a gateway, can you just analyse it
there? And if so, does anyone have an example?
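
To make the request concrete, here is a rough sketch of the kind of job I have in mind. It assumes the documents were available as one JSON object per line, which is purely an assumption for illustration and not something the gateway is known to write. The `count_field` helper and the sample records are hypothetical.

```python
import json
from collections import Counter


def count_field(lines, field):
    """Count how often each value of `field` occurs across JSON lines.

    Assumes one JSON document per line (a hypothetical export format,
    not the gateway's actual on-disk layout).
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = json.loads(line)  # "map" step: parse one document
        if field in doc:
            counts[doc[field]] += 1  # "reduce" step: aggregate by value
    return counts


# Hypothetical sample documents standing in for exported ES sources.
sample = [
    '{"user": "alice", "lang": "en"}',
    '{"user": "bob", "lang": "de"}',
    '{"user": "carol", "lang": "en"}',
]
result = count_field(sample, "lang")
```

In Pig or a real map-reduce job, the parse and the count would be split across mappers and reducers, but the shape of the computation is the same.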

Many thanks,

Andrew.

kimchy · March 6, 2012, 8:56pm

No, it's not really storing it in a way that you can easily read the actual JSON document.

Craig_Brown · March 6, 2012, 9:27pm

Could you read the indices with the Lucene libraries?

Craig
