Try as I might, and I have read everything I can find on the ES website
about this, I understand roughly how the integration works but not the
actual nuts and bolts of it.
For example:
Is Hadoop just storing the files that would normally be stored in the local
filesystem for the ES indexes, or is it storing the data that would normally
be in those indexes, with access going through es-hadoop?
If it is the latter, how do you go about determining what to set for the
number of nodes and shards?
If anyone has any information on this or, even better, a place to point
me to that has better references so that I can research this on my own, it
would be much appreciated.
Think of es-hadoop as a connector between Hadoop and Elasticsearch. You
would use it to index data in Hadoop to ES or run queries in ES directly
from Hadoop.
Where does ES store the data? That depends on its configuration (completely
separate from es-hadoop itself). In general (and by default) it stores the
data on the local file-system. If you want to use it on a shared
file-system or HDFS, you can do that by mounting it locally (for
example, mount HDFS through NFS as a local disk) and pointing ES to it. ES
is happy to work with it; however, the performance will be significantly
degraded and most of its real-time nature will go out the window,
since HDFS is a distributed file-system (and thus even basic operations
like opening or closing a file mean at least one call over the
network), plus you're giving up the amazing OS file-system cache (since the
fs is not local). If the FS is slow, anything that sits on top of it (like
ES) will be slow as well.
Hope this helps,
P.S. By the way, if you want/need to snapshot/restore data to/from ES
from/to HDFS you can use the HDFS repository (more info here:
So if I understand you correctly, if the data is stored in Hadoop then
es-hadoop is really just acting as a job manager? If that is the case, what
is the rule of thumb for how many ES nodes and shards should be set?
Hmm, I am not sure I understand your questions.
Hadoop is a distributed storage system (HDFS) and a MapReduce framework (MR)
(among other things).
ES is a distributed storage/search system (among other things).
So here is what es-hadoop gives you:
You can read data from ES and do some complex analysis, taking advantage
of MR.
You can write data to ES - for example, process some data stored on HDFS
and write some pre-aggregated data to ES.
es-hadoop is basically a connector between ES and Hadoop.
I hope this helps.
And I don't think this is in any way related to the number of shards and nodes.
Adding to what Georgi wrote, es-hadoop does not create the shards for you -
that's up to you or index templates (which I highly recommend). However
es-hadoop is aware of the target shards and will use them to parallelize
the reads/writes (such as one task per shard).
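To make the shard point concrete, here is a minimal Python sketch; it is not es-hadoop's actual code. It builds only the settings portion of an index template, where `number_of_shards` and `number_of_replicas` are real ES settings but the values, the index name `logs-2014.06`, and the helpers `template_settings` and `plan_tasks` are made up for illustration. The second helper just shows the "one task per shard" idea as a plain function.

```python
import json


def template_settings(shards, replicas=1):
    """Settings portion of an index template (hypothetical helper).

    The shard count is chosen by you (or by the template), not by es-hadoop.
    """
    return {"settings": {"number_of_shards": shards,
                         "number_of_replicas": replicas}}


def plan_tasks(index, shards):
    """Hypothetical planner: one read/write task per shard of the target index."""
    return [{"task": i, "index": index, "shard": i} for i in range(shards)]


body = template_settings(5)
tasks = plan_tasks("logs-2014.06", body["settings"]["number_of_shards"])

print(json.dumps(body))
print(len(tasks), "tasks")
```

So the shard count you pick in the template also bounds how far a job against that index can be parallelized in this simplified picture.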
I guess the problem I am having wrapping my head around is exactly where the
data is residing and in what format.
If I understand Georgi's email above, you can run MapReduce jobs against
data stored locally in ES by utilizing es-hadoop, and you can also run ES
queries against data in Hadoop by utilizing es-hadoop.
Is that correct?
ES stores data in its own internal format, which typically resides locally.
What you are stating is partially correct - with the connector you would
move/copy data between Hadoop and ES since, in order for ES to work with
data, it needs to actually index it (that is, to see it).
So you would use es-hadoop to index data from Hadoop in ES and/or query ES
directly from Hadoop.
Thanks. So just one final question. From what you said above, that means
that you cannot run ES queries on data in Hadoop over something like a 6
month time range without having to pull in all that data and index it
first. And I am assuming that the opposite is also correct: that Hadoop
cannot run jobs on data in ES without first pulling that data into its own
storage.
From what you said above that means that you can not run ES queries on
data in Hadoop over something like a 6 month time range without it having
to pull in all that data and index it first. - CORRECT. ES queries
can run only on ES.
And I am assuming that the opposite is also correct, that Hadoop can not
run jobs on data in ES without it first pulling in that data to its storage
first. - NOT CORRECT.
The thing is that you can run MR jobs against data stored in ES (via
EsInputFormat).
So you can do some really cool stuff reading (and writing) data from ES and
then use the power of MR to process/analyze/do whatever you want with the
data.
In the most common case, with a Hadoop MR job you do the following:
Job config: input, output, input format, output format, etc.
Mapper - process each "line" of the input (stored on HDFS) and eventually
"emit" key/value pairs to the Reducer.
Reducer - process all values for one key and eventually emit again to
the output (on HDFS).
With es-hadoop you can set the job input data to be read from ES (in the
job config step), and then all the other steps can stay the same.
Here are some typical scenarios:
1. Read (via an ES query) from ES
1.1 Process the data in an MR job
1.2 Store the output to HDFS [OR store the output to ES again (ES indexing
operation)]
2. Run an MR job against data stored on HDFS
2.1 Process the data
2.2 Store the output to ES (ES indexing)
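The two scenarios above can be sketched in plain Python, with no Hadoop or ES involved. Everything here is a stand-in: `run_job`, the sample records, and the in-memory lists that play the role of ES and HDFS are hypothetical, and a real job would go through the es-hadoop input/output formats instead. The only point is that the mapper and reducer stay the same while the source and the sink are swapped.

```python
from collections import defaultdict


def mapper(record):
    """Emit (key, value) pairs for one input record."""
    yield record["host"], record["bytes"]


def reducer(key, values):
    """Combine all values seen for one key."""
    return key, sum(values)


def run_job(source, sink):
    """Tiny in-memory map/shuffle/reduce; source and sink are plain lists."""
    groups = defaultdict(list)
    for record in source:
        for key, value in mapper(record):
            groups[key].append(value)
    for key in sorted(groups):
        sink.append(reducer(key, groups[key]))
    return sink


records = [
    {"host": "a", "bytes": 10},
    {"host": "b", "bytes": 5},
    {"host": "a", "bytes": 7},
]

# Scenario 1: records stand in for the hits of an ES query, output goes to "HDFS".
hdfs_output = run_job(records, [])

# Scenario 2: records stand in for lines on HDFS, output is "indexed" into ES.
es_index = run_job(records, [])

print(hdfs_output)
print(es_index)
```

In a real setup only the job configuration would differ between the two scenarios; the processing logic in the middle would not need to know where the data came from.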