[hadoop] newbie question


(liorg2) #1

Hi,

I have a few basic questions about es-hadoop,
and I would really appreciate your help.

  1. If I already have an ES cluster, is there any reason to add a Hadoop layer?

  2. Is the idea of es-hadoop that Hadoop is the data store and ES is the
    search engine on top of it?

  3. Can Logstash write to Hadoop?

  4. When I run queries against ES, do they go to HDFS in real time?

Thanks a lot!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/cf3e6331-2ac2-4967-89b9-a60f2967b8de%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Mark Walkom) #2
  1. Maybe, it depends on your use case.
  2. No. They connect, but ES does not store its data on HDFS.
  3. Not natively. See, e.g.,
    http://www.devopsa.net/2014/04/three-way-to-use-logstash-with-hadoop.html
  4. Can you elaborate on what you mean here (though see 2)?

In reply to Lior Goldemberg's message of 3 May 2015 at 19:21.


(Costin Leau) #3

To add to Mark's answer:

  1. Hadoop means a lot of things, so typically, if you are not familiar
    with it or not already a user, the answer tends to be no.
  2. No. Data is indexed from Hadoop to Elasticsearch or vice versa. See
    elastic.co/hadoop and the various presentations on this topic. Again,
    es-hadoop is meant for Hadoop users trying to leverage Elasticsearch.
  3. Mark already replied.
  4. No, see 2. Note that HDFS is an archival store, so accessing it in
    "real time" means slow access, especially for random access.



(liorg2) #4

Thanks guys,

I have an app that needs to write a lot of data; it currently writes
directly to ES.

I also have really heavy aggregations (scripted metric), which take a
long time (a few minutes).

Since I know that ES is supposed to be a search engine and not a DB (by
their own claim), I started to look for solutions, and I thought of the
following approach:

  1. Write to HDFS.
  2. Index the data after manipulation, which will save me the expensive
    aggregations.
  3. Run queries on ES.

Does that make sense?
Do you know a better, more scalable option?

Lior



(Costin Leau) #5

Yes that works. It looks like you are only using HDFS and none of the
computational components of Hadoop (Map/Reduce, Hive, Spark, etc...)
thus you could just import the data from HDFS to Elasticsearch with or
without Hadoop.
With Hadoop you get parallelism but you need to learn (if you haven't
already) one of the compute frameworks out there (there are plenty of
options). Without it you can get a potentially easier solution that
doesn't parallelize the input (which tends to be a problem only after
a certain size).
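
As a rough illustration of the "without Hadoop" route, here is a minimal, single-threaded sketch that turns exported lines into request bodies for the Elasticsearch bulk API (newline-delimited JSON, one action line followed by one document line). It assumes the data was copied out of HDFS as one JSON document per line (for example with `hdfs dfs -cat`); the index name `events`, the batch size, and the helper names `batches` and `to_bulk_body` are made up for the example, and actually POSTing each body to `_bulk` is left out.

```python
import json
from itertools import islice
from typing import Iterable, Iterator, List


def batches(lines: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` lines."""
    it = iter(lines)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def to_bulk_body(batch: Iterable[str], index: str) -> str:
    """Render one batch as a bulk request body (newline-delimited JSON)."""
    action = json.dumps({"index": {"_index": index}})
    parts = []
    for line in batch:
        doc = json.loads(line)  # fails early on a malformed input line
        parts.append(action)
        parts.append(json.dumps(doc))
    return "\n".join(parts) + "\n"  # a bulk body must end with a newline


# Hypothetical export: one JSON document per line.
exported = [
    '{"userid": 1, "event_type": "A"}',
    '{"userid": 1, "event_type": "T"}',
    '{"userid": 1, "event_type": "S"}',
]
bodies = [to_bulk_body(b, "events") for b in batches(exported, size=2)]
print(len(bodies))  # 2
```

Everything here runs sequentially in one process, which is the trade-off described above: simpler to set up, but the input is not parallelized.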

Which option fits your scenario really depends on your requirements.

I don't want to steer you away from Hadoop (I'm the lead of the es-hadoop
project); rather, I want to point out that it is not the only solution
out there and that it comes with a cost.

Cheers,



(Mark Walkom) #6

ES can do aggregations very quickly; it really depends on what you want to
do.

I'd suggest trying ES and seeing if it can do what you want, and go from
there.



(liorg2) #7

Hi guys,

Thanks again for the quick replies, very much appreciated!!

We have been using ES for the past year, and from day one we haven’t had
good performance from Groovy scripts that use scripted metric aggregations.

Our data is not huge yet: we have 163 indices, 356 shards and 360M
documents, but when we run the Groovy script it can take up to 2-3 minutes.
From our understanding it should run much faster.

So we are worried about the future, when the data becomes much bigger
after the beta stage.

Now, I'm not sure what the difference is between Hadoop and HDFS.
Is Hadoop an engine that runs over HDFS?

By the way, my complicated scenario is that I have tons of events, with
fields: event type, user id, date, ... [lots more] ...

For example:
{ userid:1, event_type:A, date:03/05/2015 14:25:01 }
{ userid:1, event_type:T, date:03/05/2015 14:25:02 }
{ userid:1, event_type:S, date:03/05/2015 14:25:03 }
{ userid:1, event_type:Z, date:03/05/2015 14:25:04 }
{ userid:1, event_type:B, date:03/05/2015 14:25:05 }

In the query, I need to find specific flows of users, not necessarily
consecutive. For example, A->S->Z needs to return the user above, with all
the relevant docs.
When using a scripted metric aggregation, it takes a long time and,
moreover, uses lots of memory, and sometimes kills ES.

Can Hadoop help me with this?
I thought of creating a list of events per user (currently I have a type
"events" in a daily index, with a list of events ordered by date and time,
and the user is a field in this type).

thanks again!



(liorg2) #8

Come on guys, you were so helpful so far :)



(Christian Dahlqvist) #9

Hi,

I am sure Hadoop can help you calculate this, but you may also be able to
go about this more efficiently in Elasticsearch. If you, as you mentioned,
were to create a user centric index in addition to the event centric one
that you have got, you could store a list of all the events belonging to a
user there. This would allow you to efficiently identify the users that
have all the required events through a simple query, and then just process
these to verify that the order is correct, which is likely to scale and
perform much better than the current approach. This is what is usually
referred to as entity-centric indexing [1].
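
To make the two steps concrete, here is a minimal sketch in plain Python. It assumes each user document boils down to a list of event types already ordered by time; the in-memory dict stands in for the user-centric index, and the function names are made up for the example. In a real setup the first step (users that have all the required events) would be a query against the index, and only the order check would run in client code.

```python
from typing import Dict, Iterable, List, Sequence


def has_all_events(user_events: Sequence[str], flow: Sequence[str]) -> bool:
    """Cheap pre-filter: does the user have every event type in the flow?"""
    return set(flow).issubset(user_events)


def matches_flow(user_events: Iterable[str], flow: Sequence[str]) -> bool:
    """True if `flow` occurs in order within `user_events`, gaps allowed."""
    remaining = iter(user_events)
    # `step in remaining` consumes the iterator up to the first match, so
    # each later step is only searched for after the previous one.
    return all(step in remaining for step in flow)


def users_matching_flow(
    user_docs: Dict[int, List[str]], flow: Sequence[str]
) -> List[int]:
    """user_docs maps a user id to that user's time-ordered event types."""
    candidates = (
        uid for uid, events in user_docs.items() if has_all_events(events, flow)
    )
    return [uid for uid in candidates if matches_flow(user_docs[uid], flow)]


user_docs = {
    1: ["A", "T", "S", "Z", "B"],  # the example user from this thread
    2: ["Z", "S", "A"],            # hypothetical: right events, wrong order
    3: ["A", "S"],                 # hypothetical: missing Z
}
print(users_matching_flow(user_docs, ["A", "S", "Z"]))  # [1]
```

The order check is a single pass over each candidate's events, so the expensive part is limited to the users that survive the pre-filter.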

As updating the user centric index for every event inserted can often be
expensive, a common approach is to create a batch job that periodically
retrieves all new events, aggregates these per user and updates the user
index. This will mean that the user index will not be completely up to date
all the time, but as you spread out the processing work, it can make
queries much more efficient.
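
A minimal sketch of the aggregation step of such a batch job, again in plain Python. The field names and the date format are taken from the sample events earlier in this thread; the function name `build_user_docs` and the shape of the resulting user document are assumptions for illustration. Fetching the new events and writing the result back to the user index are left out.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List

DATE_FORMAT = "%d/%m/%Y %H:%M:%S"  # assumed from the sample events in this thread


def build_user_docs(events: Iterable[dict]) -> Dict[int, dict]:
    """Group raw events by user and order each user's events by time."""
    per_user: Dict[int, List[dict]] = defaultdict(list)
    for event in events:
        per_user[event["userid"]].append(event)

    user_docs = {}
    for userid, user_events in per_user.items():
        user_events.sort(key=lambda e: datetime.strptime(e["date"], DATE_FORMAT))
        user_docs[userid] = {
            "userid": userid,
            "event_types": [e["event_type"] for e in user_events],
        }
    return user_docs


# Events as the batch job might retrieve them, not necessarily in order.
new_events = [
    {"userid": 1, "event_type": "S", "date": "03/05/2015 14:25:03"},
    {"userid": 1, "event_type": "A", "date": "03/05/2015 14:25:01"},
    {"userid": 1, "event_type": "T", "date": "03/05/2015 14:25:02"},
]
print(build_user_docs(new_events)[1]["event_types"])  # ['A', 'T', 'S']
```

A real job would also need to merge these lists into the user documents that already exist, rather than overwrite them.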

[1] https://www.elastic.co/videos/entity-centric-indexing-london-meetup-sep-2014

Best regards,

Christian



(liorg2) #10

Thanks a lot,
much appreciated.


