Using Pig/Spark on ElasticSearch (as External Storage)

Hi all,
I am currently using ElasticSearch for a logging system.
My first approach is to put every log into ES, with the index rolling
by date.
For real-time stats, I use aggregations.
For other statistics, I use Spark (or Hive, Shark, whatever) on the ES data
(thanks to the ElasticSearch-Hadoop plugin).
This works fine, but as my data grows (currently 17M records per index per day),
Spark (and Hive) becomes very slow.

I benchmarked the same data on ES and on Hadoop, and saw that Spark
(Hive) running on Hadoop is much faster.

Is there something I have missed?

Currently I maintain two stores: ES for real-time stats and
Hadoop for other statistics. Is that fine?

I would really appreciate any ideas or solutions.

Thanks.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/103fb68e-65e8-4b1c-9e75-b34d393b7210%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

It depends on various factors. Do you put all the data under one index, or
is it one index per day/month/hour? What type of script are you running, and
what performance degradation do you see? If it's easier, feel free to reach
out on IRC. I'll be traveling this week but will be back next week.
Cheers
(In reply to the original post by "Sang Dang", Oct 12, 2014 2:51 PM.)

Hi Costin Leau,

Currently I put all the data for a day into one index (INDEX_NAME_DATE).
In my benchmark, I only run two operations: count and count distinct on a field.
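
For reference, these two operations can also be written as an Elasticsearch aggregation request, which is one way to compare against the Spark/Hive numbers. The following is only a rough sketch: the field name `user_id` and the helper name `count_and_distinct_body` are hypothetical, and it only builds the request body without sending it anywhere. Note that the `cardinality` aggregation returns an approximate distinct count, and `value_count` counts field values rather than documents.

```python
import json


def count_and_distinct_body(field):
    """Build a search body asking for a value count and an
    approximate distinct count of `field` (hypothetical helper)."""
    return {
        "size": 0,  # return no hits, only the aggregation results
        "aggs": {
            # number of values present for the field
            "total": {"value_count": {"field": field}},
            # approximate number of distinct values
            "distinct": {"cardinality": {"field": field}},
        },
    }


# "user_id" is an assumed field name, not one from the benchmark.
body = count_and_distinct_body("user_id")
print(json.dumps(body, indent=2))
```

The body would be sent to the `_search` endpoint of the daily index (INDEX_NAME_DATE); the results come back under `aggregations.total.value` and `aggregations.distinct.value`.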

P.S. Thanks for your fast response. I would be really happy to talk to you on IRC
(just let me know the time).

(In reply to Costin Leau's message of Sunday, October 12, 2014 8:02:57 PM UTC+7.)