Using Pig/Spark on ElasticSearch (as External Storage)

zkidkid · October 12, 2014, 12:51pm

Hi All,
I am currently using Elasticsearch for a logging system.
My first solution is to put every log into ES, with the index rolling
by date.
For real-time stats, I use aggregations.
For other statistics, I use Spark (or Hive, Shark, whatever) on the ES data
(thanks to the Elasticsearch-Hadoop plugin).
All is fine, but as my data grows (currently 17M records per index per date),
Spark (and Hive) becomes very slow.

I ran a benchmark with the same data on ES and Hadoop, and I saw that Spark
(Hive) running on Hadoop is much faster.

Is there something I have missed?

Currently I maintain two storage systems: ES for real-time stats and
Hadoop for other statistics. Is that fine?

I would really appreciate any ideas or solutions.

Thanks.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/103fb68e-65e8-4b1c-9e75-b34d393b7210%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

costin · October 12, 2014, 1:02pm

It depends on various factors. Do you put all the data under one index, or
is it one index per day/month/hour? What type of script are you running, and
what performance degradation do you see? If it's easier, feel free to reach
out on IRC. I'll be traveling this week but will be back next week.
Cheers

zkidkid · October 12, 2014, 1:41pm

Hi Costin Leau,

Currently I just put all the data in one index (INDEX_NAME_DATE).
In my benchmark, I run just two operations: count and count distinct on a field.

P/S: Thanks for your fast response. I would be really happy to see you on IRC
(just give me the time).
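
For reference, the count-distinct operation from the benchmark above can also be expressed on the Elasticsearch side as a cardinality aggregation, which returns an approximate distinct count without shipping documents to Spark or Hive. The sketch below only builds the request body. It is not the poster's setup: the index naming format, the "logs" prefix, and the "user_id" field are hypothetical, since the thread only says INDEX_NAME_DATE.

```python
import json
from datetime import date


def daily_index_name(prefix: str, day: date) -> str:
    """Build a per-day index name.

    Assumption: the thread only mentions INDEX_NAME_DATE, so the
    underscore separator and the YYYYMMDD format are hypothetical.
    """
    return f"{prefix}_{day:%Y%m%d}"


def count_distinct_body(field: str) -> dict:
    """Build a search body that asks for a distinct count of `field`.

    "size": 0 suppresses the hits themselves; the total document count
    is still reported in the response's hits total, and the distinct
    count comes back under aggregations -> distinct -> value.
    The cardinality aggregation is approximate, not exact.
    """
    return {
        "size": 0,
        "query": {"match_all": {}},
        "aggs": {
            "distinct": {"cardinality": {"field": field}},
        },
    }


if __name__ == "__main__":
    # Hypothetical prefix and field name, for illustration only.
    index = daily_index_name("logs", date(2014, 10, 12))
    body = count_distinct_body("user_id")
    print(index)
    print(json.dumps(body, indent=2))
```

The body would be sent to the search endpoint of the daily index. Because the cardinality result is approximate, it is only comparable to an exact count distinct in Spark or Hive within some error margin.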
