Hi All,
I am currently using Elasticsearch for a logging system.
My first approach is to put every log into ES, with the index rolling over
by date.
For real-time stats, I use aggregations.
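To make the setup concrete, here is a minimal sketch of a date-based index name and an aggregation-only search body. The `logs` prefix and the `level` field are hypothetical placeholders, not names from my actual system.

```python
import json
from datetime import date


def daily_index_name(prefix: str, day: date) -> str:
    """Build the name of the index that holds one day of logs."""
    return f"{prefix}_{day:%Y.%m.%d}"


def per_level_stats_query() -> dict:
    """Search body that returns only aggregation results, no hits."""
    return {
        "size": 0,  # skip the hits, keep only the aggregation buckets
        "aggs": {
            # 'level' is an assumed field name used for illustration
            "per_level": {"terms": {"field": "level"}}
        },
    }


index = daily_index_name("logs", date(2014, 10, 12))
print(index)
print(json.dumps(per_level_stats_query(), indent=2))
```

The body would be sent to the `_search` endpoint of the index for the day of interest.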
For other statistics, I use Spark (or Hive, Shark, whatever) on the ES data
(thanks to the Elasticsearch-Hadoop plugin).
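For reference, a rough sketch of the connector settings a Spark or Hive job would read from. `es.nodes`, `es.port`, `es.resource` and `es.query` are Elasticsearch-Hadoop configuration keys; the host, index name, type name and the `level` filter are assumptions made up for this example.

```python
import json


def es_hadoop_conf(index: str, doc_type: str, query: dict) -> dict:
    """Build a settings dict for the Elasticsearch-Hadoop connector."""
    return {
        "es.nodes": "localhost",  # assumed host
        "es.port": "9200",
        "es.resource": f"{index}/{doc_type}",  # what the job reads from
        # Query evaluated inside ES, so only matching documents
        # are shipped to the job instead of the whole index.
        "es.query": json.dumps({"query": query}),
    }


conf = es_hadoop_conf("logs_2014.10.12", "log", {"term": {"level": "error"}})
print(json.dumps(conf, indent=2))
```

Whether a narrower `es.query` helps depends on the job. A full count over all documents still has to read every document out of ES.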
All is fine, but as my data grows (currently 17M records per index per day),
Spark (and Hive) becomes very slow.
I benchmarked the same data on ES and on Hadoop, and saw that Spark
(Hive) running on Hadoop is much faster.
Is there something I have missed?
Currently I maintain two stores: ES for real-time stats and
Hadoop for other statistics. Is that fine?
It depends on various factors. Do you put all the data under one index, or
is it one index per day/month/hour? What type of script and performance
degradation do you see? If it's easier, feel free to reach out on IRC. I'll
be traveling this week but will be back next week.
Cheers
On Oct 12, 2014 2:51 PM, "Sang Dang" zkidkid@gmail.com wrote:
[...]
Currently I just put all data in one index (INDEX_NAME_DATE).
In my benchmark, I run just two operations: a count and a count distinct on a field.
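As an illustration of those two operations, here is a small sketch: a search body using the `value_count` and `cardinality` aggregations, plus an exact in-memory version for comparison. The `user_id` field and the sample records are made up for this example.

```python
def count_and_distinct_query(field: str) -> dict:
    """Aggregation-only body: number of values and distinct values of a field."""
    return {
        "size": 0,  # no hits, only aggregation results
        "aggs": {
            "total": {"value_count": {"field": field}},
            # cardinality returns an approximate distinct count
            "distinct": {"cardinality": {"field": field}},
        },
    }


def count_and_distinct_local(records: list, field: str) -> tuple:
    """Exact reference computation over in-memory records."""
    values = [r[field] for r in records if field in r]
    return len(values), len(set(values))


sample = [{"user_id": "a"}, {"user_id": "b"}, {"user_id": "a"}]
print(count_and_distinct_query("user_id"))
print(count_and_distinct_local(sample, "user_id"))  # (3, 2)
```

Note that `cardinality` is approximate, so its result can differ slightly from an exact distinct count computed in Spark or Hive.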
P.S. Thanks for your fast response. I would be really happy to meet you on IRC
(just give me a time).
On Sunday, October 12, 2014 8:02:57 PM UTC+7, Costin Leau wrote:
[...]