Hi all,
we are going to receive a huge volume of logs and we are going to process them with the typical ELK stack. Because of that volume, we plan to keep only a week, or at most a month, of logs in Elasticsearch. After that time we plan to use the Elasticsearch-Hadoop integration, mainly as a long-term archive, so older logs are moved to Hadoop. Disk space is a very important issue here, so we plan to use Hadoop compression codecs such as gzip for the logs older than a month.
If we use compression in Hadoop, can we still index and graph the data stored in logs older than a month, or do the logs have to be stored in a raw format for that to work?
As far as I have read, if the logs are stored in Hadoop in raw form then, thanks to the elasticsearch-hadoop integration, Elasticsearch can index them, and Kibana can report seamlessly on both the current logs (stored locally on the Elasticsearch servers) and the older logs (in Hadoop). Please correct me if I am wrong. The question is whether we lose this capability if we compress the logs in Hadoop.
Any help is appreciated.
Thanks and best regards,
Rodrigo.
P.S.: Do not hesitate to challenge the architecture/data flow as well, if you think there are better ways to do it. Thanks.
Es-Hadoop leverages the existing Hadoop infrastructure, so whatever compression or splitting your infrastructure uses will simply work with Elasticsearch as well.
Take a Map/Reduce job: on the reading side you can use whatever InputFormat you are currently using (to deal with gzip or what have you), and use the Elasticsearch one as the OutputFormat. Everything is transparent and works through your existing infrastructure.
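As a minimal sketch of that wiring (the ES host, index name and HDFS path below are placeholders, not values from this thread), a map-only job using the old "mapred" API could look roughly like this, assuming each archived line is already a JSON document (e.g. as produced by Logstash before archiving):

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.elasticsearch.hadoop.mr.EsOutputFormat;

public class ArchiveToEs {

    // Forwards each log line unchanged; assumes the lines are JSON documents.
    public static class JsonLineMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, NullWritable, Text> {
        public void map(LongWritable offset, Text line,
                        OutputCollector<NullWritable, Text> output,
                        Reporter reporter) throws IOException {
            output.collect(NullWritable.get(), line);
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ArchiveToEs.class);
        conf.setSpeculativeExecution(false);            // recommended when writing to ES

        conf.set("es.nodes", "es-node-1:9200");         // placeholder ES host
        conf.set("es.resource", "logs-archive/entry");  // placeholder index/type
        conf.set("es.input.json", "yes");               // documents are passed as JSON strings

        // TextInputFormat applies the matching compression codec automatically,
        // so .gz input files are decompressed transparently on read.
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(EsOutputFormat.class);
        conf.setMapperClass(JsonLineMapper.class);
        conf.setMapOutputValueClass(Text.class);
        conf.setNumReduceTasks(0);                      // map-only job

        FileInputFormat.addInputPath(conf, new Path("/archive/logs/2015-03")); // placeholder path
        JobClient.runJob(conf);
    }
}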
Note that you don't need a raw format (whatever that is): as long as your data can be read into Hadoop Map/Reduce, Pig, Hive, Cascading, Storm or Spark, it can also be written/indexed to Elasticsearch. And vice-versa.
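The same idea applies outside plain Map/Reduce. For instance, a rough Spark sketch with the es-hadoop Spark support (again, host, path and index names are placeholders, and the input is assumed to be JSON lines) could read the gzipped archive and index it directly:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

public class SparkArchiveToEs {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("archive-to-es");
        conf.set("es.nodes", "es-node-1:9200");     // placeholder ES host

        JavaSparkContext sc = new JavaSparkContext(conf);

        // textFile() decompresses .gz files transparently through the Hadoop codecs.
        JavaRDD<String> jsonLines = sc.textFile("hdfs:///archive/logs/2015-03/*.gz");

        // Index the JSON documents into a placeholder index/type.
        JavaEsSpark.saveJsonToEs(jsonLines, "logs-archive/entry");

        sc.stop();
    }
}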
On 4/10/15 12:11 PM, Rodrigo Merino wrote: