I am new to Elasticsearch, which I understand can do much more than this...
but could it be used just for that?
I am storing 100GB of log files daily. The data scientists require this log
data to not contain duplicate log lines. Duplicates may occur within the
same log file or across two sequential log files - it's best to expect any
possible scenario.
What I would like to achieve is to use Elasticsearch to detect & remove the
duplicate log lines from all logs in an HDFS directory. Can this be done?
You could do it by using a hash of the unique bits of the log line as the
document ID. Most systems that let you pick your own IDs would support this
approach, though, so it isn't special to Elasticsearch.
The trouble is that most operations in Elasticsearch are async and
non-atomic. Operations on a document ID, however, are atomic and
synchronous, which is what makes the hash-as-ID trick safe.
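To make that concrete, here's a minimal sketch of the hash-as-ID idea in Python, talking to the Elasticsearch REST API through the requests library. The index name "logs", the input file name, and the local node URL are all placeholders; the _doc endpoint matches recent Elasticsearch versions, while the 1.x clusters current when this thread was written used an explicit type name in the URL instead:

    import hashlib
    import requests

    ES_URL = "http://localhost:9200"  # assumed local node
    INDEX = "logs"                    # hypothetical index name

    def index_line(line):
        """Index one log line, using its SHA-1 hash as the document ID.

        A duplicate line hashes to the same ID, so the second write just
        overwrites the first document instead of creating a new one --
        that's the whole deduplication trick.
        """
        doc_id = hashlib.sha1(line.encode("utf-8")).hexdigest()
        requests.put("%s/%s/_doc/%s" % (ES_URL, INDEX, doc_id),
                     json={"message": line})

    with open("app.log") as f:        # hypothetical input file
        for line in f:
            index_line(line.rstrip("\n"))

If you'd rather reject duplicates than overwrite them, appending ?op_type=create makes Elasticsearch fail the second write with a conflict. And at 100GB a day you'd want the _bulk API rather than one HTTP request per line, but the ID trick is the same.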
All in all, it's not a horrible choice, but it's not the first tool I'd
reach for. If your researchers want the data in Elasticsearch in the end,
I'd go with the hash hack. If not, I'd investigate some more log-processing
tools.
It sounds like a fun problem to put together a solution for, but I'm
reasonably sure someone has already done this. OTOH, a mostly-right system
that uses hashing to push data to the right node and time-bucketed bloom
filters to catch duplicates would be fun to build and could probably be
tuned to be pretty good. But I'm sure smarter people than me have already
solved this problem, and open sourced the solution.
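For what that could look like, here's a minimal sketch of time-bucketed bloom filters in plain Python. The bucket length, filter size, and hash count are made-up numbers; you'd size them from your real volume and the false-positive rate you can live with:

    import hashlib
    import time

    class BloomFilter:
        """A tiny bloom filter: k hash probes into a fixed-size bit array."""

        def __init__(self, size_bits=8 * 1024 * 1024, num_hashes=5):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, item):
            # Derive k probe positions from a single SHA-256 digest.
            digest = hashlib.sha256(item).digest()
            for i in range(self.num_hashes):
                chunk = digest[i * 4:(i + 1) * 4]
                yield int.from_bytes(chunk, "big") % self.size

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    class TimeBucketedDeduper:
        """One bloom filter per time bucket; old buckets get dropped, so
        memory stays bounded while duplicates arriving within the
        retention window are still caught."""

        def __init__(self, bucket_seconds=3600, num_buckets=24):
            self.bucket_seconds = bucket_seconds
            self.num_buckets = num_buckets
            self.buckets = {}  # bucket index -> BloomFilter

        def is_duplicate(self, line, now=None):
            now = time.time() if now is None else now
            current = int(now) // self.bucket_seconds
            item = line.encode("utf-8")
            # Duplicate if any retained bucket has (probably) seen the line.
            seen = any(bf.might_contain(item) for bf in self.buckets.values())
            self.buckets.setdefault(current, BloomFilter()).add(item)
            # Expire buckets that fell out of the retention window.
            for key in list(self.buckets):
                if key <= current - self.num_buckets:
                    del self.buckets[key]
            return seen

The catch is that bloom filters give false positives, so a small fraction of genuinely unique lines would be dropped as "duplicates" - whether that's acceptable depends on how strict your data scientists are.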
This sounds like something you could do in a pre-processing pipeline before
indexing with Elasticsearch. Have you heard of Logstash? It is designed
to slurp up logs, filter them (including detecting and removing
duplicates), and then insert them into Elasticsearch (or elsewhere). It can
handle live streaming of logs or can be run on existing log files.
Definitely check it out, but I'd imagine that if not Logstash, some other
kind of pre-processing is going to be your best bet.
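For the duplicate part specifically, one common recipe is Logstash's fingerprint filter: hash each line and use the hash as the Elasticsearch document ID, so re-ingesting a duplicate just overwrites the existing document. A rough sketch - the index name and hosts are placeholders, and exact option names can vary between Logstash versions:

    filter {
      fingerprint {
        source => "message"                  # hash the raw log line
        target => "[@metadata][fingerprint]" # keep the hash out of the indexed doc
        method => "MURMUR3"
      }
    }
    output {
      elasticsearch {
        hosts => "localhost:9200"            # assumed local node
        index => "logs"                      # hypothetical index name
        document_id => "%{[@metadata][fingerprint]}"
      }
    }

One design note: MURMUR3 is fast but only 32 bits, so at 100GB a day a SHA-based method would be a safer choice for the ID.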
Regards,
Joshua