Use case question - Can Elasticsearch be used as a log de-duplication solution?

Hi,

I am new to Elasticsearch, which I understand can do much more than this,
but could it be used just for that?

I am storing 100 GB of log files daily. The data scientists require this log
data to not contain duplicate log lines. Duplicates may occur within the
same log file or across two sequential log files; it is safest to expect any
possible scenario.

What I would like to achieve is to use Elasticsearch to detect and remove the
duplicate log lines from all logs in an HDFS directory. Can this be done?

Thank you,
Mihai

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/a2dd6d7e-698f-4e03-908f-17358c902f6c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

You could do it by using a hash of the unique bits of each log line as the
document ID. But most systems would support this.

The trouble is that most operations in Elasticsearch are asynchronous and
non-atomic. Operations on a document ID, however, are atomic and synchronous.
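To make the hash-as-ID trick concrete, here is a minimal sketch in Python. It uses only the standard library: a dict stands in for the Elasticsearch index, and the skip-if-ID-exists check plays the role that indexing with `op_type=create` on a content-hash ID would play against a real cluster. All function and variable names here are mine, not from any library.

```python
import hashlib

def doc_id(line: str) -> str:
    """Derive a deterministic document ID from the log line's content."""
    return hashlib.sha256(line.encode("utf-8")).hexdigest()

def index_unique(lines, index):
    """Index lines into `index` (a dict standing in for an Elasticsearch
    index), skipping any line whose content hash is already present --
    the same effect as a create on an existing ID failing atomically."""
    added = 0
    for line in lines:
        _id = doc_id(line)
        if _id not in index:  # in ES: a version-conflict error instead
            index[_id] = line
            added += 1
    return added

logs = [
    "2015-03-02 09:45:01 INFO started",
    "2015-03-02 09:45:02 WARN retry",
    "2015-03-02 09:45:01 INFO started",  # exact duplicate line
]
store = {}
print(index_unique(logs, store))  # 2 -- the duplicate is dropped
print(len(store))                 # 2
```

Because the ID is derived purely from the line's content, the same line always maps to the same document, no matter which file or batch it arrives in.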

All in all, it's not a horrible choice, but it's not the first tool I'd reach
for. If your researchers want the data in Elasticsearch in the end, I'd go
with the hash hack. If not, I'd investigate dedicated log-processing tools.

It sounds like a fun problem to put together a solution for, but I'm
reasonably sure someone has already done this. On the other hand, a
mostly-right system that uses hashing to route data to the right node, plus
time-bucketed Bloom filters, would be fun to build and could probably be
tuned to be pretty good. But I'm sure smarter people than me have already
solved this problem and open-sourced the solution.
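For what the "time-bucketed Bloom filters" idea could look like, here is a rough standard-library Python sketch; every class and parameter name is my own invention. One Bloom filter per time bucket bounds memory, at two costs: Bloom filters can report false positives (a genuinely new line may be dropped as a "duplicate"), and once an old bucket is evicted, very old duplicates would be re-admitted.

```python
import hashlib
from collections import OrderedDict

class BloomFilter:
    """A tiny Bloom filter: k hash probes into an m-bit array."""
    def __init__(self, m_bits=1 << 20, k=4):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _probes(self, item: str):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item: str):
        for p in self._probes(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._probes(item))

class TimeBucketedDedup:
    """Keep one Bloom filter per time bucket. A line counts as a duplicate
    if any live bucket has (probably) seen it; the oldest buckets are
    evicted to bound memory."""
    def __init__(self, max_buckets=24):
        self.max_buckets = max_buckets
        self.buckets = OrderedDict()  # bucket key -> BloomFilter

    def seen(self, line: str, bucket_key) -> bool:
        if any(line in bf for bf in self.buckets.values()):
            return True
        bf = self.buckets.setdefault(bucket_key, BloomFilter())
        bf.add(line)
        while len(self.buckets) > self.max_buckets:
            self.buckets.popitem(last=False)  # evict the oldest bucket
        return False

dedup = TimeBucketedDedup(max_buckets=2)
print(dedup.seen("INFO started", "2015-03-02T09"))  # False -- first sighting
print(dedup.seen("INFO started", "2015-03-02T10"))  # True  -- duplicate
```

Routing each line to a node by hash of its content (so all copies of a line land on the same node's filters) is what would make this scale out, but that part is left out of the sketch.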

Nik

On Mon, Mar 2, 2015 at 9:45 AM, Mihai Lucaciu mlucaciu@gmail.com wrote:



Hi Mihai,

This sounds like something you could do in a pre-processing pipeline before
indexing with Elasticsearch. Have you heard of Logstash? It is designed to
slurp up logs, filter them (including detecting and removing duplicates),
and then insert them into Elasticsearch (or elsewhere). It can handle live
streaming of logs or be run on existing log files. Definitely check it out,
but I'd imagine that if not Logstash, some other kind of pre-processing is
going to be your best bet.
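One way this is commonly wired up in Logstash is with the fingerprint filter: hash each line's content, then use the hash as the Elasticsearch document ID so duplicates overwrite each other instead of piling up. The sketch below is illustrative, not a drop-in config; the file path, index name, and hosts are placeholders you would replace with your own.

```
input {
  file {
    path => "/var/log/app/*.log"   # placeholder path
    start_position => "beginning"
  }
}
filter {
  # Hash the raw line; identical lines get identical fingerprints.
  fingerprint {
    source => "message"
    target => "[@metadata][fingerprint]"
    method => "SHA256"
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "logs-deduped"        # placeholder index name
    # Duplicate lines map to the same ID and overwrite each other.
    document_id => "%{[@metadata][fingerprint]}"
  }
}
```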

Regards,

Joshua

On Tuesday, 3 March 2015 01:45:18 UTC+11, Mihai Lucaciu wrote:

