Comparing 2 sources of log input (using fuzzy? hash? term?)


(Noel) #1

I need to compare 2 log files from 2 different hosts (eventually, 2 sources of breadcrumbs). How can I do this with Logstash and Elasticsearch?


Comparing 2 indices or 2 set of docs
(Mark Harwood) #2

What sort of comparison?

  • equals vs not equals?
  • line diffs?
  • lines on a time graph showing log volumes by time period?

(Noel) #3

Comparing XML data, maybe on one or more fields.
Equals vs. not equals.
I don't need line diffs because both log files will keep growing.
I don't care about the volumes.
For example,
log A:
12345
...
12346
...
12347

log B:
12345
...
...
12347

I want to use Logstash + Elasticsearch to detect that log B is missing data (field1 = 12346).

Thanks a lot
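
A minimal sketch of the check described above, in plain Python rather than Logstash or Elasticsearch. It assumes the field1 values have already been extracted from each log into a list; the function name `missing_ids` is hypothetical and only for illustration.

```python
def missing_ids(ids_a, ids_b):
    """Return the IDs present in log A but absent from log B, sorted ascending."""
    return sorted(set(ids_a) - set(ids_b))


# The field1 values from the example above.
log_a = [12345, 12346, 12347]
log_b = [12345, 12347]

print(missing_ids(log_a, log_b))  # [12346]
```

Because both logs keep growing, a value reported as missing may simply not have arrived in log B yet, so a check like this is only meaningful over a window that both logs have fully covered.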


(Mark Harwood) #4

Can you give a brief example of the 2 inputs and your ideal response?
I'm now clearer on what format your data comes in, but not on what sort of comparison you are looking for.


(Mark Harwood) #5

So you want to know the instant that two continually updated files become inconsistent?
How are you handling the timing of when these comparisons will run?

If a simple "not equal" comparison is required on stable sets then computation of a hash would make sense for efficient comparison.
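
As a rough illustration of the hash idea, here is a sketch in Python using the standard `hashlib` module. It assumes each side has a stable set of IDs already in hand; the helper name `digest_of_ids` is hypothetical. The IDs are deduplicated and sorted first so that arrival order does not affect the result.

```python
import hashlib


def digest_of_ids(ids):
    """Compute an order-independent SHA-256 digest of a collection of IDs."""
    h = hashlib.sha256()
    for value in sorted({str(i) for i in ids}):
        h.update(value.encode("utf-8"))
        h.update(b"\n")  # separator, so "12" + "3" hashes differently from "1" + "23"
    return h.hexdigest()


# Same IDs in a different order: the digests are equal.
in_sync = digest_of_ids([12345, 12346, 12347]) == digest_of_ids([12347, 12346, 12345])

# One ID missing on one side: the digests differ.
out_of_sync = digest_of_ids([12345, 12346, 12347]) != digest_of_ids([12345, 12347])

print(in_sync, out_of_sync)
```

A differing digest only says that the two sets are not equal. It does not say which entry is missing, so a finer comparison is still needed once a mismatch is detected.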


(Noel) #6

True, the timing may be off. Both log files are in XML format.
So, something like this: pick a record from log1 and look for the same record in log2.
- If found, pick the next record from log1 and look for it in log2; if it matches, the two logs are in sync.
- If not found, the two logs are out of sync.

Do you have an example? How does a hash 'not equal' comparison work?
Should I run 2 Logstash instances and have both of them feed Elasticsearch?

Thanks


(Mark Harwood) #7

Your approach would be too slow if it means running a search for each record: searches involve disk seeks, which are expensive hardware operations regardless of the software you use.
It always makes sense to reduce the number of seeks. If you can process each set of data as a stream and compute a single value (number of docs? size of all docs? hash of contents?), you can then compare those 2 values much more cheaply.
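
A sketch of that single-pass idea in Python, assuming each source can be read as a binary stream of lines; `summarize` is a hypothetical helper, and in-memory `io.BytesIO` buffers stand in for the real log files.

```python
import hashlib
import io


def summarize(stream):
    """Read a binary stream once and return (line_count, byte_count, sha256_hex)."""
    h = hashlib.sha256()
    lines = 0
    size = 0
    for line in stream:
        lines += 1
        size += len(line)
        h.update(line)
    return lines, size, h.hexdigest()


# In-memory stand-ins for the two log files.
summary_a = summarize(io.BytesIO(b"12345\n12346\n12347\n"))
summary_b = summarize(io.BytesIO(b"12345\n12347\n"))

print(summary_a[:2], summary_b[:2])
print("in sync" if summary_a == summary_b else "out of sync")
```

The counts and sizes are cheap first checks, but equal counts do not prove equal contents; the hash covers that case. The hash here is order-sensitive, so it only suits sources that write the same records in the same order.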

