Data loss in Logstash!


(sreeram) #1

Hi,

I'm using ELK for centralized logging, and I'm facing data loss while processing 500,000 (5 lakh) logs (Kibana hits).

  • (Logstash (shipper & indexer on the same instance)) Machine 1 -> (Elasticsearch -> Kibana) Machine 2

Scenario for data loss

  • Logstash started reading log files containing 500,000 logs, and I can see the Kibana hit count increasing.
  • While reading, Elasticsearch becomes unavailable due to a network issue between Machine 1 and Machine 2.
  • I have configured the Logstash output to retry for 10 minutes (retry count 120, interval 5 seconds).
  1. Why am I facing data loss in this scenario?
  2. In the sincedb file, what will the offset position be? (The position of logs read successfully, or the position of logs that reached Elasticsearch successfully?)
  3. How do I handle this scenario (Elasticsearch not available) without data loss?
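
For context, the retry behaviour described above (120 retries at 5-second intervals, about 10 minutes) would be expressed through the elasticsearch output's retry settings. A minimal sketch, assuming the Logstash 1.5-era option names max_retries and retry_max_interval (the values below are only the ones described, not recommendations):

output {
  elasticsearch {
    host => "10.2.44.124"
    protocol => "http"
    # hypothetical values matching the description: 120 retries x 5 s = 10 min
    max_retries => 120
    retry_max_interval => 5
  }
}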

(Magnus Bäck) #2

How do you know that data has been lost? Delayed, sure, but permanently lost? Please explain how you reached that conclusion.

  1. There shouldn't be any data loss. When any output stalls, the whole Logstash pipeline stalls and Logstash stops reading from the files.
  2. It's the number of bytes read and passed into the pipeline. The pipeline only buffers 20 (or is it 20+20?) events, so you should never lose more than that.

(sreeram) #3

Scenario 1 (no network issue)

  • Number of logs in log files = Kibana hits = 500,000

Scenario 2 (network issue)

  • Number of logs in log files = 500,000
  • Kibana hits = 450,000 (varies each time)

Logstash config for your reference:

input {
  file {
    path => ["D:/logpath/**/*.txt"]
    codec => plain { charset => "UTF-16" }
    start_position => "beginning"
    sincedb_path => "D:/since.db"
  }
}

filter {
  multiline {
    # Grok pattern names are valid! :slight_smile:
    pattern => "\d\t(?!$)"
    negate => "true"
    what => "previous"
  }

  mutate {
    gsub => ["message", "\n", "!n!"]
    gsub => ["message", '"', "!dq!"]
    gsub => ["message", "'", "!sq!"]
  }

  csv {
    columns => ["modulename", "threadid", "datedon", "logtype", "logdescription"]
    # separator is a literal tab character
    separator => "	"
  }

  date {
    locale => "en"
    timezone => "UTC"
    match => [ "datedon", "dd-MM-yyyy HH:mm:ss Z", "dd-MMM-yyyy HH:mm:ss Z", "dd/MM/yyyy h:mm:ss a Z", "dd/MM/yyyy hh:mm:ss a Z", "MM/dd/yyyy hh:mm:ss a Z", "M/dd/yyyy hh:mm:ss a Z", "MM/dd/yyyy h:mm:ss a Z" ]
  }

  mutate {
    remove_field => ["column6", "datedon"]
  }

  mutate {
    convert => { "threadid" => "integer" }
    gsub => ["message", "!n!", "
"]
    gsub => ["logdescription", "!n!", "
"]
    gsub => ["message", "!dq!", '"']
    gsub => ["logdescription", "!dq!", '"']
    gsub => ["message", "!sq!", "'"]
    gsub => ["logdescription", "!sq!", "'"]
  }
}

output {
  elasticsearch {
    host => "10.2.44.124"
    protocol => "http"
    workers => 3
    flush_size => 50000
    max_retries => 100
  }
}


(Magnus Bäck) #4

Okay. I've seen cases at least with Logstash 1.4.2 where it gets upset when ES is unavailable and you have to restart it to get it going again—have you tried that? Also, what's in the sincedb file? Does Logstash think it has read everything that's in the input files, or is there unread data that it for some reason isn't trying to ship to ES?


(sreeram) #5
  • Will check whether restarting works (but in production I can't restart every time a network issue occurs).
    -- How should I handle that?

  • Will check the sincedb offset and post it here.
    -- Can you please explain: does Logstash move the sincedb offset (pointer) forward immediately after it has read a log line?

My environment details, FYI:
OS => Windows 7 64-bit
Logstash 1.5.2
Elasticsearch 1.6.2


(Magnus Bäck) #6

Will check whether restarting works (but in production I can't restart every time a network issue occurs).
-- How should I handle that?

Let's understand the nature of the problem first.

Can you please explain: does Logstash move the sincedb offset (pointer) forward immediately after it has read a log line?

That's controlled by the sincedb_write_interval configuration parameter.
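
For reference, a minimal sketch of how that parameter would look in the file input from the config above (the 5-second value is only an example):

input {
  file {
    path => ["D:/logpath/**/*.txt"]
    sincedb_path => "D:/since.db"
    # write the sincedb every 5 seconds instead of the 15-second default
    sincedb_write_interval => 5
  }
}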


(sreeram) #7

Data loss resolved!
Previously, in the Logstash output plugin configuration, I had:
flush_size => 50000
retry_item_count => 5000 (default)
When I changed this to:
flush_size => 5000 (default)
retry_item_count => 5000 (default)
the data loss on network failure was resolved :slight_smile:
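
A plausible reading of the fix: with flush_size (50000) far larger than the retry buffer (5000 items), most of a failed bulk request could never be re-queued, so keeping the flush at the 5000 default keeps every failed batch within what the retry buffer can hold. The output section from the config above would then look roughly like:

output {
  elasticsearch {
    host => "10.2.44.124"
    protocol => "http"
    workers => 3
    # flush_size left at its 5000 default so a failed bulk request
    # never exceeds the retry buffer's capacity
    flush_size => 5000
    max_retries => 100
  }
}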


(sreeram) #8

While testing, I observed the following:

  1. java.exe is the process that holds the flush buffer and the offset (number of lines read).
  2. During a network failure, if I kill java.exe and restart the Logstash service, data gets duplicated.
     Why is this happening?
  3. Can I reduce sincedb_write_interval from 15 seconds (default) to 5 seconds?
  4. My production environment has a very slow network (less than 256 kbps). Is it possible to compress the data in a flush before pushing it to Elasticsearch?

It would be very helpful for my understanding of how Logstash works if these get clarified.


(Magnus Bäck) #9

During a network failure, if I kill java.exe and restart the Logstash service, data gets duplicated.
Why is this happening?

Because killing a Windows process doesn't allow it to shut down in an orderly fashion and do things like flush the sincedb. I'm assuming that by "kill" you mean End Process in Task Manager or something equivalent that eventually ends in a TerminateProcess() Win32 call.

Can I reduce sincedb_write_interval from 15 seconds (default) to 5 seconds?

Yes, certainly.

My production environment has a very slow network (less than 256 kbps). Is it possible to compress the data in a flush before pushing it to Elasticsearch?

I don't think that's possible out of the box. You'd probably have to build some kind of proxy or transparent middleman that does this. Or you could rearchitect your setup and, e.g., ship logs in compressed form to the same network location as ES and do the Logstash processing there.


(sreeram) #10

Yes, I can rearchitect, but I'm more interested in the idea of building a proxy or transparent middleman for compression. I have no idea how to do that, though; can you please explain how to implement such a setup?


(Magnus Bäck) #11

I was thinking about something like Ziproxy. I don't have any particular experiences to share.
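
A sketch of the proxy idea, assuming a compressing HTTP proxy (Ziproxy or similar, configured separately) listening on localhost:9201 and forwarding requests to the real Elasticsearch node at 10.2.44.124:9200. Only the Logstash side is shown; the host and port values are hypothetical:

output {
  elasticsearch {
    protocol => "http"
    # hypothetical: talk to a local compressing proxy instead of ES directly
    host => "localhost"
    port => 9201
  }
}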


#12

I have the same problem: Logstash is losing data. Can you give me any advice?

