Best practice to track data loss when sending data to Elasticsearch

I'm wondering whether there are any suitable ways to detect data loss when sending data via Logstash to an Elasticsearch endpoint.

Ways I know

  1. Check the logstash.log file to see if any errors are there.
    1.1 Works, but it seems to need a lot of manual work, and we cannot do much if we hit timeout errors.
  2. Check the number of lines generated by the service, and validate that the same number of lines appears in Kibana.
    2.1 Works, but it is hard to track all the files continuously.

What I need

  1. An automatic way to track the percentage of files that are sent to Elasticsearch correctly
  2. A notification when there is severe data loss

Logstash version: 2.1

Thanks!

Example of a timeout error (this is AWS Elasticsearch, so it may differ from open-source Logstash):

{:timestamp=>"2018-04-02T20:51:22.822000+0000", :message=>"Attempted to send a bulk request to Elasticsearch configured at '[\"https://fake_endpoint.es.amazonaws.com:443\"]', but an error occurred and it failed! Are you sure you can reach elasticsearch from this machine using the configuration provided?", :client_config=>{:hosts=>["https://fake_endpoint.es.amazonaws.com:443"], :region=>"us-east-1", :aws_access_key_id=>nil, :aws_secret_access_key=>nil, :aws_odin_material_set=>nil, :transport_options=> . {:request=>{:open_timeout=>0, :timeout=>60}, :proxy=>nil}, :transport_class=>Elasticsearch::Transport::Transport::HTTP::AWS, :logger=>nil, :tracer=>nil, :reload_connections=>false, :retry_on_failure=>false, :reload_on_failure=>false, :randomize_hosts=>false}, :error_message=>"fake_endpoint.es.amazonaws.com:443 failed to respond", :error_class=>"Faraday::ClientError", :backtrace=>nil, :level=>:error}

{:timestamp=>"2018-04-02T20:51:22.824000+0000", :message=>"Failed to flush outgoing items", :outgoing_count=>1, :exception=>"Faraday::ClientError", :backtrace=>nil, :level=>:warn}

Hey Raychen,

I don't know what the best method is, but I can speak to what we've done to try to track data loss and issues getting data into Elasticsearch.

First off, there are two main areas we wanted to monitor:

  1. Elasticsearch performance issues, resulting in 5xx's
  2. Bad documents, resulting in 4xx's (if I remember correctly)

For the first case, we have a queue in the middle of our Logstash layer, and we have a monitor on that queue. If the queue starts building up, it is a good indication that Elasticsearch ingest cannot keep up with the volume of documents being sent. Logstash will ultimately retry these messages, so there shouldn't be any data loss. However, if you don't have a queue at all, Logstash can eventually back up and you will lose messages.

In the second case, documents can be dropped if they're in an invalid format, hit mapping conflicts, etc. Here you can use the Logstash dead letter queue (DLQ) to write these bad documents to a file: https://www.elastic.co/guide/en/logstash/current/dead-letter-queues.html
You could monitor this queue by its size, or even set up an additional pipeline that picks up these DLQ documents, processes them into a 'quarantine' index, and then monitor the size of that index (see the sketch below).
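
A minimal sketch of such a quarantine pipeline, assuming a default DLQ path and a hypothetical quarantine index name (note that the DLQ only exists in Logstash releases much newer than 2.1):

    # Hypothetical pipeline: re-read events that Elasticsearch rejected
    input {
      dead_letter_queue {
        path => "/var/lib/logstash/dead_letter_queue"  # assumed data path
        commit_offsets => true  # remember the read position across restarts
      }
    }
    output {
      elasticsearch {
        hosts => ["https://fake_endpoint.es.amazonaws.com:443"]
        index => "quarantine-%{+YYYY.MM.dd}"  # hypothetical quarantine index
      }
    }

If I remember right, each DLQ entry also carries the rejection reason under [@metadata][dead_letter_queue], so you can index why each document failed alongside the document itself.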

We also have a basic little application that sends x number of events to Logstash every y minutes, and then checks that all those events are available in Kibana within a certain period of time.
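
If you'd rather not maintain a custom app, the sending half of this can be approximated with Logstash's heartbeat input plugin (the checking half would still need a scheduled count query against the index). A minimal sketch, with the interval and message chosen arbitrarily:

    # Emit a synthetic event on a fixed interval
    input {
      heartbeat {
        interval => 300          # one event every 5 minutes (arbitrary)
        message  => "heartbeat"  # payload to search for in Kibana
      }
    }

You can then alert whenever the count of heartbeat documents over a given window drops below the expected number.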

I appreciate this isn't the most helpful answer, but I don't think there's a simple, clean way to measure the number of documents lost. I'd probably focus instead on increasing resiliency so that messages are persisted, and monitor that.

Cheers,
Mike

Hey Michael,

Appreciate your suggestions. Yeah, the DLQ sounds great!
Just one more follow-up on the first case: can I ask for some more details about the queue in the middle of the Logstash layer? How do I create this queue? Is it a feature provided by Logstash? Could you point me to some references?

Thanks!

Logstash does provide persistent queues, which achieve a similar purpose: events are written to disk and are only removed once they have been successfully received by your output (Elasticsearch). See https://www.elastic.co/guide/en/logstash/current/persistent-queues.html and the config sketch below.
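
A minimal logstash.yml sketch to enable it (the size and path here are arbitrary, and persistent queues also require a much newer Logstash than 2.1):

    # logstash.yml
    queue.type: persisted                # default is "memory"
    queue.max_bytes: 4gb                 # cap on the on-disk queue (arbitrary)
    path.queue: /var/lib/logstash/queue  # assumed location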

Another option is to have two Logstash layers: one that simply receives events and puts them on a queue (e.g. Kafka, Redis, etc.), and another that reads from this queue, applies filters, and then outputs to Elasticsearch:

Source -> Logstash -> Kafka/Redis/etc. -> Logstash -> Elasticsearch
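
A minimal sketch of the two layers, assuming Kafka with a hypothetical "logs" topic and broker address (the beats input on the first layer is just an example source):

    # Layer 1 (shipper): receive events and put them on Kafka
    input {
      beats { port => 5044 }  # hypothetical source
    }
    output {
      kafka {
        bootstrap_servers => "kafka:9092"  # assumed broker
        topic_id => "logs"                 # hypothetical topic
        codec => json
      }
    }

    # Layer 2 (indexer): read from Kafka, filter, send to Elasticsearch
    input {
      kafka {
        bootstrap_servers => "kafka:9092"
        topics => ["logs"]
        codec => json
      }
    }
    output {
      elasticsearch { hosts => ["https://fake_endpoint.es.amazonaws.com:443"] }
    }

Because Kafka retains messages on disk, a slow or unavailable Elasticsearch just means the indexer falls behind rather than dropping events, and you can monitor consumer lag on the topic as your queue-depth signal.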


Personally I wouldn't look into adding a persistent queue just yet. I'd first figure out whether the data loss is due to bad documents or to Elasticsearch performance. Adding the DLQ should help you find out: you'll be able to see any documents that have been rejected, and why.

That's great! Thanks for the detailed explanation!
