Wondering if there is any suitable way to check for data loss when sending data via Logstash to an Elasticsearch endpoint.
Ways I know:
1. Check the logstash.log file to see if any errors are there.
   1.1 Works, but seems to need a lot of manual work, and we cannot do much about Timeout errors.
2. Check the number of lines generated by the service, and validate that the same number of lines show up in Kibana.
   2.1 Works, but hard to track all the files continuously.
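The line-count comparison above could be automated with a small script. A minimal sketch, assuming a local log file and the Elasticsearch `_count` API; the endpoint, index name, and file path are placeholders:

```python
import json
import urllib.request

def count_local_lines(path):
    """Count the lines the service wrote to a local log file."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)

def count_indexed_docs(es_url, index):
    """Ask Elasticsearch how many documents the index holds (_count API)."""
    with urllib.request.urlopen(f"{es_url}/{index}/_count") as resp:
        return json.load(resp)["count"]

def loss_percent(sent, indexed):
    """Percentage of events that never made it into the index."""
    if sent == 0:
        return 0.0
    return max(0.0, (sent - indexed) / sent * 100.0)
```

Run periodically (e.g. from cron), this would give the percentage tracked per file; alerting on a threshold covers the notification requirement.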
What I need:
1. An automatic way to track the percentage of files that are sent to Elasticsearch correctly.
2. A notification when there is severe data loss.
Logstash version: 2.1
Thanks!
Example of a Timeout error (this is AWS Elasticsearch, which may differ from open-source Logstash):
{:timestamp=>"2018-04-02T20:51:22.822000+0000", :message=>"Attempted to send a bulk request to Elasticsearch configured at '[\"https://fake_endpoint.es.amazonaws.com:443\"]', but an error occurred and it failed! Are you sure you can reach elasticsearch from this machine using the configuration provided?", :client_config=>{:hosts=>["https://fake_endpoint.es.amazonaws.com:443"], :region=>"us-east-1", :aws_access_key_id=>nil, :aws_secret_access_key=>nil, :aws_odin_material_set=>nil, :transport_options=>{:request=>{:open_timeout=>0, :timeout=>60}, :proxy=>nil}, :transport_class=>Elasticsearch::Transport::Transport::HTTP::AWS, :logger=>nil, :tracer=>nil, :reload_connections=>false, :retry_on_failure=>false, :reload_on_failure=>false, :randomize_hosts=>false}, :error_message=>"fake_endpoint.es.amazonaws.com:443 failed to respond", :error_class=>"Faraday::ClientError", :backtrace=>nil, :level=>:error}
{:timestamp=>"2018-04-02T20:51:22.824000+0000", :message=>"Failed to flush outgoing items", :outgoing_count=>1, :exception=>"Faraday::ClientError", :backtrace=>nil, :level=>:warn}
I don't know what the best method is, but I can speak to what we've done to try to track data loss and issues getting data to Elasticsearch.
First off, there are two main areas we wanted to monitor:
1. Elasticsearch performance issues, resulting in 5xx responses.
2. Bad documents, resulting in 4xx responses (if I remember correctly).
In the first case, we have a queue in the middle of our Logstash layer, so we have a monitor on this queue. If the queue starts building up, it's a good indication that Elasticsearch ingest cannot keep up with the volume of documents being sent. Logstash will ultimately retry these messages, so there shouldn't be any data loss. However, if you don't have a queue at all, Logstash can eventually back up and you will lose messages.
In the second case, documents can be dropped if they're in an invalid format, have mapping conflicts, etc. Here you can use the Logstash DLQ to write these bad documents to a file: https://www.elastic.co/guide/en/logstash/current/dead-letter-queues.html
You could monitor this queue by size, or even set up an additional pipeline to pick up these DLQ documents, process them into a 'quarantine' index, and monitor the size of that.
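To sketch what that quarantine setup might look like (note the DLQ requires Logstash 5.5 or later, so it wouldn't be available on 2.1 without upgrading; paths, hosts, and the index name below are placeholders):

```
# logstash.yml -- enable the dead letter queue
dead_letter_queue.enable: true

# quarantine pipeline -- reread DLQ entries and index them for inspection
input {
  dead_letter_queue {
    path => "/var/lib/logstash/dead_letter_queue"   # default DLQ location
    commit_offsets => true
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]                     # placeholder endpoint
    index => "quarantine-%{+YYYY.MM.dd}"
  }
}
```

A document count or disk-size alert on the quarantine index then tells you both that loss is happening and, via the DLQ metadata, why each document was rejected.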
We also have a basic little application that sends x events to Logstash every y minutes, and then checks that all of those events are available in Kibana within a certain period of time.
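A minimal sketch of such a canary checker, assuming a Logstash tcp input with a json_lines codec and a direct Elasticsearch query; all hosts, ports, and the index pattern are placeholder assumptions:

```python
import json
import socket
import time
import urllib.request
import uuid

def make_canary_events(n, run_id):
    """Build n synthetic events tagged with a unique run id."""
    return [json.dumps({"canary": True, "run_id": run_id, "seq": i})
            for i in range(n)]

def send_to_logstash(events, host="localhost", port=5000):
    """Ship events to an assumed Logstash tcp/json_lines input."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(("\n".join(events) + "\n").encode())

def count_in_elasticsearch(es_url, index, run_id):
    """Count how many canary events for this run id were indexed."""
    query = json.dumps({"query": {"term": {"run_id": run_id}}}).encode()
    req = urllib.request.Request(
        f"{es_url}/{index}/_count", data=query,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["count"]

if __name__ == "__main__":
    run_id = uuid.uuid4().hex
    events = make_canary_events(100, run_id)
    send_to_logstash(events)
    time.sleep(60)  # allow the pipeline to flush
    indexed = count_in_elasticsearch("http://localhost:9200", "logs-*", run_id)
    if indexed < len(events):
        print(f"ALERT: {len(events) - indexed} of {len(events)} "
              f"canary events missing")
```

The unique run id keeps each check independent, so a delayed batch from an earlier run can't mask loss in the current one.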
I appreciate this isn't a direct answer, but I don't think there's a simple, clean way to measure the number of documents lost. I'd probably focus instead on increasing resiliency so the messages are persisted, and monitor that instead.
Appreciate your suggestions; yeah, the DLQ sounds great!
Just one more follow-up on the first case: can you share some more details about the queue in the middle of the Logstash layer? How do I create this queue? Is it a feature provided by Logstash? Any references would be appreciated.
Another option is to have two Logstash layers: one that simply receives events and puts them on a queue (e.g. Kafka, Redis, etc.), and another layer that reads from this queue, applies filters, and then outputs to Elasticsearch.
Personally I wouldn't look into adding a persistent queue just yet; I'd first figure out whether the data loss is due to bad documents or to Elasticsearch performance. Adding the DLQ should help you find out: you'll be able to see any documents that have been rejected, and why.