Logstash (ELK) not streaming data in real time from DynamoDB, duplicates on restart & fetching in some random order

We're facing multiple issues while using the ELK stack. We suspect they're Logstash configuration issues. The issues are as follows:

  1. Logstash connected to DynamoDB Streams isn't showing real-time changes, even though we have an explicit perform_stream => true in our Logstash configuration. Note: we do get the latest data if we restart Logstash (which is running in a Docker container). Could this be a cross-region issue? DynamoDB is in us-east-1 while Logstash & Elasticsearch are in us-west-1.

  2. Upon restarting Logstash, the entire DynamoDB table appears to be duplicated in Elasticsearch: DynamoDB has an item count of around 70K+, while Elasticsearch shows more than double that number of searchable documents. Could this be because of the perform_stream => true setting?

  3. Intermittently the latest data can be seen, but it is sandwiched between older records, as if the data were fetched in some random order. Could this be due to multiple workers writing at the same time?

  4. We need the JSON message contents from DynamoDB as-is. However, we noticed that when we run Logstash, the output shows the data as "Stream Records". When we use log_format => "json_binary_as_text", we can see the JSON message as we require. Is this sufficient?

Following is our Logstash configuration:

input {
    dynamodb {
        endpoint => "dynamodb.us-east-1.amazonaws.com"
        streams_endpoint => "streams.dynamodb.us-east-1.amazonaws.com"
        view_type => "new_image"
        perform_scan => true
        perform_stream => true
        publish_metrics => true
        table_name => "here-we-have-dynamodb-table-name"
        log_format => "json_binary_as_text"
    }
}
output {
    elasticsearch {
        hosts => "here-we-have-our-elasticsearch-endpoint-which-is-in-us-west-1"
    }
}

NOTE: There are no errors in the logs (docker logs --follow container-name).
Any help on these issues is really appreciated.

I have never used the dynamodb plugin, so I cannot help with that. As it is not one of the official plugins, you may want to reach out to its creator.

The reason you are getting duplicates is that you are not specifying a document_id for the documents in the Elasticsearch output. Elasticsearch therefore automatically generates a new id every time a document is processed and inserts it multiple times. If you have a primary key, use it as the document_id, as this will cause Elasticsearch to update the existing record when you reprocess the data.
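As a rough sketch (untested, and assuming your DynamoDB table's primary key ends up on the event as a field called id; substitute whatever field the plugin actually produces), the output could look something like this:

output {
    elasticsearch {
        hosts => "here-we-have-our-elasticsearch-endpoint-which-is-in-us-west-1"
        # Derive the Elasticsearch document id from the table's primary key
        # ("id" here is a placeholder field name) so reprocessed items
        # overwrite the existing document instead of being indexed again
        # under a newly generated id.
        document_id => "%{id}"
    }
}

With a stable document_id, a full re-scan on restart becomes a series of updates rather than fresh inserts, so the document count should stay close to the table's item count.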
