Logstash 5.6.1 infinite loop? Encountered a retryable error

Upgraded to 5.6.1 this morning on all pieces of the ELK stack.

Logstash started outputting the following in the logs, and all message processing has stopped:

[2017-09-25T19:54:44,850][ERROR][logstash.outputs.elasticsearch] Encountered a retryable error. Will Retry with exponential backoff  {:code=>400, :url=>"http://es-data-int.ex.com:9200/_bulk", :body=>"{\"error\":{\"root_cause\":[{\"type\":\"parse_exception\",\"reason\":\"request body is required\"}],\"type\":\"parse_exception\",\"reason\":\"request body is required\"},\"status\":400}"}
[2017-09-25T19:54:47,042][ERROR][logstash.outputs.elasticsearch] Encountered a retryable error. Will Retry with exponential backoff  {:code=>400, :url=>"http://es-data-int.ex.com:9200/_bulk", :body=>"{\"error\":{\"root_cause\":[{\"type\":\"parse_exception\",\"reason\":\"request body is required\"}],\"type\":\"parse_exception\",\"reason\":\"request body is required\"},\"status\":400}"}
[2017-09-25T19:54:51,219][ERROR][logstash.outputs.elasticsearch] Encountered a retryable error. Will Retry with exponential backoff  {:code=>400, :url=>"http://es-data-int.ex.com:9200/_bulk", :body=>"{\"error\":{\"root_cause\":[{\"type\":\"parse_exception\",\"reason\":\"request body is required\"}],\"type\":\"parse_exception\",\"reason\":\"request body is required\"},\"status\":400}"}

These messages just continually loop and we stop receiving any messages from the logstash parsers.

From what version did you upgrade? All components?

All components were 5.2.0.

I've removed all configs from logstash except our filebeat input and the elasticsearch output to remove any variables.

input {
  beats {
    port => 5044
  }
}

output {
  elasticsearch {
    hosts         => ["http://es-data-int.ex.com:9200"]
    index         => "logstash-%{environment}-%{product}-%{role}-%{+YYYY.MM.dd}"
    document_type => "%{[@metadata][type]}"
  }
}

I also removed the %{variables} from the index name to see if they might be causing the issue, but the problem persisted.

We are in AWS using Amazon Linux for all servers. The ES data nodes are behind an ALB listening on port 9200.

Cross-posted to the "Undefined method `sanitized'" issue (@theuntergeek), as it seems to be specific to the elasticsearch output plugin.

I removed everything and did a clean install. That gave me the sanitize error, so I tried updating the plugins, and now I'm back to where I started above.

Definitely a strange one. ES is complaining that there's no HTTP request body, and from what I can tell there's no indication that Logstash would be responsible for such a thing. Any chance the ALB is modifying the request? Can you post an ASCII dump of a packet trace leaving the LS machine vs. hitting ES?
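One way to get that dump (a sketch; the interface name is an assumption) is tcpdump with ASCII payload printing, run first on the Logstash host and then on an ES data node, to compare whether the bulk request body survives the ALB:

```
# -A prints packet payloads as ASCII, -s 0 captures full packets (needs root).
# Run on the Logstash host, then again on a data node, and diff the payloads.
sudo tcpdump -i eth0 -A -s 0 'tcp port 9200'
```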

Looks like it's similar to what was going on here? https://github.com/logstash-plugins/logstash-output-elasticsearch/issues/612

Originally I had logstash going to the localhost es master nodes and still had the issue, so I'm pretty positive it has nothing to do with the ALB. The ALB was just an attempt at removing the lesser hardware master nodes as a variable and sending the requests directly to the data nodes.

@PhaedrusTheGreek what's the expected behavior if an error is retried but continues to fail? Retry 10 times then drop or anything?

So I rolled back to Logstash 5.2.0, and this seems to be an error we were already having; I just wasn't paying attention to it since it didn't block anything.

[2017-09-25T21:47:45,664][ERROR][logstash.outputs.elasticsearch] Got a bad response code from server, but this code is not considered retryable. Request will be dropped {:code=>400, :response_body=>"{\"error\":{\"root_cause\":[{\"type\":\"parse_exception\",\"reason\":\"request body is required\"}],\"type\":\"parse_exception\",\"reason\":\"request body is required\"},\"status\":400}"}
[2017-09-25T21:48:50,376][ERROR][logstash.outputs.elasticsearch] Got a bad response code from server, but this code is not considered retryable. Request will be dropped {:code=>400, :response_body=>"{\"error\":{\"root_cause\":[{\"type\":\"parse_exception\",\"reason\":\"request body is required\"}],\"type\":\"parse_exception\",\"reason\":\"request body is required\"},\"status\":400}"}

But now that 400 errors are retried and the request can never succeed, Logstash gets stuck.

Chad,

As of the 7.4.0 release of the plugin, the retry policy has changed significantly, and we have added a Dead Letter Queue (DLQ) to handle events that will probably never succeed.

If you have the DLQ enabled, events with a 400 response code should immediately be sent to this queue for handling.

In logstash.yml:

dead_letter_queue.enable: true

Then

input {
  stdin {}

  dead_letter_queue {
    path => "/opt/logstash-5.6.1/data/dead_letter_queue"
    commit_offsets => true
    tags => ["dlq"]
  }
}

output {
  if "dlq" in [tags] {
    file {
      path => "/var/log/logstash/dlq-output"
      codec => json_lines
    }
  } else {
    elasticsearch {}
  }
}

The only thing that concerns me is that you're saying a 400 error is being retried indefinitely, even though we advertise this:

If a DLQ is not enabled a log message will be emitted and the event will be dropped.

Can you confirm that this is the case? If so, we can investigate further internally.

First test is default installation of logstash 5.6.2 on Amazon Linux via the RPM repo.

logstash.yml
path.data: /var/lib/logstash
path.config: /etc/logstash/conf.d
path.logs: /var/log/logstash

**DLQ is set to whatever the default is (false?).**

Config is what I posted above: Filebeat input, Elasticsearch output.

After about 10 minutes I start getting the sanitize error:

[2017-09-26T16:45:23,521][ERROR][logstash.outputs.elasticsearch] Encountered an unexpected error submitting a bulk request! Will retry. {:error_message=>"undefined method `sanitized' for \"http://localhost:9200/_bulk\":String", :class=>"NoMethodError", :backtrace=>["/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-7.4.0-java/lib/logstash/outputs/elasticsearch/common.rb:249:in `safe_bulk'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-7.4.0-java/lib/logstash/outputs/elasticsearch/common.rb:222:in `safe_bulk'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-7.4.0-java/lib/logstash/outputs/elasticsearch/common.rb:119:in `submit'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-7.4.0-java/lib/logstash/outputs/elasticsearch/common.rb:87:in `retrying_submit'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-7.4.0-java/lib/logstash/outputs/elasticsearch/common.rb:38:in `multi_receive'", "/usr/share/logstash/logstash-core/lib/logstash/output_delegator_strategies/shared.rb:13:in `multi_receive'", "/usr/share/logstash/logstash-core/lib/logstash/output_delegator.rb:49:in `multi_receive'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:436:in `output_batch'", "org/jruby/RubyHash.java:1342:in `each'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:435:in `output_batch'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:381:in `worker_loop'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:342:in `start_workers'"]}

Stopping logstash and updating all plugins to latest.

Updated logstash-input-dead_letter_queue 1.1.0 to 1.1.1
Updated logstash-input-elasticsearch 4.0.5 to 4.0.6
Updated logstash-input-http_poller 3.3.3 to 3.3.4
Updated logstash-output-elasticsearch 7.4.0 to 7.4.2

Starting Logstash; after another ~10 mins or less I get:
[2017-09-26T16:57:17,652][ERROR][logstash.outputs.elasticsearch] Encountered a retryable error. Will Retry with exponential backoff {:code=>400, :url=>"http://localhost:9200/_bulk"}
[2017-09-26T16:57:22,364][ERROR][logstash.outputs.elasticsearch] Encountered a retryable error. Will Retry with exponential backoff {:code=>400, :url=>"http://localhost:9200/_bulk"}

Enabling DLQ and adding an output for it.
mkdir -p /opt/logstash/data/dead_letter_queue

Editing logstash.yml with DLQ:
dead_letter_queue.enable: true
path.dead_letter_queue: /opt/logstash/data/dead_letter_queue

Added the input and output as stated above. Started logstash:
[2017-09-26T17:09:43,309][INFO ][logstash.outputs.file ] Opening file {:path=>"/var/log/logstash/dlq-output"}
[2017-09-26T17:10:02,329][ERROR][logstash.outputs.elasticsearch] Encountered a retryable error. Will Retry with exponential backoff {:code=>400, :url=>"http://localhost:9200/_bulk"}
[2017-09-26T17:10:05,006][ERROR][logstash.outputs.elasticsearch] Encountered a retryable error. Will Retry with exponential backoff {:code=>400, :url=>"http://localhost:9200/_bulk"}

Checking /var/log/logstash/dlq-output I see some lines like:
{"request":["/api/v1/program_memberships/123456","/api/v1/program_memberships/123456"],"os_lib":"mobo","agent":["mobo/3.2.3 (iPhone; iOS 10.3.3; Scale/3.00)","\"mobo/3.2.3 (iPhone; iOS 10.3.3; Scale/3.00)\""],"role":"web","app_version":"3.2.3","auth":["-","-"],"ident":["-","-"],"os_minor":"3","os_major":"10","app_patch":"3","source":"/var/log/nginx/server.log","type":"server-nginx","@hostname":"server","log_type":"nginx","clientip":["1.2.3.4","1.2.3.4"],"@version":"1","timestamp":["26/Sep/2017:16:58:12 +0000","26/Sep/2017:16:58:12 +0000"],"product":"server","geoip":{"timezone":"America/New_York","ip":"1.2.3.4","latitude":35.1137,"continent_code":"NA","city_name":"Indian Trail","country_name":"United States","country_code2":"US","dma_code":517,"country_code3":"US","region_name":"North Carolina","location":{"lon":-80.6083,"lat":35.1137},"postal_code":"28079","region_code":"NC","longitude":-80.6083},"os":"iOS","verb":["GET","GET"],"team":"undef","message":"174.193.150.139 - - [26/Sep/2017:16:58:12 +0000] \"GET /api/v1/program_memberships/123456 HTTP/1.1\" 200 2846 \"-\" \"mobo/3.2.3 (iPhone; iOS 10.3.3; Scale/3.00)\"","app_major":"3","app_minor":"2","tags":["_geoip_lookup_failure","dlq"],"os_branch":"3","environment":"production","device_scale":"3.00","@timestamp":"2017-09-26T17:05:54.914Z","build":"","bytes":["2846","2846"],"name":"Mobile Safari UI/WKWebView","request_path":["/api/v1/program_memberships/123456","/api/v1/program_memberships/123456","/api/v1/program_memberships/123456"],"os_name":"iOS","httpversion":["1.1","1.1"],"device":"iPhone","status":["200","200"]}

Logstash is now giving the retryable error message over and over and no longer processing any new messages.

So the events are failing with 400 and being punted to the DLQ, as expected(?).

Maybe something is wrong with Elasticsearch. Try posting a bulk request to ES manually:

POST _bulk
{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "field1" : "value1" }

You can use a tool like Sense or Postman for testing.
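For the record, the bulk endpoint expects newline-delimited JSON terminated by a final newline, and an empty body is exactly what produces a `request body is required` 400. A minimal sketch of building and sending the same request from the command line (the host and index values are assumptions):

```shell
# Build a minimal NDJSON bulk body; the trailing newline is required.
printf '%s\n%s\n' \
  '{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }' \
  '{ "field1" : "value1" }' > /tmp/bulk.ndjson

# Send it (hypothetical host). Sending an empty body instead reproduces
# the "request body is required" 400:
# curl -s -H 'Content-Type: application/x-ndjson' \
#      --data-binary @/tmp/bulk.ndjson 'http://localhost:9200/_bulk'
```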

Not sure I understand. The error happens with or without DLQ enabled.

If the DLQ is enabled they are written to the DLQ, but they continue to be retried; they are never dropped.

I've reverted to 5.2.0 because it doesn't have the retry mechanism, and I can see the 400 requests come in and get dropped.

The newer version just doesn't seem to ever drop the erroneous messages.

{
  "took": 169,
  "errors": false,
  "items": [
    {
      "index": {
        "_index": "test",
        "_type": "type1",
        "_id": "1",
        "_version": 1,
        "result": "created",
        "_shards": {
          "total": 2,
          "successful": 2,
          "failed": 0
        },
        "created": true,
        "status": 201
      }
    }
  ]
}

@Hush077,

Can I see the Logstash configuration that you had when DLQ wasn't working for you?

Of course sir. I removed all filters and configurations when I was testing it.

# logstash.yml
dead_letter_queue.enable: true
path.dead_letter_queue: /opt/logstash/data/dead_letter_queue

# input.conf
# All client-side logs come in via this method
input {
  beats {
    port => 5044
  }

  stdin {}

  dead_letter_queue {
    path => "/opt/logstash/data/dead_letter_queue"
    commit_offsets => true
    tags => ["dlq"]
  }
}

# output.conf
output {
  if "dlq" in [tags] {
    file {
      path => "/var/log/logstash/dlq-output"
      codec => json_lines
    }
  } else {
    elasticsearch {
      hosts         => ["localhost:9200"]
      index         => "logstash-%{+YYYY.MM.dd}"
      document_type => "%{[@metadata][type]}"
    }
  }
}

Here is an example of one of our Filebeat prospectors, in case it offers any clues.

---
filebeat:
  prospectors:
    - input_type: log
      paths:
        - /var/log/app_name/web/*.log
        - /var/log/app_name/web/*.err
      encoding: plain
      fields:
        environment: production
        client: production
        product: app_name
        role: web
        team: undef
        log_type: rails
      fields_under_root: true
      document_type: app_name
      scan_frequency: 10s
      harvester_buffer_size: 16384
      max_bytes: 10485760


      tail_files: false

      # Experimental: If symlinks is enabled, symlinks are opened and harvested. The harvester is openening the
      # original for harvesting but will report the symlink name as source.
      #symlinks: false

      backoff: 1s
      max_backoff: 10s
      backoff_factor: 2

      # Experimental: Max number of harvesters that are started in parallel.
      # Default is 0 which means unlimited
      #harvester_limit: 0

      ### Harvester closing options

      # Close inactive closes the file handler after the predefined period.
      # The period starts when the last line of the file was, not the file ModTime.
      # Time strings like 2h (2 hours), 5m (5 minutes) can be used.
      close_inactive: 5m


      # Close renamed closes a file handler when the file is renamed or rotated.
      # Note: Potential data loss. Make sure to read and understand the docs for this option.
      close_renamed: false

      # When enabling this option, a file handler is closed immediately in case a file can't be found
      # any more. In case the file shows up again later, harvesting will continue at the last known position
      # after scan_frequency.
      close_removed: true

      # Closes the file handler as soon as the harvesters reaches the end of the file.
      # By default this option is disabled.
      # Note: Potential data loss. Make sure to read and understand the docs for this option.
      close_eof: false

      ### State options

      # Files for the modification data is older then clean_inactive the state from the registry is removed
      # By default this is disabled.
      clean_inactive: 0

      # Removes the state for file which cannot be found on disk anymore immediately
      clean_removed: true

      # Close timeout closes the harvester after the predefined time.
      # This is independent if the harvester did finish reading the file or not.
      # By default this option is disabled.
      # Note: Potential data loss. Make sure to read and understand the docs for this option.

@PhaedrusTheGreek Not sure if it's related or not, but maybe Filebeat / Logstash is doing something similar? https://github.com/logstash-plugins/logstash-input-elasticsearch/issues/62 I updated all of my agents to the latest Filebeat across our entire platform today to see if that made any difference in the 400 errors, but it hasn't; I'm still getting them.

I have the same issue. I originally had Logstash 5.6.2 and got the sanitized error message; I upgraded to 5.6.3 and now get the reported error message.

[logstash.outputs.elasticsearch] Encountered a retryable error. Will Retry with exponential backoff {:code=>400, :url=>

I'm importing data from SQL using the JDBC input. Could this be a case where a message is too big for Elasticsearch and it fails? I have another Logstash input running with smaller document sizes, and it processes for hours before eventually freezing. That is probably a separate issue, but resolving this one is crucial.

BTW, I have a 3-node cluster and the error message appears for all of them.

I'm having the same exact issue here... has anyone found a solution?
Glad I found this thread, because I've been troubleshooting this for days. I thought it was a problem with lingering CLOSE_WAITs chewing up all the file handles, but that seemed to be a side effect.

Eventually, on this VM with 16 GB of memory and 12 GB allocated to Logstash, the system is starved of memory.

[root@es-broker bin]# ./logstash-plugin
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000004d4cc0000, 12535988224, 0) failed; error='Cannot allocate memory' (errno=12)

There is insufficient memory for the Java Runtime Environment to continue.

Native memory allocation (mmap) failed to map 12535988224 bytes for committing reserved memory.

An error report file with more information is saved as:

/usr/share/logstash/bin/hs_err_pid10509.log
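That allocation failure is the JVM failing to reserve the 12 GB heap up front rather than a Logstash bug per se; with 12 of 16 GB committed to the heap, little is left for the OS, off-heap buffers, and other processes. A smaller heap in jvm.options may be worth trying (the RPM-install path and the sizing below are only illustrative assumptions):

```
# /etc/logstash/jvm.options — hypothetical sizing for a 16 GB VM
-Xms4g
-Xmx4g
```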