Elasitcsearch-ruby raises "Cannot get new connection from pool" error


(Repeatedly) #1

Hi folks,

Original discussion is here: Elasticsearch::Transport Cannot get new connection from pool

Recently, some users reported fluent-plugin-elasticsearch failed to import logs to Elasticsearch.
I checked their logs and the cause is elasticsearch-ruby raises "Cannot get new connection from pool" when perform import requests.

This error seems to happen at this point: https://github.com/elastic/elasticsearch-ruby/blob/7ba1fbdc2285a51fe6447b9241dfc871579dc118/elasticsearch-transport/lib/elasticsearch/transport/transport/base.rb#L193
And here is the fluent-plugin-elasticsearch code: https://github.com/uken/fluent-plugin-elasticsearch/blob/e44f323c99f3a5aca29b569023b846c1111f87da/lib/fluent/plugin/out_elasticsearch.rb

I want to know this problem is elasticsearch-ruby or fluent-plugin-elasticsearch.
If former, I want to know when this error happens and how to recover in the code.
If latter, where is the wrong and how to fix?

Currently, restarting fluentd is needed...

Thanks,


Elasticsearch::Transport Cannot get new connection from pool
(Robert Firek) #2

We found the reason behind this issue. Problem is in configuration of Elasticsearch Ruby client (class https://github.com/elastic/elasticsearch-ruby/blob/master/elasticsearch-transport/lib/elasticsearch/transport/transport/base.rb).

When we have heavy traffic in fluentd we request Elasticsearch every second. Some of this request are timeouted by elasticsearch so this means that one of the connection in the connection pool is marked as dead.

Problem is how we resurrect connection. Class "base.rb" class in method "get_connection" tries to do that like this:

resurrect_dead_connections! if Time.now > @last_request_at + @resurrect_after

By default @resurrect_after is set to 60s (and Fluentd plugin for elasticsearch uses this value),
and @last_request_at is reset on each request in method "perform_request":

ensure @last_request_at = Time.now end

If we have high traffic we request elasticsearch every 1s and some of this request slowly kill connections. At some point we don't have enough connection, but we still try to post request, but @resurrect_after doesn't allow to resurrect any connection.

We fixed that by forking fluentd elasticsearch plugin and introducing resurrect_after as parameter. When we setup this to 0s we try to resurrect connection on every request. It is very aggressive setup, but I think we will try to find better based on our experiments.


(Mark Walkom) #3

It'd be awesome if you made a PR with this enhancement :smiley:


(Robert Firek) #4

Change in the fluent-plugin-elasticsearch is already available in version >=1.2.0.
Plugin contributors responded very quickly :slight_smile:


(Repeatedly) #5

Sorry for the delay.

Robert,

Thanks for awesome investigation and the patch!
If I get same report from users, I will suggest your change.


(Robert Firek) #6

The problem with reoccurring exception:

temporarily failed to flush the buffer. next_retry=2015-12-12 23:56:36 +0000 error_class="Elasticsearch::Transport::Transport::Error" error="Cannot get new connection from pool." plugin_id="output_to_elasticsearch"

can be triggered also by plugin setting "reload_connections" (default set to true) and occurs only for Amazon Elasticsearch Service

By default Elasticsearch Ruby client reloads connections after 10_000 requests. The client use endpoint _nodes from Elasticsearch REST API to obtain all available nodes.
Unfortunately Amazon Elasticsearch Service doesn't implement this function in the same way as the original Elasticsearch. Amazon doesn't add any information about addresses and after reload our connection pool to Amazon ES is empty and it is impossible to reload it at all.

Example Elasticsearch Output (contains address data):

{
  "cluster_name": "elasticsearch",
  "nodes": {
    "daaweaw-DASDAD123131": {
      "name": "Destiny",
      "transport_address": "inet[/172.17.0.3:9300]",***
      "host": "249885187c52",
      "ip": "172.17.0.3",
      "version": "1.7.4",
      "build": "0d3159b",
      "http_address": "inet[/172.17.0.3:9200]",
      "http": {
        "bound_address": "inet[/0:0:0:0:0:0:0:0:9200]",
        "publish_address": "inet[/172.17.0.3:9200]",
        "max_content_length_in_bytes": 104857600
      }
    }
  }
}

Amazon Elasticsearch Service output (no address data):

{
  "cluster_name": "123123123123:schema",
  "nodes": {
    "ABC": {
      "build": "62ff986",
      "version": "1.5.2",
      "name": "Marduks"
    },
    "BCD": {
      "build": "62ff986",
      "version": "1.5.2",
      "name": "Fern"
    },
    "EDF": {
      "build": "62ff986",
      "version": "1.5.2",
      "name": "Emilly"
    }
  }
}

Code in Elasticsearch Ruby client which is the cause of this issue:

def reload_connections!
    hosts = sniffer.hosts
    __rebuild_connections :hosts => hosts, :options => options
   self
rescue SnifferTimeoutError
    logger.error "[SnifferTimeoutError] Timeout when reloading connections." if logger
    self
end

sniffer.hosts tries to obtain nodes data from _nodes endpoint.


(Pablo Zed) #7

Still an issue over Amazon Elasticsearch Service. The only workaround that i found is restart the service everyday or every hour through crontab

crontab -e

and then write down something like

0 4 * * * service td-agent restart +5

In order to restart the service all night at 04:05. This is just a 'move on' solution...a hack.


(Robert Firek) #8

@PabloZed What error do you have?


(Pablo Zed) #9

2015-12-12 05:52:14 +0000 fluent.warn: {"next_retry":"2015-12-12 14:19:11
+0000","error_class":"Elasticsearch::Transport::Transport::Error","error":"Cannot
get new connection from
pool.","plugin_id":"object:3ffb5dfb2578","message":"temporarily failed to
flush the buffer. next_retry=2015-12-12 14:19:11 +0000
error_class="Elasticsearch::Transport::Transport::Error"
error="Cannot get new connection from pool
."
plugin_id="object:3ffb5dfb2578""}

(I upgrade to 1.2.0) BUT I am using, AWS elastic search service.

aws-sdk-v1 (1.64.0)

elasticsearch (1.0.14)
elasticsearch-api (1.0.14)
elasticsearch-transport (1.0.14)

fluent-logger (0.4.10)
fluent-mixin-config-placeholders (0.3.0)
fluent-mixin-plaintextformatter (0.2.6)
fluent-mixin-rewrite-tag-name (0.1.0)
fluent-plugin-elasticsearch (1.2.0)
fluent-plugin-geoip (0.5.1)
fluent-plugin-mongo (0.7.10)
fluent-plugin-rewrite-tag-filter (1.4.1)
fluent-plugin-s3 (0.5.9)
fluent-plugin-scribe (0.10.14)
fluent-plugin-script (0.0.2)
fluent-plugin-td (0.10.27)
fluent-plugin-td-monitoring (0.2.1)
fluent-plugin-webhdfs (0.4.1)

*I think that is a AWS issue, * that returns a different JSON for the same
API, (no address data) as you said before. It's a pity... because if they
change the standard behavior, a lot of plugins are not going to work.


Filebeat : Badly Stuck with Connection reset by peer error
(Robert Firek) #10

You can setup fluent-plugin-elasticsearch to avoid this issue. In my project we set option:

    ##############################################################
    # !!! DO NOT SET THIS PROPERTY TO true FOR Amazon Elasticsearch Service !!!
    # Elasticsearch service doesn't return addresses of nodes and Elasticsearch client
    #         fails to reload connections properly.
    # ############################################################
    reload_connections false

This will stop reloading your connection pool. By default it was reloading connection after each 10_000 requests.

For me it works fine so far.


(Pablo Zed) #11

Yes, you are right Robert!. Thank you so much.

With fluent-plugin-elasticsearch 1.2.0 plus reload_connections set to false and AWS elasticsearch service everything works perfect!.

My conf looks like this

type copy type elasticsearch host {{ config['es']['nginx']['endpoint'] }} port {{ config['es']['nginx']['port'] }} include_tag_key true tag_key @log_name logstash_format true **reload_connections false** flush_interval 10s

....

Great stuff!


(Tom Jennings) #12

Hi,

I came across this trying to fix the same issue but in Logstash the fix was slightly different for me. I've added it to help anyone else who's struggling with the same problem.

:message=>"Cannot get new connection from pool.", :class=>"Elasticsearch::Transport::Transport::Error", :backtrace=>["/opt/logstash/vendor/bundle/jruby/1.9/gems/elasticsearch-transport-1.0.15/lib/elasticsearch/transport/transport/base.rb:193:in `perform_request'",

ElasticSearch would work for a bit but then stop ingesting data.

Failing configuration:

output {
  elasticsearch {
    hosts => ["https://search-*.us-east-1.es.amazonaws.com"]
    sniffing => true
    manage_template => false
    index => "%{[@metadata][beat]}-%{+YYYY.MM.dd}"
    document_type => "%{[@metadata][type]}"
  }
}

Logstash tries to get a list of hosts from Elasticsearch but AWS's implementation has changed the format of the data returned. For more details on the specifics. https://forums.aws.amazon.com/thread.jspa?threadID=222600

Working configuration:

output
 {
  elasticsearch {
    hosts => ["https://search-*.us-east-1.es.amazonaws.com"]
    manage_template => false
    index => "%{[@metadata][beat]}-%{+YYYY.MM.dd}"
    document_type => "%{[@metadata][type]}"
  }
}

tomwj


Filebeat : Badly Stuck with Connection reset by peer error
(system) #13