Filebeat not connecting directly to Elasticsearch from particular machine

Beats: latest
OS: Windows Server 2012 R2 Datacenter

I am able to get other machines working fine but for some reason, I ran into issues with one particular machine.

Here is the filebeat.yml:

################### Filebeat Configuration Example #########################

############################# Filebeat ######################################
filebeat:
  # List of prospectors to fetch data.
  prospectors:
    # Each - is a prospector. Below are the prospector specific configurations
    -
      # Paths that should be crawled and fetched. Glob based paths.
      # To fetch all ".log" files from a specific level of subdirectories
      # /var/log/*/*.log can be used.
      # For each file found under this path, a harvester is started.
      # Make sure no file is defined twice as this can lead to unexpected behaviour.
      paths:
        - E:\approot\logs\*.log
        #- c:\programdata\elasticsearch\logs\*

      # Type of the files. Based on this the way the file is read is decided.
      # The different types cannot be mixed in one prospector
      #
      # Possible options are:
      # * log: Reads every line of the log file (default)
      # * stdin: Reads the standard in
      input_type: log

############################# Output ##########################################

# Configure what outputs to use when sending the data collected by the beat.
# Multiple outputs may be used.
output:

  ### Elasticsearch as output
  elasticsearch:
    # Array of hosts to connect to.
    # Scheme and port can be left out and will be set to the default (http and 9200)
    # In case you specify an additional path, the scheme is required: http://localhost:9200/path
    # IPv6 addresses should always be defined as: https://[2001:db8::1]:9200
    hosts: ["http://42.10.20.13:9200"]

    # Optional protocol and basic auth credentials.
    protocol: "http"
    username: "es_admin"
    password: "notrealpassword"

    # Optional index name. The default is "filebeat" and generates
    # [filebeat-]YYYY.MM.DD keys.
    index: "paas"

    # A template is used to set the mapping in Elasticsearch
    # By default template loading is disabled and no template is loaded.
    # These settings can be adjusted to load your own template or overwrite existing ones
    template:

      # Template name. By default the template name is filebeat.
      #name: "filebeat"

      # Path to template file
      path: "filebeat.template.json"

############################# Logging #########################################

# There are three options for the log output: syslog, file, stderr.
# Under Windows systems, the logs are sent to the file output by default,
# under all other systems to syslog by default.
logging:

  # To enable logging to files, to_files option has to be set to true
  files:
    # The directory where the log files will be written to.
    path: D:\Program Files\Filebeat\logs

    # The name of the files where the logs are written to.
    #name: mybeat

    # Configure log file size limit. If limit is reached, log file will be
    # automatically rotated
    rotateeverybytes: 10485760 # = 10MB

    # Number of rotated log files to keep. Oldest files will be deleted first.
    #keepfiles: 7

  # Enable debug output for selected components. To enable all selectors use ["*"]
  # Other available selectors are beat, publish, service
  # Multiple selectors can be chained.
  #selectors: [ ]

  # Sets log level. The default log level is error.
  # Available log levels are: critical, error, warning, info, debug
  level: info


I keep getting these errors in the log for Filebeat:

INFO backoff retry: 1m0s
2016-06-06T20:07:27Z INFO Connecting error publishing events (retrying): Head http://42.10.20.13:9200: dial tcp 40.79.43.133:9200: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
2016-06-06T20:07:27Z INFO send fail
2016-06-06T20:07:27Z INFO backoff retry: 1m0s
2016-06-06T20:08:48Z INFO Connecting error publishing events (retrying): Head http://42.10.20.13:9200: dial tcp 40.79.43.133:9200: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
2016-06-06T20:08:48Z INFO send fail
2016-06-06T20:08:48Z INFO backoff retry: 1m0s
2016-06-06T20:10:09Z INFO Connecting error publishing events (retrying): Head http://42.10.20.13:9200: dial tcp 40.79.43.133:9200: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
2016-06-06T20:10:09Z INFO send fail
2016-06-06T20:10:09Z INFO backoff retry: 1m0s

I've got other machines reporting just fine, but this is the first one this is happening on.

Any thoughts?

Note: this is happening only on an Azure PaaS worker role.

One thing I notice is that if I access the VIP to get the cluster details via ?pretty, it initially challenges me for credentials and returns the JSON, but any time after that, if I try to refresh, I get 'the page cannot be displayed'.

It does not seem to persist the connection to the cluster even after successful authentication. I have a feeling this might be causing a lack of connectivity to ES and producing those errors.

Based on the errors you posted above, I agree that it seems to be a connectivity issue to Elasticsearch. Since you can also reproduce this manually, that is probably the best place to start investigating what is going wrong with the connection between these two servers. It does not seem like a Filebeat-specific issue.

Can you run Filebeat with the elasticsearch output in debug mode using -v -d 'elasticsearch'? By default, HTTP connections are reused (if supported by the endpoint) and only closed after some (unconfigurable) time if no HTTP request is made.

Can you provide a trace via tcpdump between filebeat and elasticsearch for us to see what happens when connecting to elasticsearch?

I have discovered that there is an inconsistency pinging from that PaaS instance to the VIP of the ES cluster. Very strange how it only occurs on PaaS and not IaaS. Anyhow, it is Azure-platform related.

Actually, I made a lot of changes, and now I have it trying to connect to an internal load balancer (ILB) IP instead of a public VIP. I ran psping against the ILB IP 50 times consecutively without any timeouts.

Now I try connecting Filebeat to the ILB, and I am seeing a different error in the logs:

ERR Failed to perform any bulk index operations: Post http://10.0.1.20:9200/_bulk: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2016-06-10T19:43:53Z INFO Error publishing events (retrying): Post http://10.0.1.20:9200/_bulk: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2016-06-10T19:43:53Z INFO send fail
2016-06-10T19:43:53Z INFO backoff retry: 1s
2016-06-10T19:45:24Z ERR Failed to perform any bulk index operations: Post http://10.0.1.20:9200/_bulk: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2016-06-10T19:45:24Z INFO Error publishing events (retrying): Post http://10.0.1.20:9200/_bulk: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2016-06-10T19:45:24Z INFO send fail
2016-06-10T19:45:24Z INFO backoff retry: 2s

Steffen,

I do not see any output from running it this way:

Oh, Windows. Is the ' character passed as-is to CLI tools, or is it stripped as in Unix-based shells? Maybe -d elasticsearch works. Can you try with -d '*' instead? (How can one pass * as a string to command-line tools on Windows?)

Alternatively, you can enable debug output in filebeat.yml via the selectors setting in the logging section.
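
A minimal sketch of what that logging section could look like, based on the configuration posted above. It assumes that the selectors only produce output once the level is set to debug, so leaving level at info may show no extra messages. The path is the one from the original post.

```yaml
logging:
  # Assumption: debug selectors only take effect at debug level.
  level: debug
  # "*" enables all selectors, as described in the stock config comments.
  selectors: ["*"]
  files:
    # Log directory from the original configuration.
    path: D:\Program Files\Filebeat\logs
```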

I enabled the selector using ["*"] and the log file looks the same. Same messages as I mentioned.

OK. Looks like a general connectivity issue. The output complains multiple times about the request timing out. Default timeout is 30s (it's huuuuuge). Is Elasticsearch really reachable? Are POST operations allowed?

I installed 'Postman' for Chrome and was able to do both a GET and a POST in 5 seconds or less. They both came back with results rather quickly. I don't know why Filebeat would be timing out.

And to add more, this is now happening on IaaS as well as PaaS. I notice the Filebeat log on one of my web servers now has that timeout message.

One thing that worries me is that you wrote 5s or less. For most basic GET or POST requests I would expect the response time to be in ms. It should not matter for Filebeat, but I felt it was worth mentioning. Are the machines with Filebeat and Elasticsearch in the same data center? Which version of Elasticsearch are you using?

What kind of queries did you send? 5s is huge, given your requests are almost empty. Filebeat tries to push 2048 events by default. You can try to increase the timeout in Filebeat, but anything more than a few ms of response time for http://es_host:9200 is ridiculous.
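
For reference, a sketch of raising the timeout on the Elasticsearch output. It assumes the timeout option of the Filebeat 1.2 elasticsearch output, given in seconds. The host is the ILB address from the logs above, and 120 is an arbitrary illustrative value, not a recommendation.

```yaml
output:
  elasticsearch:
    # ILB address taken from the error messages above.
    hosts: ["http://10.0.1.20:9200"]
    # Assumption: HTTP request timeout in seconds; 120 is only an example.
    timeout: 120
    # Other settings (username, password, index, template) stay as they are.
```

A longer timeout only hides slow responses, so it is a diagnostic step rather than a fix.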

Actually, it does GET and POST very quickly. It takes about 34ms.

Can you share some more details on the versions?

Filebeat is the latest version that is available for download. ES is at 2.2.1

I assume that means version 1.2.3. I'm stating the version explicitly because otherwise this post will no longer be accurate in the future.

TBH I'm really stuck on what else to recommend, especially as it seems to work as expected on other machines. One of my last ideas would be to install Elasticsearch on the same machine; if that works, we could at least have some confidence that this is a network issue.

One other thing that came to mind: could it be that the load balancer has issues with large bulk requests, so that all the small requests work as expected but larger bulk requests fail? To see if that is the issue, could you try setting bulk_max_size to a very low value, even 1? https://www.elastic.co/guide/en/beats/filebeat/1.2/elasticsearch-output.html#_bulk_max_size This will have a big performance impact, but I think it is worth testing.
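
A minimal sketch of that test, to be merged into the existing output section. The bulk_max_size option is the one documented at the link above. The host is the ILB address from the logs and is only an assumption about which endpoint is being tested.

```yaml
output:
  elasticsearch:
    # ILB address taken from the error messages above.
    hosts: ["http://10.0.1.20:9200"]
    # Send one event per bulk request, for testing only (slow).
    bulk_max_size: 1
    # Other settings (username, password, index, template) stay as they are.
```

If single-event requests succeed while the default batch size times out, that points at the load balancer or network path rather than at Filebeat itself.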

Yes, Filebeat is 1.2.3

I will try that large bulk request test.

Actually, this happens on both PaaS (cloud services) and, even more so now, on IaaS VMs (such as SharePoint web front ends).

I have no idea what is going on here. The only other test I can think of is trying a different log forwarder, such as nxlog, but unfortunately nxlog does not support auth (for Shield auth to ES).