I'm using the http_poller Logstash input plugin to ingest a logfile into Elasticsearch, but every time it polls, it ingests the whole file again. Config file:
input {
  http_poller {
    urls => {
      test => {
        method => get
        url => "http://xxx.xxx.xx/api/log"
        headers => {
          "Accept" => "application/json"
          "x-xx-api" => "xxxxx"
        }
      }
    }
    request_timeout => 20
    # Supports "cron", "every", "at" and "in" schedules by rufus scheduler
    schedule => { cron => "* * * * * UTC" }
    codec => "json_lines"
    # A hash of request metadata info (timing, response headers, etc.) will be sent here
    metadata_target => "http_poller_metadata"
  }
}
output {
  elasticsearch {
    hosts => [ "192.168.1.174:9200" ]
    index => "xx-testing-%{+YYYY.MM}"
  }
  stdout {
    codec => rubydebug
  }
}
The log file looks like this:
{"@message":"Successful api request","@timestamp":"2018-01-11T10:11:00.260Z","@fields":{"origin":"xx.xx.xx.xx","environment":"production_beta","label":"askquestiongui","level":"info"}}
{"@message":"Successful api request","@timestamp":"2018-01-11T10:12:00.317Z","@fields":{"origin":"xx.xx.xx.xx","environment":"production_beta","label":"askquestiongui","level":"info"}}
If I use the "json" codec, I only get the first log line once; the "json_lines" codec writes the complete logfile to Elasticsearch each time. Please advise.
Another issue for me is that I have to pull the log file from a public server to my Elasticsearch server, which is on a private network. I cannot use Filebeat to push the data to Elasticsearch.
Say you use a query string like http://xxx.xxx.xx/api/log?lines=100, and the server, having remembered that the previous call served lines 0 to 99, serves lines 100 to 199 this time. The http_poller input is not stateful: it has no facility to remember what the last processed line number was and adjust the query string accordingly, for example.
We are aiming at having Elastic on a public server, but I cannot use Filebeat since we cannot install anything in the environment where the logfiles are located. But our logging system uses the Winston library, which can send logging messages directly to Logstash, so whenever I get a public server running, I think that may be an excellent way to go.
I just read the Winston docs and some of the code. It looks like it will try to dispatch the log line string to a destination immediately. The HTTP transport is acting as a client not a server AFAICT.
I don't see how you are achieving persistence - via a Winston File transport? If so, then the file is a persistent buffer. And then, what does the LS http_poller connect to in order to retrieve the log lines from those files?
I ask these questions not out of malice or because I doubt your solution, but because I and others here can come to appreciate an alternative method to ship log lines from the edge.
Regarding Elasticsearch clusters in the public zone: if you have not already done so, you must secure the cluster.
Regarding your future plans.
But our logging system uses the Winston library which can send logging messages directly to Logstash
By this I think you mean the Winston HTTP transport (client mode) talking to an LS http input (server mode). If so, there is a problem with buffering: LS will have to be up 24/7. How does the Winston client transport behave when the HTTP server is not available? Consider a load balancer between Winston and 2 or 3 LS instances (haproxy or nginx). If you do use a load balancer, remember that consecutive log lines will be sent to different LS instances, so ordering is not guaranteed.
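For reference, the server side of that arrangement would be the Logstash http input. A minimal sketch, assuming Winston POSTs JSON log entries; the port number is a placeholder:

```
input {
  http {
    # Winston's HTTP transport would POST log entries to this port.
    # Port 8080 is an assumption - match it to the Winston transport config.
    port => 8080
    codec => "json"
  }
}
```

This is the piece that must be reachable 24/7 (or fronted by a load balancer) for the Winston-to-Logstash approach to work without losing lines.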
Thanks for the input.
We have a few limitations in this project that currently cause us some problems, but I think the method I'm using is a fairly good way to overcome those issues: having Logstash use the http_poller input with a private API key to fetch the data. Performance-wise, I don't see much difference between having Logstash pull data from a server and having the log server push the data to the Elastic server. Right now this setup will only run for a few weeks as a proof of concept. If we launch it properly, we will have to scale everything a lot anyway.