I'm using the http_poller Logstash input plugin to ingest a logfile into Elasticsearch, but every time it polls, it ingests the whole file again. Config file:
input {
  http_poller {
    urls => {
      test => {
        method => get
        url => "http://xxx.xxx.xx/api/log"
        headers => {
          "Accept" => "application/json"
          "x-xx-api" => "xxxxx"
        }
      }
    }
    request_timeout => 20
    # Supports "cron", "every", "at" and "in" schedules by rufus scheduler
    schedule => { cron => "* * * * * UTC" }
    codec => "json_lines"
    # A hash of request metadata info (timing, response headers, etc.) will be sent here
    metadata_target => "http_poller_metadata"
  }
}
output {
  elasticsearch {
    hosts => [ "192.168.1.174:9200" ]
    index => "xx-testing-%{+YYYY.MM}"
  }
  stdout {
    codec => rubydebug
  }
}
The log file looks like this:
{"@message":"Successful api request","@timestamp":"2018-01-11T10:11:00.260Z","@fields":{"origin":"xx.xx.xx.xx","environment":"production_beta","label":"askquestiongui","level":"info"}}
{"@message":"Successful api request","@timestamp":"2018-01-11T10:12:00.317Z","@fields":{"origin":"xx.xx.xx.xx","environment":"production_beta","label":"askquestiongui","level":"info"}}
If I use the "json" codec, I only get the first log line once; the "json_lines" codec writes the complete logfile to Elasticsearch each time. Please advise.
Another issue for me is that I have to pull the log file from a public server to my Elasticsearch server, which is on a private network. I cannot use Filebeat to push the data to Elasticsearch.
Say you use a query string like http://xxx.xxx.xx/api/log?lines=100, and the server, having remembered that the previous call served lines 0 to 99, serves lines 100 to 199 this time. The http_poller input is not stateful: it has no facility to remember what the last processed line number was and adjust the query string accordingly, for example.
We are aiming at having Elastic on a public server, but I cannot use Filebeat since we cannot install anything in the environment where the logfiles are located. But our logging system uses the Winston library, which can send logging messages directly to Logstash, so whenever I get a public server running, I think that may be an excellent way to go.
I just read the Winston docs and some of the code. It looks like it will try to dispatch the log line string to a destination immediately. The HTTP transport is acting as a client not a server AFAICT.
I don't see how you are achieving persistence - via a Winston File transport? If so, then the file is a persistent buffer. And then, what does the LS http_poller connect to in order to retrieve the log lines from those files?
I ask these questions not out of malice or because I doubt your solution, but because I and others here can come to appreciate an alternative method to ship log lines from the edge.
Regarding Elasticsearch clusters in the public zone: if you have not already done so, you must secure the cluster.
Regarding your future plans.
But our logging system uses the Winston library which can send logging messages directly to Logstash
By this I think you mean the Winston HTTP transport (client mode) talking to an LS http input (server mode). If so, there is a problem with buffering: LS will have to be up 24/7. How does the Winston client transport behave when the HTTP server is not available? Consider a load balancer between Winston and 2 or 3 LS instances (haproxy or nginx). If you do use a load balancer, remember that consecutive log lines will be sent to different LS instances, so ordering is not guaranteed.
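For reference, the server side of that arrangement would be the Logstash http input. A minimal sketch, assuming Winston POSTs JSON log entries; the port number is a placeholder:

```
input {
  http {
    # Winston's HTTP transport would POST log entries to this port.
    # Port 8080 is an assumption - match it to the Winston transport config.
    port => 8080
    codec => "json"
  }
}
```

This is the piece that must be reachable 24/7 (or fronted by a load balancer) for the Winston-to-Logstash approach to work without losing lines.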
Thanks for the input.
We have a few limitations in this project that currently cause us some problems, but I think the method I'm using is a fairly good way to overcome those issues: having Logstash use the http_poller input with a private API key to fetch the data. Performance-wise, I don't see much difference between having Logstash pull data from a server and having the log server push the data to the Elastic server. Right now this setup will only run for a few weeks as a proof of concept. If we launch it properly, we will have to scale everything a lot anyway.