Webpage Input Plugin

Hi,

I'm wondering if there is a way to input webpage data into Elasticsearch using a Logstash pipeline.
I have a webpage that is publicly accessible, like website/logs/access.log. What input plugin should I use?

Thanks,
Salma

Are you trying to ingest IIS/Samba logs?

Hi wwalker,

Thanks for your reply.
I don't know what IIS/Samba logs are. The logs that I'm trying to ingest are web access logs generated on a specific webpage.

IIS and Samba are popular applications for hosting websites; IIS is native to Windows Server, and Samba is used a lot on Linux. I haven't done it yet, but I imagine you'd use the file input and then the grok filter to pull out the fields you want processed. I wish I could help you more, but grok is something I've only read through briefly and was left scratching my head.
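Untested, but something like this is roughly what I have in mind, assuming an Apache-style access log sitting at a hypothetical path like /var/log/access.log and assuming the stock COMBINEDAPACHELOG grok pattern matches its format:

input {
  file {
    path => "/var/log/access.log"    # hypothetical path to a local copy of the log
    start_position => "beginning"    # read the whole file on the first run
    sincedb_path => "/dev/null"      # don't remember the read position (testing only)
  }
}

filter {
  grok {
    # COMBINEDAPACHELOG ships with Logstash and covers the common Apache combined format
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}

output {
  stdout { codec => rubydebug }
}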

I don't own the website; I found the link on GitHub when I was searching for publicly accessible web access logs.
Here is the link that I want to stream the logs from:
http://www.almhuette-raith.at/apache-log/access.log
So I'm not sure if this website is hosted using IIS or Samba.

I tried to download the web log from the website and ingest it using the file input plugin, and it worked fine for me, but the problem is that the events are not in real time, since I only downloaded a copy of this website's web access log.

What I want to do is make Logstash connect directly to the access-log webpage and stream the data in real time.

I've tried this input, but it didn't collect any data for me:

input {
  http_poller {
    urls => {
      logs => "http://www.almhuette-raith.at/apache-log/access.log"
    }
    schedule => { cron => "* * * * * UTC" }
    codec => "plain"
  }
}

What does your output look like? Based on your input config, that should get the data into the pipeline; it could be that Logstash/Elasticsearch doesn't know what to do with it afterwards, which is where grok comes in.

This is the whole config file. Everything works fine and the data streams if the input comes from the file rather than HTTP.

input {
  http_poller {
    urls => {
      logs => "http://www.almhuette-raith.at/apache-log/access.log"
    }
    schedule => { cron => "* * * * * UTC" }
    codec => "plain"
  }
}

filter {
  grok {
    match => { "message" => '%{IPV4:clientip} %{NOTSPACE:ER} %{NOTSPACE:EO} \[%{HTTPDATE:timestamp}\] \"%{NOTSPACE:Method} %{DATA:request} HTTP/%{NUMBER:httpversion}\" %{NUMBER:HTTPStatus} %{NOTSPACE:ObjectSize} %{QS:referrer} %{QS:User_Agent} %{QS:What}' }
  }

  mutate {
    gsub => [
      "ObjectSize", "-", "0"
    ]
    convert => { "ObjectSize" => "integer" }
    convert => { "HTTPStatus" => "integer" }
    convert => { "httpversion" => "float" }
    add_field => { "domain" => "almhuetteraith.at" }
  }

  date {
    match => [ "timestamp", "dd/MMM/YYYY:HH:mm:ss Z" ]
  }

  geoip {
    source => "clientip"
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "real"
    user => "elastic"
    password => "*********"
  }
  stdout { codec => rubydebug }
}

And when I run the pipeline, it runs without any errors. This is the output:

Sending Logstash's logs to /****/logstash-6.2.2/logs which is now configured via log4j2.properties
[2018-03-17T22:26:47,891][INFO ][logstash.modules.scaffold] Initializing module {:module_name=>"fb_apache", :directory=>"/***/logstash-6.2.2/modules/fb_apache/configuration"}
[2018-03-17T22:26:47,923][INFO ][logstash.modules.scaffold] Initializing module {:module_name=>"netflow", :directory=>"/****/logstash-6.2.2/modules/netflow/configuration"}
[2018-03-17T22:26:49,325][INFO ][logstash.modules.scaffold] Initializing module {:module_name=>"arcsight", :directory=>"/*****/logstash-6.2.2/vendor/bundle/jruby/2.3.0/gems/x-pack-6.2.2-java/modules/arcsight/configuration"}
[2018-03-17T22:26:49,880][WARN ][logstash.config.source.multilocal] Ignoring the 'pipelines.yml' file because modules or command line options are specified
[2018-03-17T22:26:50,796][INFO ][logstash.runner          ] Starting Logstash {"logstash.version"=>"6.2.2"}
[2018-03-17T22:26:51,370][INFO ][logstash.agent           ] Successfully started Logstash API endpoint {:port=>9600}
[2018-03-17T22:26:59,378][INFO ][logstash.pipeline        ] Starting pipeline {:pipeline_id=>"main", "pipeline.workers"=>4, "pipeline.batch.size"=>125, "pipeline.batch.delay"=>50}
[2018-03-17T22:27:00,377][INFO ][logstash.outputs.elasticsearch] Elasticsearch pool URLs updated {:changes=>{:removed=>[], :added=>[http://elastic:xxxxxx@localhost:9200/]}}
[2018-03-17T22:27:00,428][INFO ][logstash.outputs.elasticsearch] Running health check to see if an Elasticsearch connection is working {:healthcheck_url=>http://elastic:xxxxxx@localhost:9200/, :path=>"/"}
[2018-03-17T22:27:01,233][WARN ][logstash.outputs.elasticsearch] Restored connection to ES instance {:url=>"http://elastic:xxxxxx@localhost:9200/"}
[2018-03-17T22:27:01,386][INFO ][logstash.outputs.elasticsearch] ES Output version determined {:es_version=>nil}
[2018-03-17T22:27:01,392][WARN ][logstash.outputs.elasticsearch] Detected a 6.x and above cluster: the `type` event field won't be used to determine the document _type {:es_version=>6}
[2018-03-17T22:27:01,420][INFO ][logstash.outputs.elasticsearch] Using mapping template from {:path=>nil}
[2018-03-17T22:27:01,452][INFO ][logstash.outputs.elasticsearch] Attempting to install template {:manage_template=>{"template"=>"logstash-*", "version"=>60001, "settings"=>{"index.refresh_interval"=>"5s"}, "mappings"=>{"_default_"=>{"dynamic_templates"=>[{"message_field"=>{"path_match"=>"message", "match_mapping_type"=>"string", "mapping"=>{"type"=>"text", "norms"=>false}}}, {"string_fields"=>{"match"=>"*", "match_mapping_type"=>"string", "mapping"=>{"type"=>"text", "norms"=>false, "fields"=>{"keyword"=>{"type"=>"keyword", "ignore_above"=>256}}}}}], "properties"=>{"@timestamp"=>{"type"=>"date"}, "@version"=>{"type"=>"keyword"}, "geoip"=>{"dynamic"=>true, "properties"=>{"ip"=>{"type"=>"ip"}, "location"=>{"type"=>"geo_point"}, "latitude"=>{"type"=>"half_float"}, "longitude"=>{"type"=>"half_float"}}}}}}}}
[2018-03-17T22:27:01,528][INFO ][logstash.outputs.elasticsearch] New Elasticsearch output {:class=>"LogStash::Outputs::ElasticSearch", :hosts=>["//localhost:9200"]}
[2018-03-17T22:27:01,856][INFO ][logstash.filters.geoip   ] Using geoip database {:path=>"/***/logstash-6.2.2/vendor/bundle/jruby/2.3.0/gems/logstash-filter-geoip-5.0.3-java/vendor/GeoLite2-City.mmdb"}
[2018-03-17T22:27:06,983][INFO ][logstash.inputs.http_poller] Registering http_poller Input {:type=>nil, :schedule=>{"cron"=>"* * * * * UTC"}, :timeout=>nil}
[2018-03-17T22:27:07,048][INFO ][logstash.pipeline        ] Pipeline started succesfully {:pipeline_id=>"main", :thread=>"#<Thread:0x7f55886 run>"}
[2018-03-17T22:27:07,210][INFO ][logstash.agent           ] Pipelines running {:count=>1, :pipelines=>["main"]}

Maybe the wrong input plugin was used?

So you get zero output in elasticsearch using the http_poller input?

Yes! Zero output. I don't know if it just takes too long to read the log file and ingest something into Elasticsearch.

I'll wait for a while to check if that was the problem.

What's your logging level in Logstash set to? Have you enabled debugging?

What is the logging level?
No, I have not enabled debugging.

The default log level for Logstash is info. In your logstash.yml, add log.level: debug and restart Logstash. See if that sheds any further light on how the input is functioning.
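For example (the path may differ on your install), in config/logstash.yml:

# enable verbose logging while troubleshooting
log.level: debug

I believe you can also pass --log.level debug on the command line instead of editing the file.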

This is what I got after waiting for a while:

[2018-03-17T22:52:28,124][INFO ][logstash.agent           ] Pipelines running {:count=>1, :pipelines=>["main"]}
java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid34742.hprof ...
Heap dump file created [735808746 bytes in 9.036 secs]
[2018-03-17T23:02:10,330][ERROR][org.logstash.Logstash    ] java.lang.OutOfMemoryError: Java heap space

I'll try to change the log level to debug now.

That's good info right there: Logstash ran out of RAM. About 735 MB was consumed and then Logstash died. My guess is that since your schedule is set to run constantly, it never knows when to stop and process what it has collected. Try changing the schedule to every 5 or 10 seconds and see if you start getting data.

Though the question then becomes where it starts reading from, the top or the bottom of the HTTP result... you may end up with the same data over and over.
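Untested, but http_poller's schedule option also takes an every interval instead of a cron expression, so a sketch like this might be worth a try:

input {
  http_poller {
    urls => {
      logs => "http://www.almhuette-raith.at/apache-log/access.log"
    }
    schedule => { every => "10s" }    # poll every 10 seconds instead of the cron expression
    codec => "plain"
  }
}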

I tried to replace the

logs => "http://www.almhuette-raith.at/apache-log/access.log"

with

logs => "http://www.almhuette-raith.at/apache-log/access.log:80"

and ran it. The output was as follows:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"> <HTML><HEAD> <TITLE>404 Not Found</TITLE> </HEAD><BODY> <H1>Not Found</H1> The requested URL /apache-log/access.log:80 was not found on this server.<P> </BODY></HTML>

Since the page does not exist on port 80, it is reasonable to get the output shown above,
so I thought that the main issue is that Logstash is trying to ingest the whole HTML page as a single input. That's why I got nothing from the real log page.

The question that raises itself is: how would it be possible to let Logstash read the log entries and skip the HTML markup?

Wrong formatting on that; you're telling it to look for a file named access.log:80. You want to put the port number before the sub-directory path:

http://www.almhuette-raith.at:80/apache-log/access.log

Spelling out the same port shouldn't make a difference anyway; by using http://, you're already instructing it to look at port 80.

Yes Walker, I used the wrong format (a non-existent page) on purpose,
since the error page has two lines of text and I thought that Logstash would read each line independently.
But what I saw from the message is that Logstash read the whole page as one message.
So if it reads the whole page as a single input or message, I think it did the same on the correct page, which has millions of log entries, and Logstash is trying to ingest them all as a single message,
like:

  "message" => " <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"> 
    <HTML>
    <HEAD> 
     </HEAD><BODY> log1
    log2
    log3
    log4
    log 5
    .
    .
    .
    .
    log 100000000
    </BODY></HTML>"

Lightbulb!

From the http_poller documentation: "Reads from a list of urls and decodes the body of the response with a codec."

From the plain codec documentation: "The 'plain' codec is for plain text with no delimiting between events."

That seems to fall in line with what you're experiencing: everything is streamed in as one big line. So let's change the codec from plain to multiline. Though the problem then becomes how to tell the multiline codec when to start a new event. It appears it'd need a numeric expression... I'm not sure what that regex would look like, but my guess is below.

codec => multiline {
  pattern => "[0-9]"
  what => "next"
}

That seems to fall in line with what you're experiencing: everything is streamed in as one big line. So let's change the codec from plain to multiline.

I believe you've diagnosed the problem correctly but the suggested cure is wrong. Use the line codec instead.
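For reference (untested), the input would look something like this with the line codec, keeping your URL and schedule as they are:

input {
  http_poller {
    urls => {
      logs => "http://www.almhuette-raith.at/apache-log/access.log"
    }
    schedule => { cron => "* * * * * UTC" }
    # the line codec splits the response body on newlines, so each log line becomes its own event
    codec => "line"
  }
}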

I've used the line codec, and it just worked for me! YAY.

Thank you so much Walker and @magnusbaeck.