Webpage Input Plugin

Hi,

I'm wondering if there is a way to input webpage data into Elasticsearch using a Logstash pipeline.
I have a webpage that is publicly accessible, like website/logs/access.log. What input plugin should I use?

Thanks,
Salma

Are you trying to ingest IIS/Samba logs?

Hi wwalker,

Thanks for your reply.
I don't know what IIS/Samba logs are. The logs that I'm trying to ingest are web access logs generated on a specific webpage.

IIS and Samba are popular applications for hosting websites; IIS is native to Windows Server, and Samba is used a lot on Linux. I haven't done it yet, but I imagine you'd use the file input and then the grok filter to pull out the fields you want processed. I wish I could help you more, but grok is something I've only read through briefly and was left scratching my head.
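Untested, but something like this is roughly what I have in mind, assuming an Apache-style access log sitting at a hypothetical path like /var/log/access.log and assuming the stock COMBINEDAPACHELOG grok pattern matches its format:

input {
  file {
    path => "/var/log/access.log"    # hypothetical path to a local copy of the log
    start_position => "beginning"    # read the whole file on the first run
    sincedb_path => "/dev/null"      # don't remember the read position (testing only)
  }
}

filter {
  grok {
    # COMBINEDAPACHELOG ships with Logstash and covers the common Apache combined format
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}

output {
  stdout { codec => rubydebug }
}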

I don't own the website; I found the link on GitHub when I was searching for publicly accessible web access logs.
Here is the link that I want to stream the logs from:
http://www.almhuette-raith.at/apache-log/access.log
So I'm not sure if this website is hosted using IIS or Samba.

I tried to download the web log from the website and ingest it using the file input plugin, and it worked fine for me, but the problem is that the events are not in real time, since I only downloaded a copy of this website's web access log.

What I want to do is make Logstash connect directly to the access-log webpage and stream the data in real time.

I've tried this input, but it didn't collect any data for me:

input {
  http_poller {
    urls => {
      logs => "http://www.almhuette-raith.at/apache-log/access.log"
    }
    schedule => { cron => "* * * * * UTC" }
    codec => "plain"
  }
}

What does your output look like? Based on your input config, that should get the data into the pipeline; it could be that Logstash/Elasticsearch doesn't know what to do with it afterwards, which is where grok comes in.

This is the whole config file. Everything works fine and the data streams if the input comes from the file rather than HTTP.

input {
  http_poller {
    urls => {
      logs => "http://www.almhuette-raith.at/apache-log/access.log"
    }
    schedule => { cron => "* * * * * UTC" }
    codec => "plain"
  }
}

filter {
  grok {
    match => { "message" => '%{IPV4:clientip} %{NOTSPACE:ER} %{NOTSPACE:EO} \[%{HTTPDATE:timestamp}\] \"%{NOTSPACE:Method} %{DATA:request} HTTP/%{NUMBER:httpversion}\" %{NUMBER:HTTPStatus} %{NOTSPACE:ObjectSize} %{QS:referrer} %{QS:User_Agent} %{QS:What}' }
  }

  mutate {
    gsub => [
      "ObjectSize", "-", "0"
    ]
    convert => { "ObjectSize" => "integer" }
    convert => { "HTTPStatus" => "integer" }
    convert => { "httpversion" => "float" }
    add_field => { "domain" => "almhuetteraith.at" }
  }

  date {
    match => [ "timestamp", "dd/MMM/YYYY:HH:mm:ss Z" ]
  }

  geoip {
    source => "clientip"
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "real"
    user => "elastic"
    password => "*********"
  }
  stdout { codec => rubydebug }
}

And when I run the pipeline, it runs without any errors. This is the output:

Sending Logstash's logs to /****/logstash-6.2.2/logs which is now configured via log4j2.properties
[2018-03-17T22:26:47,891][INFO ][logstash.modules.scaffold] Initializing module {:module_name=>"fb_apache", :directory=>"/***/logstash-6.2.2/modules/fb_apache/configuration"}
[2018-03-17T22:26:47,923][INFO ][logstash.modules.scaffold] Initializing module {:module_name=>"netflow", :directory=>"/****/logstash-6.2.2/modules/netflow/configuration"}
[2018-03-17T22:26:49,325][INFO ][logstash.modules.scaffold] Initializing module {:module_name=>"arcsight", :directory=>"/*****/logstash-6.2.2/vendor/bundle/jruby/2.3.0/gems/x-pack-6.2.2-java/modules/arcsight/configuration"}
[2018-03-17T22:26:49,880][WARN ][logstash.config.source.multilocal] Ignoring the 'pipelines.yml' file because modules or command line options are specified
[2018-03-17T22:26:50,796][INFO ][logstash.runner          ] Starting Logstash {"logstash.version"=>"6.2.2"}
[2018-03-17T22:26:51,370][INFO ][logstash.agent           ] Successfully started Logstash API endpoint {:port=>9600}
[2018-03-17T22:26:59,378][INFO ][logstash.pipeline        ] Starting pipeline {:pipeline_id=>"main", "pipeline.workers"=>4, "pipeline.batch.size"=>125, "pipeline.batch.delay"=>50}
[2018-03-17T22:27:00,377][INFO ][logstash.outputs.elasticsearch] Elasticsearch pool URLs updated {:changes=>{:removed=>[], :added=>[http://elastic:xxxxxx@localhost:9200/]}}
[2018-03-17T22:27:00,428][INFO ][logstash.outputs.elasticsearch] Running health check to see if an Elasticsearch connection is working {:healthcheck_url=>http://elastic:xxxxxx@localhost:9200/, :path=>"/"}
[2018-03-17T22:27:01,233][WARN ][logstash.outputs.elasticsearch] Restored connection to ES instance {:url=>"http://elastic:xxxxxx@localhost:9200/"}
[2018-03-17T22:27:01,386][INFO ][logstash.outputs.elasticsearch] ES Output version determined {:es_version=>nil}
[2018-03-17T22:27:01,392][WARN ][logstash.outputs.elasticsearch] Detected a 6.x and above cluster: the `type` event field won't be used to determine the document _type {:es_version=>6}
[2018-03-17T22:27:01,420][INFO ][logstash.outputs.elasticsearch] Using mapping template from {:path=>nil}
[2018-03-17T22:27:01,452][INFO ][logstash.outputs.elasticsearch] Attempting to install template {:manage_template=>{"template"=>"logstash-*", "version"=>60001, "settings"=>{"index.refresh_interval"=>"5s"}, "mappings"=>{"_default_"=>{"dynamic_templates"=>[{"message_field"=>{"path_match"=>"message", "match_mapping_type"=>"string", "mapping"=>{"type"=>"text", "norms"=>false}}}, {"string_fields"=>{"match"=>"*", "match_mapping_type"=>"string", "mapping"=>{"type"=>"text", "norms"=>false, "fields"=>{"keyword"=>{"type"=>"keyword", "ignore_above"=>256}}}}}], "properties"=>{"@timestamp"=>{"type"=>"date"}, "@version"=>{"type"=>"keyword"}, "geoip"=>{"dynamic"=>true, "properties"=>{"ip"=>{"type"=>"ip"}, "location"=>{"type"=>"geo_point"}, "latitude"=>{"type"=>"half_float"}, "longitude"=>{"type"=>"half_float"}}}}}}}}
[2018-03-17T22:27:01,528][INFO ][logstash.outputs.elasticsearch] New Elasticsearch output {:class=>"LogStash::Outputs::ElasticSearch", :hosts=>["//localhost:9200"]}
[2018-03-17T22:27:01,856][INFO ][logstash.filters.geoip   ] Using geoip database {:path=>"/***/logstash-6.2.2/vendor/bundle/jruby/2.3.0/gems/logstash-filter-geoip-5.0.3-java/vendor/GeoLite2-City.mmdb"}
[2018-03-17T22:27:06,983][INFO ][logstash.inputs.http_poller] Registering http_poller Input {:type=>nil, :schedule=>{"cron"=>"* * * * * UTC"}, :timeout=>nil}
[2018-03-17T22:27:07,048][INFO ][logstash.pipeline        ] Pipeline started succesfully {:pipeline_id=>"main", :thread=>"#<Thread:0x7f55886 run>"}
[2018-03-17T22:27:07,210][INFO ][logstash.agent           ] Pipelines running {:count=>1, :pipelines=>["main"]}

Maybe the wrong input plugin was used?

So you get zero output in elasticsearch using the http_poller input?

Yes! Zero output. I don't know if it just takes too long to read the log file and ingest something into Elasticsearch.

I'll wait for a while to check if that was the problem.

What's your logging level in Logstash set to? Have you enabled debugging?

What is the logging level?
No, I have not enabled debugging.

The default log level for Logstash is info. In your logstash.yml, add log.level: debug and restart Logstash. See if that sheds any further light on how the input is functioning.
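For example (the path may differ on your install), in config/logstash.yml:

# enable verbose logging while troubleshooting
log.level: debug

I believe you can also pass --log.level debug on the command line instead of editing the file.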

This is what I got after waiting for a while:

[2018-03-17T22:52:28,124][INFO ][logstash.agent           ] Pipelines running {:count=>1, :pipelines=>["main"]}
java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid34742.hprof ...
Heap dump file created [735808746 bytes in 9.036 secs]
[2018-03-17T23:02:10,330][ERROR][org.logstash.Logstash    ] java.lang.OutOfMemoryError: Java heap space

I'll try to change the log level to debug now.

That's good info right there: Logstash ran out of RAM. About 735 MB was consumed and then Logstash died. My guess is that since your schedule is set to run constantly, it never knows when to stop and process what it has collected. Try changing the schedule to every 5 or 10 seconds and see if you start getting data.

Though the question then becomes where it starts reading from, the top or the bottom of the HTTP result... you may end up with the same data over and over.
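Untested, but http_poller's schedule option also takes an every interval instead of a cron expression, so a sketch like this might be worth a try:

input {
  http_poller {
    urls => {
      logs => "http://www.almhuette-raith.at/apache-log/access.log"
    }
    schedule => { every => "10s" }    # poll every 10 seconds instead of the cron expression
    codec => "plain"
  }
}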

I tried to replace the

logs => "http://www.almhuette-raith.at/apache-log/access.log"

with

logs => "http://www.almhuette-raith.at/apache-log/access.log:80"

and ran it. The output was as follows:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"> <HTML><HEAD> <TITLE>404 Not Found</TITLE> </HEAD><BODY> <H1>Not Found</H1> The requested URL /apache-log/access.log:80 was not found on this server.<P> </BODY></HTML>

Since the page does not exist on port 80, it is reasonable to get the output shown above,
so I thought that the main issue is that Logstash is trying to ingest the whole HTML page as a single input. That's why I got nothing from the real log page.

The question that raises itself is: how would it be possible to let Logstash read the log entries and skip the HTML markup?

Wrong formatting on that; you're telling it to look for a file named access.log:80. You want to put the port number before the sub-directory path:

http://www.almhuette-raith.at:80/apache-log/access.log

Spelling out the same port shouldn't make a difference anyway; by using http://, you're already instructing it to look at port 80.

Yes Walker, I used the wrong format (a non-existent page) on purpose,
since the error page has two lines of text and I thought that Logstash would read each line independently.
But what I saw from the message is that Logstash read the whole page as one message.
So if it reads the whole page as a single input or message, I think it did the same on the correct page, which has millions of log entries, and Logstash is trying to ingest them all as a single message,
like:

  "message" => " <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"> 
    <HTML>
    <HEAD> 
     </HEAD><BODY> log1
    log2
    log3
    log4
    log 5
    .
    .
    .
    .
    log 100000000
    </BODY></HTML>"

Lightbulb!

From the http_poller documentation: "Reads from a list of urls and decodes the body of the response with a codec."

From the plain codec documentation: "The 'plain' codec is for plain text with no delimiting between events."

That seems to fall in line with what you're experiencing: everything is streamed in as one big line. So let's change the codec from plain to multiline. Though the problem then becomes how to tell the multiline codec when to start a new event. It appears it'd need a numeric expression... I'm not sure what that regex would look like, but my guess is below.

codec => multiline {
  pattern => "[0-9]"
  what => "next"
}

That seems to fall in line with what you're experiencing: everything is streamed in as one big line. So let's change the codec from plain to multiline.

I believe you've diagnosed the problem correctly but the suggested cure is wrong. Use the line codec instead.
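For reference (untested), the input would look something like this with the line codec, keeping your URL and schedule as they are:

input {
  http_poller {
    urls => {
      logs => "http://www.almhuette-raith.at/apache-log/access.log"
    }
    schedule => { cron => "* * * * * UTC" }
    # the line codec splits the response body on newlines, so each log line becomes its own event
    codec => "line"
  }
}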

I've used the line codec, and it just worked for me! YAY.

Thank you so much Walker and @magnusbaeck.