Bing API JSON - how to parse?


(Josh Brower) #1

I am using the HTTP Poller to hit the Bing search API, which returns JSON. Unfortunately, it does not appear to be parsed correctly. (http://screencast.com/t/xCSAc7hd) It is just being returned as one log entry, as a single block of text.

What are my next steps to try?

Thanks

Config:


input {
  http_poller {
    urls => {
      test2 => {
        method => get
        url => "https://api.datamarket.azure.com/Bing/Search/Web?Query='sytsmon'&Adult='Off'&$format=json"
        headers => {
          Accept => "application/json"
        }
        auth => {
          user => ""
          password => "***********************"
        }
      }
    }
    request_timeout => 60
    interval => 604800
    codec => json
  }
}

output {
  elasticsearch {
    host => "*********"
    protocol => http
  }
}


(Josh Brower) #2

Here is how the json looks: http://screencast.com/t/Czi6L1nRx

I would like to have each item under Results be a new log entry....


(Jay Greenberg) #3

@DefensiveDepth,

Try the JSON Filter.

If that does not work, please provide me a gist of the sample input.

Thanks


(Josh Brower) #4

@PhaedrusTheGreek

Thanks, but still no go. Get a jsonparsefailure. I think it is because there is an array within the json.

Here are the gists of the raw input data from the Bing api:

raw: https://gist.github.com/defensivedepth/434de9e801bca9d5314f

cleaned-up a bit: https://gist.github.com/defensivedepth/cd69e1cbf14df16fa181

(Not sure why they aren't line-wrapping properly)


(Jay Greenberg) #5

@DefensiveDepth

This configuration seems to work. What is the difference between the 2 URL outputs?

input {
 http_poller {
  urls => {
   test2 => {
    method => get
    url => "https://gist.githubusercontent.com/defensivedepth/434de9e801bca9d5314f/raw/bee1be0df429272e21c5c6ecbff908b0f3d7e851/example-raw.json"
    headers => {
     Accept => "application/json"
    }
   }
  }
  request_timeout => 60
  interval => 604800
  codec => json
 }
}

output {
  stdout { codec => rubydebug }
}

Resulted in much of this:

...
           [49] {
                 "__metadata" => {
                     "uri" => "https://api.datamarket.azure.com/Data.ashx/Bing/Search/Web?Query='sysmon'&Adult='Off'&$skip=49&$top=1",
                    "type" => "WebResult"
                },
                         "ID" => "58acff15-3a21-404e-95a7-2db01fd98004",
                      "Title" => "Download System Monitor (Sysmon) - MajorGeeks",
                "Description" => "System Monitor (Sysmon) is a Windows system service and device driver that, once installed on a system, remains resident across system reboots to monitor and log ...",
                 "DisplayUrl" => "www.majorgeeks.com/files/details/sysmon.html",
                        "Url" => "http://www.majorgeeks.com/files/details/sysmon.html"
            }
        ],
         "__next" => "https://api.datamarket.azure.com/Data.ashx/Bing/Search/Web?Query='sysmon'&Adult='Off'&$skip=50"
    },
      "@version" => "1",
    "@timestamp" => "2015-10-21T20:26:24.703Z"
...

(Jay Greenberg) #6

BTW, what versions are you on?

me:logstash-1.5.4 jay$ bin/logstash -V
logstash 1.5.4
me:logstash-1.5.4 jay$ bin/plugin list --verbose | grep poller
logstash-input-http_poller (1.1.2)

(Josh Brower) #7

Same versions as you, running logstash on Win10 x64, ES on a separate Ubuntu system.

Using the config you posted & the raw json on the gist, I sent it to my ES instance (after adding the ES output config). From stdout I can confirm that it broke out like yours.

Here is what I am seeing on ES:
http://screencast.com/t/qX5mMyZWeh7

The Results are not broken out, and are still un-indexed:
http://screencast.com/t/wiE5vSMaO8gF

I am newer to ELK, so please excuse my ignorance if I am missing something obvious.

Thanks


(Jay Greenberg) #8

@DefensiveDepth,

I think I know what you mean now. The problem is that you are loading an array of results into a single Elasticsearch document, which isn't of much use. Kibana says "Objects in Arrays are not well supported", but what you really need is to split out each "result" into its own document so that you can search individual results instead of searching a single field for everything.

Adding this filter will split each Bing Result into a separate Elasticsearch document:

filter {
  split {
        field => "[d][results]"
  }
}
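
For anyone wondering what the split actually does to the event, here is a rough Python sketch of the idea. This is not the plugin's code; `split_event` and the trimmed sample event are made up for illustration. The event is cloned once per array element, and in each clone the array at [d][results] is replaced by that single element:

```python
import copy


def split_event(event, path):
    """Return one copy of `event` per element of the list found at `path`.

    Hypothetical helper that mimics the effect of the split filter:
    each copy keeps all other fields, and the list at `path` is
    replaced by a single element of that list.
    """
    *parents, leaf = path

    # Walk down to the dict that holds the list we want to split on.
    container = event
    for key in parents:
        container = container[key]

    clones = []
    for item in container[leaf]:
        clone = copy.deepcopy(event)
        target = clone
        for key in parents:
            target = target[key]
        target[leaf] = item  # the list becomes one element
        clones.append(clone)
    return clones


# Trimmed stand-in for the event produced by the json codec.
event = {
    "@version": "1",
    "d": {
        "results": [
            {"Title": "System Monitor (Windows)"},
            {"Title": "Download System Monitor (Sysmon) - MajorGeeks"},
        ]
    },
}

docs = split_event(event, ["d", "results"])
for doc in docs:
    print(doc["d"]["results"]["Title"])
```

Each element of `docs` corresponds to what would become one Elasticsearch document, which is why d.results.Title becomes an ordinary searchable field afterwards.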

Then you can make better use of Kibana, or search them like this:

curl -XGET "http://localhost:9200/logstash-2015.10.22/_search?q=d.results.Title:monitor&_source=d.results.Title"

Yields:

{
   "took": 4,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 8,
      "max_score": 0.9181428,
      "hits": [
         {
            "_index": "logstash-2015.10.22",
            "_type": "logs",
            "_id": "AVCPzC6Frc7DPZQz42Rw",
            "_score": 0.9181428,
            "_source": {
               "d": {
                  "results": {
                     "Title": "UltraScale Architecture System Monitor - Xilinx"
                  }
               }
            }
         },
         {
            "_index": "logstash-2015.10.22",
            "_type": "logs",
            "_id": "AVCPzC6Erc7DPZQz42Rj",
            "_score": 0.8942287,
            "_source": {
               "d": {
                  "results": {
                     "Title": "System Monitor (Windows)"
                  }
               }
            }
         },
         {
            "_index": "logstash-2015.10.22",
            "_type": "logs",
            "_id": "AVCPzC6Frc7DPZQz42Ro",
            "_score": 0.78697956,
            "_source": {
               "d": {
                  "results": {
                     "Title": "Sysinternals New Tool Sysmon (System Monitor)"
                  }
               }
            }
         },
         {
            "_index": "logstash-2015.10.22",
            "_type": "logs",
            "_id": "AVCPzC6Frc7DPZQz42Rt",
            "_score": 0.78697956,
            "_source": {
               "d": {
                  "results": {
                     "Title": "SysMon System Monitor | Windows CMD | SS64.com"
                  }
               }
            }
         },
         {
            "_index": "logstash-2015.10.22",
            "_type": "logs",
            "_id": "AVCPzC6Frc7DPZQz42SB",
            "_score": 0.78697956,
            "_source": {
               "d": {
                  "results": {
                     "Title": "Sysmon v2.0 - System Activity Monitor for Windows"
                  }
               }
            }
         },
         {
            "_index": "logstash-2015.10.22",
            "_type": "logs",
            "_id": "AVCPzC6Frc7DPZQz42ST",
            "_score": 0.7824501,
            "_source": {
               "d": {
                  "results": {
                     "Title": "Download System Monitor (Sysmon) - MajorGeeks"
                  }
               }
            }
         },
         {
            "_index": "logstash-2015.10.22",
            "_type": "logs",
            "_id": "AVCPzC6Frc7DPZQz42SO",
            "_score": 0.6706715,
            "_source": {
               "d": {
                  "results": {
                     "Title": "Sysinternals Sysmon system monitor for Windows"
                  }
               }
            }
         },
         {
            "_index": "logstash-2015.10.22",
            "_type": "logs",
            "_id": "AVCPzC6Frc7DPZQz42SE",
            "_score": 0.55889297,
            "_source": {
               "d": {
                  "results": {
                     "Title": "Using Sysinternals System Monitor (Sysmon) in a Malware ..."
                  }
               }
            }
         }
      ]
   }
}
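
If you want to pull the titles out of a response like that in a script, something along these lines should do it. It is only a sketch: `extract_titles` is a made-up helper, and the sample text is a cut-down copy of the response above (first two hits only) rather than a live call to Elasticsearch:

```python
import json

# Cut-down copy of the search response shown above (first two hits only).
response_text = """
{
  "hits": {
    "total": 8,
    "hits": [
      {
        "_id": "AVCPzC6Frc7DPZQz42Rw",
        "_score": 0.9181428,
        "_source": {"d": {"results": {"Title": "UltraScale Architecture System Monitor - Xilinx"}}}
      },
      {
        "_id": "AVCPzC6Erc7DPZQz42Rj",
        "_score": 0.8942287,
        "_source": {"d": {"results": {"Title": "System Monitor (Windows)"}}}
      }
    ]
  }
}
"""


def extract_titles(response):
    """Collect d.results.Title from every hit, in the order returned."""
    return [
        hit["_source"]["d"]["results"]["Title"]
        for hit in response["hits"]["hits"]
    ]


titles = extract_titles(json.loads(response_text))
for title in titles:
    print(title)
```

Note that this only works on the split documents, where d.results is a single object per hit rather than an array.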

(Josh Brower) #9

Fantastic, worked great!

I thought I might need to use the split function, but couldn't get the syntax correct.

Thanks for your help, much appreciated.

