Problem processing squid logs with logstash

Hello, everyone.

I have been racking my brain over this issue.

Logstash configuration has the following grok statement:

%{MONTH}\s+%{MONTHDAY}\s+%{TIME}\s+%{IPORHOST:src_proxy}?\s+\S+\s+%{BASE16FLOAT:timestamp}\s+%{NUMBER:request_msec:float}\s+%{IPV4:src_ip}\s+%{WORD:cache_result}/%{NUMBER:response_status:int}\s+%{NUMBER:response_size:int}\s+%{WORD:http_method}\s+(%{URIPROTO:http_proto}://)?%{IPORHOSTWITHUNDERSCORE:dst_host}(?::%{POSINT:port:int})?(?:%{URIPATHPARAM:uri_param})?\s+%{DATA:cache_user}\s+%{DATA:request_route}/(?:%{IPORHOST:forwarded_to}|-)\s(?:%{GREEDYDATA:content_type}|-)?

It uses the following custom patterns:

HOSTWITHUNDERSCORE \b(?:[0-9A-Za-z][0-9A-Za-z_-]{0,62})(?:.(?:[0-9A-Za-z][0-9A-Za-z_-]{0,62}))*(.?|\b)
IPORHOSTWITHUNDERSCORE (?:%{IP}|%{HOSTWITHUNDERSCORE})

The following squid log lines end up tagged with _grokparsefailure when Logstash processes them on the server. However, they match cleanly in the online grok tester ( http://grokconstructor.appspot.com/do/match#result ):

Sep 21 08:40:24 proxy squid[3635]: 1474461624.308 87 1.1.1.13 TCP_MISS/200 313 GET http://aax.amazon-adsystem.com/x/px/IDfEVLsFwE5MoEa-0noSYhoAAAFXTMLUDQEAAAzmEXiz-w/{"adCsm":%20[{"vfrd":1,"dbg":"366x47"},{"lteu":"0.08","ltut":"0.05","ltpq":"0.24","ltvd":"0.22","lths":"0.16","ltpm":"0.28","ltfm":"0.65","csmTot":"5.06"}],%20"pixelId":%20"gc1383zqwx4aq0k9",%20"ts":%201474461624324}&cb=8238899 user1 USERHASH_PARENT/path1.ext image/gif

Sep 21 09:11:10 proxy squid[3635]: 1474463470.424 46 1.1.1.14 TCP_MISS/302 610 GET http://dpm.demdex.net/ibs:dpid=30862&puuid=2392662699457081&redir=https%3A%2F%2Ft.mookie1.com%2Ft%2Fv1%2Fevent%3FmigClientId%3L7413%26migAction%3Dsync%26migSource%3Dmig%26migParam1%3D${DD_UUID} user2 USERHASH_PARENT/path3.ext -

I cannot understand why they are not being processed. The configuration seems correct. How can I troubleshoot further?

I will greatly appreciate your insights.

Have you investigated which part of the expression is causing the mismatch? Start with the simplest possible expression, e.g. %{MONTH}, and add more pieces one at a time until you get _grokparsefailure again.
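
A minimal sketch of such an incremental test, assuming a throwaway pipeline that reads lines from stdin and prints the parsed event to stdout. The file name test.conf is hypothetical, and the pattern shown is only the first three pieces of your expression:

```
# test.conf (hypothetical file name): paste a log line on stdin, inspect the event on stdout
input { stdin { } }

filter {
  grok {
    # Start with a short prefix of the full expression, then append one piece at a time
    match => { "message" => "%{MONTH}\s+%{MONTHDAY}\s+%{TIME}" }
  }
}

output { stdout { codec => rubydebug } }
```

Run it with bin/logstash -f test.conf, paste one of the failing lines, and check whether the tags field contains _grokparsefailure. Keep growing the expression until the tag shows up; the piece you added last is the one to look at. Once you reach the pieces that use your custom patterns, the grok filter will also need its patterns_dir option pointing at the directory that holds them.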

HOSTWITHUNDERSCORE \b(?:[0-9A-Za-z][0-9A-Za-z_\-]{0,62})(?:.(?:[0-9A-Za-z][0-9A-Za-z_\-]{0,62}))*(.?|\b)

I think this kind of grok pattern is a mistake. I don't think Logstash's purpose is to validate hostnames, so I'd use the simplest possible expression. The hostname ends when we encounter either a colon or a slash, yes? Hence:

HOSTWITHUNDERSCORE [^:/]+
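
For completeness, a sketch of what the custom pattern file could then look like. The file name is hypothetical, and I have not verified that this change alone clears the failure on your two sample lines:

```
# patterns/squid (hypothetical file in the directory named by the grok filter's patterns_dir option)
# The hostname is simply everything up to the next colon or slash
HOSTWITHUNDERSCORE [^:/]+
IPORHOSTWITHUNDERSCORE (?:%{IP}|%{HOSTWITHUNDERSCORE})
```

IPORHOSTWITHUNDERSCORE stays as it was, since it only refers to HOSTWITHUNDERSCORE by name.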