URIPATH with other languages failing Grok parse

We have some URIs containing non-ASCII (8-bit) characters that Grok fails to parse, even though they look like valid URIs to us. I've confirmed with the debugger that the match fails at the URIPATH field in our logs.

/uk-UA/Properties/бц-«леонардо»-фаза-2/UKR-вул-бхмельницького-19-21/UKR55000015

/vi-VN/Properties/tòa-nhà-văn-phòng-acb/VNM-444a-446-cách-mạng-tháng-tám-phường-11-quận-3-thành-phố-hồ-chí-minh/VNM26000130

/fr-CA/propriétés/espace-bureau-dans-le-centroparc-de-mascouche/CAN-701-louis-blériot-street-mascouche-quebec-canada/CAN2005057

/en/Properties/three-parcels-totaling-±7119-acres-in-rocky-hill-cromwell-ct-for-sale/USA-7-belamose-ave-6-pleasant-valley-rd-rocky-hill-ct-700r-main-street-cromwell-ct/USA1066860

How can we get these URIs with different characters to parse correctly in Grok as a valid path?

Those are not valid URIs. They may work, but they do not conform to the RFCs. Every character outside the ASCII range, including all the letters outside [A-Za-z], should be percent-encoded (URI encoded). So бц-«леонардо»-фаза-2 should be %D0%B1%D1%86-%C2%AB%D0%BB%D0%B5%D0%BE%D0%BD%D0%B0%D1%80%D0%B4%D0%BE%C2%BB-%D1%84%D0%B0%D0%B7%D0%B0-2
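If you want to check the encoded form yourself outside Logstash, here is a minimal sketch using Python's standard library. It assumes the path is UTF-8, which is what the encoded example above implies; the variable names are only for illustration.

```python
from urllib.parse import quote

# Path segment copied from the first failing URI in the question.
segment = "бц-«леонардо»-фаза-2"

# quote() encodes the text as UTF-8 and percent-escapes every byte that is
# not an ASCII letter, a digit, or one of "_.-~" ("/" is also kept by default).
encoded = quote(segment)
print(encoded)
```

The printed value should be the same percent-encoded string as the one shown above.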

URIPATH is defined as

URIPATH (?:/[A-Za-z0-9$.+!*'(){},~:;=@#%&_\-]*)+

so it only matches RFC-conformant URIs. If you use

pattern_definitions => { "URI" => "(?:/[[[:alpha:]]0-9$.+!*'(){},~:;=@#%&_\-]*)+" }

it will match the first three. The ± causes the last one not to match. You can add whatever you want to the pattern to get your use case to work.
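To see why the stock URIPATH definition rejects the raw paths but would accept their encoded forms, here is a rough check in Python rather than in Grok itself. The character class is copied from the URIPATH definition quoted above. Because that class is plain ASCII, Python's `re` module should read it the same way for this purpose, but this is an approximation of Grok's regex engine, not a substitute for testing in Logstash. The helper name `matches_whole_path` is made up for this sketch.

```python
import re
from urllib.parse import quote

# Character class copied from the stock URIPATH definition.
URIPATH = re.compile(r"(?:/[A-Za-z0-9$.+!*'(){},~:;=@#%&_\-]*)+")

# Third example path from the question (fr-CA).
raw_path = (
    "/fr-CA/propriétés/espace-bureau-dans-le-centroparc-de-mascouche"
    "/CAN-701-louis-blériot-street-mascouche-quebec-canada/CAN2005057"
)


def matches_whole_path(path: str) -> bool:
    """Return True only if the pattern covers the entire path."""
    return URIPATH.fullmatch(path) is not None


# The raw path contains "é", which is not in the character class.
raw_ok = matches_whole_path(raw_path)

# After percent-encoding, only ASCII letters, digits, "-", "/" and "%XX" remain.
encoded_ok = matches_whole_path(quote(raw_path))

print(raw_ok, encoded_ok)
```

The check uses a full match on purpose: an unanchored pattern can still report a partial match that stops at the first character outside the class, which is easy to mistake for success.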

Thanks, Badger. These are the URIs in our IIS logs, and they are generating 200 responses. It looks like I can use the GREEDYDATA pattern to ingest the path as needed. I'm expecting the index to take this in as-is so we can search on it.

2020-07-04 03:08:14 A0155V1WWEB0001 10.183.238.115 GET /es-MX/Propiedades/oficina-disponible-para-venta-torres-bioparque-ciudad-de-méxico/MEX-avenida-central-254-colonia-carola-del-álvaro-obregón-ciudad-de-méxico/MEX4001855 - 443 - 10.183.238.104 Mozilla/4.0+(compatible+;+MSIE+6.0;+Windows+NT+5.1) - www2.colliers.com 200 0 64 0 641 2140 189.174.126.159,+23.64.141.52,+23.205.127.76
