We have some URIs with 8-bit characters that Grok fails to parse, even though they are valid URIs. I've confirmed with the debugger that the match fails at the URIPATH field in our logs.
Those are not valid URIs. They may work, but they do not conform to the RFCs. Any character outside the ASCII set the RFCs allow, which here means all of the Cyrillic letters and the « » quotes, should be percent-encoded. So бц-«леонардо»-фаза-2 should be %D0%B1%D1%86-%C2%AB%D0%BB%D0%B5%D0%BE%D0%BD%D0%B0%D1%80%D0%B4%D0%BE%C2%BB-%D1%84%D0%B0%D0%B7%D0%B0-2
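As a rough illustration of the encoding described above, here is a minimal Python sketch that percent-encodes the example path with the standard library. The helper name `encode_path` is made up for this example, and the sketch assumes the path is UTF-8 text; it is not something Logstash or IIS does for you.

```python
from urllib.parse import quote, unquote


def encode_path(path: str) -> str:
    """Percent-encode a URI path (hypothetical helper for illustration)."""
    # quote() encodes the string as UTF-8 and percent-encodes every byte
    # outside its safe set. "/" is kept so path separators survive.
    return quote(path, safe="/")


raw = "/бц-«леонардо»-фаза-2"
encoded = encode_path(raw)

print(encoded)           # ASCII-only form of the path
print(unquote(encoded))  # decodes back to the original text
```

The encoded form contains only ASCII characters, and decoding it returns the original path unchanged.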
URIPATH is defined as
URIPATH (?:/[A-Za-z0-9$.+!*'(){},~:;=@#%&_\-]*)+
so it only matches RFC-conformant URIs. If you use
Thanks, Badger. These are the URIs in our IIS logs, and they are generating 200 responses. It looks like I can use the GREEDYDATA pattern to ingest the path as needed. I'm expecting the index to take this in as-is so we can search on it.
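To make the difference between the two patterns concrete, here is a small Python sketch. It is only an approximation: Grok uses the Oniguruma regex engine, not Python's `re`, and `fullmatch` stands in for the way the path has to line up with the rest of a real grok expression. The URIPATH expression is copied from the definition quoted above, GREEDYDATA is taken as `.*`, and the helper `matches` is made up for this example.

```python
import re
from urllib.parse import quote

# URIPATH as quoted in this thread; GREEDYDATA is ".*" in the stock grok patterns.
URIPATH = re.compile(r"(?:/[A-Za-z0-9$.+!*'(){},~:;=@#%&_\-]*)+")
GREEDYDATA = re.compile(r".*")


def matches(pattern: re.Pattern, text: str) -> bool:
    """Return True if the whole string matches (hypothetical helper)."""
    return pattern.fullmatch(text) is not None


raw_path = "/бц-«леонардо»-фаза-2"
encoded_path = quote(raw_path, safe="/")

print("URIPATH, raw path:       ", matches(URIPATH, raw_path))
print("URIPATH, encoded path:   ", matches(URIPATH, encoded_path))
print("GREEDYDATA, raw path:    ", matches(GREEDYDATA, raw_path))
```

In this sketch URIPATH rejects the raw path and accepts the percent-encoded one, while GREEDYDATA accepts the raw path as-is. The trade-off is that GREEDYDATA does no validation at all, so it will capture whatever is in that position of the log line.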