Even though it's primarily about categorizing non-English logs, you might find this blog interesting. The part most relevant to the problem you're seeing is:
Two things to be aware of when customizing the categorization_analyzer are:
- Although techniques such as lowercasing, stemming and decompounding work well for search, for categorizing machine-generated log messages it’s best not to do these things. For example, stemming rules out the possibility of distinguishing “service starting” from “service started”. In human-generated text this could be appropriate, as people use slightly different words when writing about the same thing. But for machine-generated log messages from a given program, different words mean a different message.
- The tokens generated by the categorization_analyzer need to be sufficiently similar to those generated by the analyzer used at index time that when you search for them you’ll match the original message. This is required in order for drilldown from category definitions to the original data to work.
It seems that you've run into this without customizing anything because the ml_classic tokenizer splits on colons but the standard tokenizer doesn't.
You can make things work by recreating your job with a custom categorization_analyzer that uses the standard tokenizer. Basically, add a section into your job JSON like this:
"categorization_analyzer" : {
  "tokenizer" : "standard",
  "filter" : [
    { "type" : "pattern_replace", "pattern": "^[0-9].*" },
    { "type" : "stop", "stopwords" : [
      "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
      "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
      "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
      "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
      "GMT", "UTC"
    ] }
  ]
}
(If you know the field you're categorizing doesn't contain dates then you can omit the stop filter to make it more concise and efficient.)
The docs for the categorization_analyzer show an example in the context of a full job config if you need it.
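For orientation, here is a rough sketch of where the categorization_analyzer section sits inside a complete job body. It is not copied from the docs. The field name "message", the time field "@timestamp", the 15 minute bucket span and the single count detector are all assumptions for illustration, so substitute the values from your existing job. The stop filter is left out to keep the sketch short; put it back if the field you're categorizing can contain dates.

```json
{
  "description": "Illustrative categorization job using the standard tokenizer (field names are assumed)",
  "analysis_config": {
    "bucket_span": "15m",
    "categorization_field_name": "message",
    "categorization_analyzer": {
      "tokenizer": "standard",
      "filter": [
        { "type": "pattern_replace", "pattern": "^[0-9].*" }
      ]
    },
    "detectors": [
      { "function": "count", "by_field_name": "mlcategory" }
    ]
  },
  "data_description": {
    "time_field": "@timestamp"
  }
}
```

The categorization_analyzer goes at the same level as categorization_field_name inside analysis_config. Since the change is made by recreating the job, this would be the body you send when creating the replacement job.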