Why doesn't this document match?

I create a document in 6.3.1 using

input { generator { count => 1 message => "Foo: 'UserID:xx0001'" } }
output { elasticsearch { hosts => "localhost" index => "deleteme8" } }

If I search for "message: UserID" in Kibana it does not find the document. I found this out when running categorization jobs in ML: the link to show examples would not load any documents, even though ML is under the impression that the document contains UserID. There is no template for the index.
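For reference, the Kibana search bar runs that as a Lucene query_string query, so the failing search is roughly equivalent to this request (deleteme8 being the index from the config above):

POST deleteme8/_search
{
  "query": {
    "query_string": { "query": "message: UserID" }
  }
}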

I figure ML's expectations of how that document will be analyzed don't match reality. Is this an ML issue or an ES issue?

I think we'd need to drop down a level to see the elasticsearch mapping and the raw JSON doc to make any headway on this.

A search for * in the index returns this:

"hits": [
  {
    "_index": "deleteme8",
    "_type": "doc",
    "_id": "ZyUGG2UBe2VO2Oy1ehDH",
    "_score": 1,
    "_source": {
      "message": "Foo: 'UserID:xx0001'",
      "@timestamp": "2018-08-08T19:32:13.828Z",
      "@version": "1",
      "host": "...",
      "sequence": 0
    }
  }
]

GET deleteme8/_mapping returns this:

{
  "deleteme8": {
    "mappings": {
      "doc": {
        "properties": {
          "@timestamp": {
            "type": "date"
          },
          "@version": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "host": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "message": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "sequence": {
            "type": "long"
          }
        }
      }
    }
  }
}

Is that what you are asking for?

Thanks for that. The issue is that the UserID label and its value are indexed together as a single term, userid:xx0001. You can see the way a field is analyzed using the _analyze API:

POST deleteme8/_analyze
{
  "field": "message",
  "text": "Foo: 'UserID:xx0001'"
}

The output of which is:

{
  "tokens": [
    {
      "token": "foo",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "userid:xx0001",
      "start_offset": 6,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

Searching for message:UserID* works, but it may not be very efficient if you have lots of unique user IDs of this form.
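Spelled out as a request, that's something like the following; the trailing * makes it a wildcard/prefix query, which is why it can match the whole userid:xx0001 term:

POST deleteme8/_search
{
  "query": {
    "query_string": { "query": "message:UserID*" }
  }
}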

So the fact that machine learning expects to be able to search for UserID would be an ML defect? As I said, when I click the link to show examples it returns nothing, because having categorized that message, ML does a search for "message: Foo AND message: UserID" to find examples of it.

Taking a guess, I'd say the categorisation job uses a different analyzer from the one used in your search index, and ML winds up discovering tokens that can't then be searched for in a drill-down.

I think it would make sense to re-post this in the ML forum as a new "Can't drill down from categorisation job" issue rather than a core elasticsearch matching issue.

Even though it's primarily about categorizing non-English logs, you might find this blog post interesting. The key part in relation to what you're suffering from is:

Two things to be aware of when customizing the categorization_analyzer are:

  • Although techniques such as lowercasing, stemming and decompounding work well for search, for categorizing machine-generated log messages it’s best not to do these things. For example, stemming rules out the possibility of distinguishing “service starting” from “service started”. In human-generated text this could be appropriate, as people use slightly different words when writing about the same thing. But for machine-generated log messages from a given program, different words mean a different message.
  • The tokens generated by the categorization_analyzer need to be sufficiently similar to those generated by the analyzer used at index time that when you search for them you’ll match the original message. This is required in order for drilldown from category definitions to the original data to work.

It seems that you've run into this without customizing anything because the ml_classic tokenizer splits on colons but the standard tokenizer doesn't.
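You can see the difference with the same _analyze trick; assuming your cluster has X-Pack ML installed (that's what provides the ml_classic tokenizer), this should show UserID coming out as a token of its own:

POST _analyze
{
  "tokenizer": "ml_classic",
  "text": "Foo: 'UserID:xx0001'"
}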

You can make things work by recreating your job with a custom categorization_analyzer that uses the standard tokenizer. Basically, add a section into your job JSON like this:

    "categorization_analyzer" : {
      "tokenizer" : "standard", 
      "filter" : [
        { "type" : "pattern_replace", "pattern": "^[0-9].*" },
        { "type" : "stop", "stopwords" : [
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ] } 
      ]
    }

(If you know the field you're categorizing doesn't contain dates then you can omit the stop filter to make it more concise and efficient.)
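If you want to sanity-check the tokens before recreating the job, you can run the same definition through _analyze. A sketch, with the stop filter omitted since the sample message contains no dates:

POST _analyze
{
  "tokenizer": "standard",
  "filter": [
    { "type": "pattern_replace", "pattern": "^[0-9].*" }
  ],
  "text": "Foo: 'UserID:xx0001'"
}

The tokens should now line up (apart from lowercasing) with the terms the standard analyzer produced at index time, so the drill-down searches will match.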

The docs for the categorization_analyzer show an example in the context of a full job config if you need it.
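For instance, something along these lines; this is a minimal sketch, where the job name, bucket span and detector are placeholder choices rather than anything required:

PUT _xpack/ml/anomaly_detectors/deleteme8_categorization
{
  "analysis_config": {
    "bucket_span": "15m",
    "categorization_field_name": "message",
    "categorization_analyzer": {
      "tokenizer": "standard",
      "filter": [
        { "type": "pattern_replace", "pattern": "^[0-9].*" }
      ]
    },
    "detectors": [
      { "function": "count", "by_field_name": "mlcategory" }
    ]
  },
  "data_description": {
    "time_field": "@timestamp"
  }
}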
