Why doesn't this document match?

Badger · August 8, 2018, 7:37pm

I create a document in 6.3.1 using this Logstash configuration:

output { elasticsearch { hosts => "localhost" index => "deleteme8" } }
input { generator { count => 1 message => "Foo: 'UserID:xx0001'" } }

If in Kibana I request "message: UserID" it does not find the document. I found this out when running categorization jobs in ML and the link to show examples would not load any documents. ML is under the impression that that document does contain UserID. There is no template for the index.

I figure ML's expectations of how that document will be analyzed do not match reality. Is this an ML issue or an ES issue?

Mark_Harwood · August 10, 2018, 12:39pm

I think we'd need to drop down a level to see the elasticsearch mapping and the raw JSON doc to make any headway on this.

Badger · August 10, 2018, 1:35pm

A search for * in the index returns this

"hits": [
  {
    "_index": "deleteme8",
    "_type": "doc",
    "_id": "ZyUGG2UBe2VO2Oy1ehDH",
    "_score": 1,
    "_source": {
      "message": "Foo: 'UserID:xx0001'",
      "@timestamp": "2018-08-08T19:32:13.828Z",
      "@version": "1",
      "host": "...",
      "sequence": 0
    }
  }
]

GET deleteme8/_mapping returns this

{
  "deleteme8": {
    "mappings": {
      "doc": {
        "properties": {
          "@timestamp": {
            "type": "date"
          },
          "@version": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "host": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "message": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "sequence": {
            "type": "long"
          }
        }
      }
    }
  }
}

Is that what you are asking for?

Mark_Harwood · August 10, 2018, 1:47pm

Thanks for that. The issue is that the UserID label and its value are indexed as a single term. You can see how a field's text is tokenized using the _analyze API:

POST deleteme8/_analyze
{
  "field":"message",
  "text": "Foo: 'UserID:xx0001'"
}

The output of which is:

{
  "tokens": [
    {
      "token": "foo",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "userid:xx0001",
      "start_offset": 6,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

Searching for message:UserID* works but may not be very efficient if you have lots of unique user ids of this form.
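For comparison, both searches can be run directly against the index. This is only a sketch: it assumes the deleteme8 index from earlier in the thread and uses the query_string query, which accepts the same Lucene syntax as the searches quoted above. Going by what has been reported here, the first request should come back empty, because the analyzed query term is userid while the indexed term is userid:xx0001, and the second should return the document because the wildcard turns it into a prefix match.

```console
# Exact term: analyzed to "userid", which is not a term in the index
GET deleteme8/_search
{
  "query": {
    "query_string": {
      "query": "message:UserID"
    }
  }
}

# Wildcard: matches any term starting with "userid", including "userid:xx0001"
GET deleteme8/_search
{
  "query": {
    "query_string": {
      "query": "message:UserID*"
    }
  }
}
```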

Badger · August 10, 2018, 1:51pm

So the fact that machine learning expects to be able to search for UserID would be an ML defect? As I said, when I click on the link to show examples it returns nothing, because if it categorized that message it would do a search for "message: Foo AND message: UserID" to find examples of it.

Mark_Harwood · August 10, 2018, 2:05pm

Taking a guess I'd say the categorisation job uses a different analyzer to the one used in your search index and ML winds up discovering tokens that can't then be searched for in a drill-down.

I think it would make sense to re-post this in the ML forum as a new "Can't drill down from categorisation job" issue rather than a core elasticsearch matching issue.

droberts195 · August 10, 2018, 2:26pm

Even though it's primarily about categorizing non-English logs, you might find this blog interesting. The key part in relation to what you're suffering from is:

Two things to be aware of when customizing the categorization_analyzer are:

Although techniques such as lowercasing, stemming and decompounding work well for search, for categorizing machine-generated log messages it’s best not to do these things. For example, stemming rules out the possibility of distinguishing “service starting” from “service started”. In human-generated text this could be appropriate, as people use slightly different words when writing about the same thing. But for machine-generated log messages from a given program, different words mean a different message.

The tokens generated by the categorization_analyzer need to be sufficiently similar to those generated by the analyzer used at index time that when you search for them you’ll match the original message. This is required in order for drilldown from category definitions to the original data to work.

It seems that you've run into this without customizing anything because the ml_classic tokenizer splits on colons but the standard tokenizer doesn't.

You can make things work by recreating your job with a custom categorization_analyzer that uses the standard tokenizer. Basically, add a section into your job JSON like this:

    "categorization_analyzer" : {
      "tokenizer" : "standard", 
      "filter" : [
        { "type" : "pattern_replace", "pattern": "^[0-9].*" },
        { "type" : "stop", "stopwords" : [
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ] } 
      ]
    }

(If you know the field you're categorizing doesn't contain dates then you can omit the stop filter to make it more concise and efficient.)

The docs for the categorization_analyzer show an example in the context of a full job config if you need it.
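As a rough illustration of where that section sits, here is a minimal sketch of a job definition for 6.3, where the ML endpoints live under _xpack/ml. The job ID categorize_messages and the 15m bucket span are placeholders chosen for this example, not values from this thread. The stop filter is left out on the assumption that the message field contains no dates. A datafeed pointing at your index would still need to be created separately.

```console
# Hypothetical job ID and bucket span; adjust to your own setup
PUT _xpack/ml/anomaly_detectors/categorize_messages
{
  "analysis_config": {
    "bucket_span": "15m",
    "categorization_field_name": "message",
    "categorization_analyzer": {
      "tokenizer": "standard",
      "filter": [
        { "type": "pattern_replace", "pattern": "^[0-9].*" }
      ]
    },
    "detectors": [
      { "function": "count", "by_field_name": "mlcategory" }
    ]
  },
  "data_description": {
    "time_field": "@timestamp"
  }
}
```

Because the categorization tokens now come from the standard tokenizer, they should line up with the terms the default text mapping puts in the index, which is what the drilldown search relies on.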

system · September 7, 2018, 2:26pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
