Problems with Tokenization


#1

Hi,
I'm seeing some surprising behaviour with the standard analyzer. I'm testing with the Analyze API on Elasticsearch 2.2.0. The request body looks like this:

{
  "analyzer": "standard",
  "tokenizer": "standard",
  "text": "Word,9,3,5,8,1,Another,Word"
}

The output looks like this:

{
  "tokens" : [ {
    "token" : "word",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "<ALPHANUM>",
    "position" : 0
  }, {
    "token" : "9,3,5,8,1",
    "start_offset" : 5,
    "end_offset" : 14,
    "type" : "<NUM>",
    "position" : 1
  }, {
    "token" : "another",
    "start_offset" : 15,
    "end_offset" : 22,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "word",
    "start_offset" : 23,
    "end_offset" : 27,
    "type" : "<ALPHANUM>",
    "position" : 3
  } ]
}

What surprises me is the tokenization of the comma-separated numbers: "9,3,5,8,1" comes back as a single <NUM> token instead of five separate tokens. Our logs have message fields containing comma-separated strings and integers, and we need the tokenization to be more fine-grained.

How can I update the tokenization of our logs to handle this better?

Regards,
David


(tri-man) #2

I tested your string with ES v5.6.1, and all tokens were generated as you expected.

I notice you are using ES v2.2.0, so could you upgrade ES and test again with the latest version?


(Ivan Brusic) #3

If your logs consistently do not insert whitespace around the commas, the
pattern tokenizer with default arguments will tokenize your test case
correctly:

{
  "tokenizer": "pattern",
  "text": "Word,9,3,5,8,1,Another,Word"
}
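
To see what that request should return without a cluster at hand, here is a rough sketch in Python. It assumes the pattern tokenizer's default pattern is \W+ (split on runs of non-word characters) and uses Python's re module as a stand-in for the Java regex engine, so results may differ for non-ASCII text. The names pattern_tokenize, custom_analyzer_settings and the analyzer name comma_logs are made up for illustration. Note that the tokenizer alone does not lowercase, so the sketch also builds a possible index settings body that pairs it with the lowercase filter.

```python
import re

# Assumption: the pattern tokenizer defaults to \W+ as its separator pattern.
# Python's `re` stands in for Java regex here; fine for this ASCII test case.
DEFAULT_PATTERN = r"\W+"


def pattern_tokenize(text, pattern=DEFAULT_PATTERN):
    """Split text on the separator pattern and drop empty tokens."""
    return [token for token in re.split(pattern, text) if token]


def custom_analyzer_settings(name="comma_logs"):
    """Build an index settings body; `name` is a hypothetical analyzer name."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    name: {
                        "type": "custom",
                        "tokenizer": "pattern",
                        # Lowercase afterwards, like the standard analyzer does.
                        "filter": ["lowercase"],
                    }
                }
            }
        }
    }


if __name__ == "__main__":
    print(pattern_tokenize("Word,9,3,5,8,1,Another,Word"))
    # ['Word', '9', '3', '5', '8', '1', 'Another', 'Word']
```

The settings body is only a starting point. If your real log lines also contain characters such as dots or hyphens that you want to keep inside a token, you would need a narrower separator pattern than the default.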

