Problems with Tokenization

dawiro · September 28, 2017, 2:01pm

Hi,
I'm seeing some surprising behaviour with the standard analyzer. I'm testing with the Analyze API on Elasticsearch 2.2.0. The input looks like this:

{
  "analyzer": "standard",
  "tokenizer": "standard",
  "text": "Word,9,3,5,8,1,Another,Word"
}

The output looks like this:

{
  "tokens" : [ {
    "token" : "word",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "<ALPHANUM>",
    "position" : 0
  }, {
    "token" : "9,3,5,8,1",
    "start_offset" : 5,
    "end_offset" : 14,
    "type" : "<NUM>",
    "position" : 1
  }, {
    "token" : "another",
    "start_offset" : 15,
    "end_offset" : 22,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "word",
    "start_offset" : 23,
    "end_offset" : 27,
    "type" : "<ALPHANUM>",
    "position" : 3
  } ]
}

What surprises me is the tokenization of the comma-separated numbers: they come back as a single <NUM> token, "9,3,5,8,1". Our logs have message fields containing comma-separated strings and integers, and we need the tokenization to be more fine-grained.

How can I update the tokenization of our logs to handle this better?

Regards,
David

thn · September 28, 2017, 2:22pm

I tested your string with ES v5.6.1, and all tokens were generated as you expected.

I notice you are using ES v2.2.0. Could you upgrade and test again with the latest version?

Ivan · September 28, 2017, 2:48pm

If your logs consistently do not insert whitespace around the commas, the pattern tokenizer with default arguments will tokenize your test case correctly:

{
  "tokenizer": "pattern",
  "text": "Word,9,3,5,8,1,Another,Word"
}
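
To see what that request should produce without a cluster at hand, here is a rough Python sketch. It assumes the pattern tokenizer's default pattern is \W+ (split on runs of non-word characters) and imitates it with the re module; it is not Elasticsearch code. The function name pattern_tokenize and the analyzer name comma_logs are made up for illustration. Note that the pattern tokenizer on its own does not lowercase, so the settings body at the end adds a lowercase filter to stay close to what the standard analyzer does.

```python
import re


def pattern_tokenize(text, pattern=r"\W+"):
    """Approximate the pattern tokenizer: split on the pattern, drop empty tokens.

    Assumption: the default pattern is \\W+ (any run of non-word characters).
    """
    return [tok for tok in re.split(pattern, text) if tok]


tokens = pattern_tokenize("Word,9,3,5,8,1,Another,Word")
print(tokens)

# The tokenizer alone keeps the original case; a lowercase filter
# would be applied afterwards, roughly like this.
lowercased = [tok.lower() for tok in tokens]
print(lowercased)

# Hypothetical index settings that pair the pattern tokenizer with a
# lowercase filter in a custom analyzer. "comma_logs" is an invented name.
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "comma_logs": {
                    "type": "custom",
                    "tokenizer": "pattern",
                    "filter": ["lowercase"],
                }
            }
        }
    }
}
```

If something like this is used for the message field, every number becomes its own token, so a search for a single value such as 3 can match.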

system · October 26, 2017, 2:49pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
