Hi,
I'm seeing some surprising behaviour from the standard analyzer. I'm testing with the _analyze API on Elasticsearch 2.2.0. The input looks like this:
{
  "analyzer": "standard",
  "tokenizer": "standard",
  "text": "Word,9,3,5,8,1,Another,Word"
}
The output looks like this:
{
  "tokens" : [ {
    "token" : "word",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "<ALPHANUM>",
    "position" : 0
  }, {
    "token" : "9,3,5,8,1",
    "start_offset" : 5,
    "end_offset" : 14,
    "type" : "<NUM>",
    "position" : 1
  }, {
    "token" : "another",
    "start_offset" : 15,
    "end_offset" : 22,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "word",
    "start_offset" : 23,
    "end_offset" : 27,
    "type" : "<ALPHANUM>",
    "position" : 3
  } ]
}
What surprises me is the tokenization of the comma-separated numbers: "9,3,5,8,1" comes back as a single <NUM> token rather than five separate tokens. We have logs whose message fields contain comma-separated strings and integers, and we need the tokenization to be more fine-grained, with each value as its own token.
How can I change the analysis of these fields to handle this better?
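One idea I've been looking at, but haven't tested yet, is a custom analyzer built on the pattern tokenizer so that commas are treated as separators. The index name (logs) and analyzer name (log_message) below are just placeholders for illustration:

PUT /logs
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "comma_or_space": {
          "type": "pattern",
          "pattern": "[,\\s]+"
        }
      },
      "analyzer": {
        "log_message": {
          "type": "custom",
          "tokenizer": "comma_or_space",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}

And then test it the same way as above:

GET /logs/_analyze
{
  "analyzer": "log_message",
  "text": "Word,9,3,5,8,1,Another,Word"
}

If I've understood the pattern tokenizer correctly, that should produce word / 9 / 3 / 5 / 8 / 1 / another / word as separate tokens. Would something like this be a reasonable approach, or is there a better way (for example, a char_filter that maps commas to spaces before the standard tokenizer runs)?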
Regards,
David