I am seeing very strange (to me at least) behavior and cannot find clear documentation for it.
GET _analyze
{
  "analyzer": "standard",
  "text": "server.mycopany.com"
}
This produces:
{
  "tokens": [
    {
      "token": "server.mycopany.com",
      "start_offset": 0,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
So far so good. I do not expect the standard tokenizer to break on dots that are not surrounded by whitespace.
But when a digit appears right before a dot:
GET _analyze
{
  "analyzer": "standard",
  "text": "server1.mycopany.com"
}
This happens:
{
  "tokens": [
    {
      "token": "server1",
      "start_offset": 0,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "mycopany.com",
      "start_offset": 8,
      "end_offset": 20,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
- Why is this happening?
- How can I overcome this? I don't want a digit followed by a dot to split the token.