Running the _analyze API with the ngram tokenizer produced this. I think this is what you were referring to way back:
POST _analyze
{
  "tokenizer": "ngram",
  "text": "N-WO-001"
}
This produces:
{
  "tokens": [
    { "token": "N",  "start_offset": 0, "end_offset": 1, "type": "word", "position": 0 },
    { "token": "N-", "start_offset": 0, "end_offset": 2, "type": "word", "position": 1 },
    { "token": "-",  "start_offset": 1, "end_offset": 2, "type": "word", "position": 2 },
    { "token": "-W", "start_offset": 1, "end_offset": 3, "type": "word", "position": 3 },
    { "token": "W",  "start_offset": 2, "end_offset": 3, "type": "word", "position": 4 },
    { "token": "WO", "start_offset": 2, "end_offset": 4, "type": "word", "position": 5 },
    { "token": "O",  "start_offset": 3, "end_offset": 4, "type": "word", "position": 6 },
    { "token": "O-", "start_offset": 3, "end_offset": 5, "type": "word", "position": 7 },
    { "token": "-",  "start_offset": 4, "end_offset": 5, "type": "word", "position": 8 },
    { "token": "-0", "start_offset": 4, "end_offset": 6, "type": "word", "position": 9 },
    { "token": "0",  "start_offset": 5, "end_offset": 6, "type": "word", "position": 10 },
    { "token": "00", "start_offset": 5, "end_offset": 7, "type": "word", "position": 11 },
    { "token": "0",  "start_offset": 6, "end_offset": 7, "type": "word", "position": 12 },
    { "token": "01", "start_offset": 6, "end_offset": 8, "type": "word", "position": 13 },
    { "token": "1",  "start_offset": 7, "end_offset": 8, "type": "word", "position": 14 }
  ]
}
Exactly what I want. Now I have to figure out how to define a custom analyzer that uses this tokenizer.
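For anyone following along, this is a rough sketch of what I think the index definition would look like. The index name my-ngram-index, the field work_order_id, and the my_wo_* names are placeholders I made up, and min_gram/max_gram just mirror the defaults shown above:

PUT my-ngram-index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_wo_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 2
        }
      },
      "analyzer": {
        "my_wo_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "my_wo_ngram_tokenizer"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "work_order_id": {
        "type": "text",
        "analyzer": "my_wo_ngram_analyzer"
      }
    }
  }
}

If that's right, the analyzer can be sanity-checked against the index with the same _analyze call:

POST my-ngram-index/_analyze
{
  "analyzer": "my_wo_ngram_analyzer",
  "text": "N-WO-001"
}

which should return the same token list as above.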
How performant is ngram-based tokenization?