cawoodm
(Marc)
March 26, 2020, 8:17pm
1
We want our edge_ngram search to only break words on whitespace.
If we use:
"tokenizer": {
"token_chars": ["letter", "digit", "punctuation", "symbol"],
"min_gram": "2",
"type": "edge_ngram",
"max_gram": "20"
}
Then it breaks words on characters not considered symbols (like =).
How can we use the edge_ngram tokenizer so that it only tokenizes (word breaks) on whitespace?
Note: "token_chars": ["whitespace"],
produces no tokens at all.
Test case:
"some foo/text with=this.kind?of_thing"
should produce ["some", "foo/text", "with=this.kind?of_thing"]
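The behaviour is easy to reproduce with the _analyze API; a sketch, passing the tokenizer definition above inline:

POST _analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 2,
    "max_gram": 20,
    "token_chars": ["letter", "digit", "punctuation", "symbol"]
  },
  "text": "some foo/text with=this.kind?of_thing"
}

The grams stop at the = because it is not matched by any of the listed character classes, so "with" and "this.kind?of_thing" come back as separate tokens.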
Bonus question: which characters are considered symbols? It's not obvious from the source code.
cawoodm
(Marc)
March 27, 2020, 9:05am
2
OK, I found a solution using the rather poorly documented custom_token_chars setting:
"tokenizer": {
"token_chars": ["letter", "digit", "punctuation", "symbol"],
"custom_token_chars": ["="],
"min_gram": "2",
"type": "edge_ngram",
"max_gram": "20"
}
I can confirm that at least the following characters are already covered by the punctuation and symbol classes above:
/ * - + „ “ * % & ! . , ; : ( ) ° ß ø
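For reference, the edge_ngram tokenizer docs express the same thing by adding the custom character class to token_chars and passing custom_token_chars as a plain string; a sketch with placeholder index and tokenizer names:

PUT my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": ["letter", "digit", "punctuation", "symbol", "custom"],
          "custom_token_chars": "="
        }
      }
    }
  }
}

As far as I can tell, custom_token_chars only arrived in Elasticsearch 7.6, which would explain the thin documentation.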
spinscale
(Alexander Reelsen)
March 30, 2020, 2:52pm
3
Dumb question: if you only want to tokenize on whitespace, why not use a whitespace tokenizer? I guess there is some more logic happening on your side?
cawoodm
(Marc)
March 31, 2020, 10:14am
4
Indeed, I was unaware of how to combine filters, analyzers, and tokenizers, but I got some feedback on GitHub which resulted in these settings:
"settings": {
"index": {
"analysis": {
"analyzer": {
"whitesp": {
"tokenizer": "whitespace",
"filter": [
"lowercase"
]
},
"edgegram": {
"tokenizer": "whitespace",
"filter": [
"lowercase",
"edgegram"
]
}
},
"filter": {
"edgegram": {
"min_gram": "2",
"type": "edge_ngram",
"max_gram": "20"
}
}
}
}
},
"mappings": {
"dynamic": "strict",
"properties": {
"text_en": {
"search_analyzer": "whitesp",
"analyzer": "edgegram",
"type": "text"
},
"text_de": {
"search_analyzer": "whitesp",
"analyzer": "edgegram",
"type": "text"
}
...
As I understand it, this ensures the index analyzer tokenizes on whitespace and then applies the edge n-gram filter, whilst the search analyzer also tokenizes on whitespace but skips the n-gram step.
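A quick way to verify both analyzers, assuming the index above was created as (the hypothetical) my_index:

POST my_index/_analyze
{
  "analyzer": "edgegram",
  "text": "some foo/text"
}

POST my_index/_analyze
{
  "analyzer": "whitesp",
  "text": "some foo/text"
}

The first call should return the edge n-grams of each whitespace-separated token (so, som, some, fo, foo, foo/, ...); the second just the lowercased whole tokens (some, foo/text).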