Hello,
I am writing some custom analyzers to handle the specific special characters I care about while still using the standard tokenizer (I can't use the whitespace or pattern tokenizers for my documents).
This is the scenario I'm trying to get working:
A document contains the text AT&amp;T, which gets character mapped to ATAMPERSAND_SYMBOLampSEMICOLON_SYMBOLT via the index_analyzer (this should be a single token with the standard tokenizer).
Then, by searching AT&T (not AT&amp;T), the same document will be returned by converting AT&T to ATAMPERSAND_SYMBOLampSEMICOLON_SYMBOLT with the search_analyzer.
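To make the intent concrete, here is a rough Python sketch of what I expect the two char filters to produce. It only mimics the mapping step with plain character replacement; it does not run the standard tokenizer or touch Elasticsearch, and the names INDEX_MAPPINGS, SEARCH_MAPPINGS and apply_mappings are made up for illustration. The indexed text is assumed to be the entity-encoded form AT&amp;T and the searched text the plain form AT&T.

```python
# Hypothetical stand-ins for the two mapping char filters defined in the
# index settings; this is plain string replacement, not Elasticsearch.
INDEX_MAPPINGS = {"&": "AMPERSAND_SYMBOL", ";": "SEMICOLON_SYMBOL"}
SEARCH_MAPPINGS = {"&": "AMPERSAND_SYMBOLampSEMICOLON_SYMBOL"}


def apply_mappings(text, mappings):
    """Replace each mapped character in a single left-to-right pass."""
    return "".join(mappings.get(ch, ch) for ch in text)


# Assumed inputs: the stored document holds the entity-encoded text,
# while the user types the plain ampersand in the search box.
indexed = apply_mappings("AT&amp;T", INDEX_MAPPINGS)
searched = apply_mappings("AT&T", SEARCH_MAPPINGS)

print(indexed)
print(searched)
print(indexed == searched)
```

If the real char filters behave like this, both sides should end up as the same single string before tokenization.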
Here's my Sense code trying this out, but the wrong document is being returned... Any help is much appreciated!
The custom analyzers and mapping:
PUT /index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "TextIndexAnalyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [ "SpecialCharactersIndexFilter" ]
        },
        "TextSearchAnalyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [ "SpecialCharactersSearchFilter" ]
        }
      },
      "char_filter": {
        "SpecialCharactersIndexFilter": {
          "type": "mapping",
          "mappings": [
            "&=>AMPERSAND_SYMBOL",
            ";=>SEMICOLON_SYMBOL"
          ]
        },
        "SpecialCharactersSearchFilter": {
          "type": "mapping",
          "mappings": [
            "&=>AMPERSAND_SYMBOLampSEMICOLON_SYMBOL"
          ]
        }
      }
    }
  },
  "mappings": {
    "Doc": {
      "properties": {
        "Text": {
          "type": "string",
          "index_analyzer": "TextIndexAnalyzer",
          "search_analyzer": "TextSearchAnalyzer"
        }
      }
    }
  }
}
Two example documents:
PUT /index/Doc/1
{
  "Doc.Text": "AT&amp;T"
}
PUT /index/Doc/2
{
  "Doc.Text": "AT&T"
}
I want to get document _id: 1 with this search, but _id: 2 is returned:
GET /index/Doc/_search
{
  "query": {
    "match_phrase": {
      "Doc.Text": "AT&T"
    }
  }
}
Showing that the two analyzers return the same tokens:
POST /index/_analyze?analyzer=TextIndexAnalyzer
{
  AT&amp;T
}
POST /index/_analyze?analyzer=TextSearchAnalyzer
{
  AT&T
}
Thanks again for any help!