Hi
I'm trying to implement a filter that matches compound words, and I need to combine it with wildcard search, but it doesn't work.
Example case:
There are documents with the names "Coca-Cola", "CocaCola" and "Coca Cola". I want all of them returned when searching for any one of these names.
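My mapping and analyzer settings:
- "whitespace" tokenizer to create a token for each word
- "shingle" filter to combine adjacent words into compounds
- "pattern_replace" filter to remove spaces and any other character that is not a letter or digit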
PUT /testindex
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "test_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "test_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "shingle",
            "remove_special_characters_filter"
          ]
        }
      },
      "filter": {
        "remove_special_characters_filter": {
          "type": "pattern_replace",
          "pattern": """[^\p{L}\p{Nd}]""",
          "replacement": ""
        }
      }
    }
  }
}
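To double-check what I expect this analyzer to emit, here is a rough Python sketch of the chain (whitespace, lowercase, shingle, pattern_replace). It is only an approximation, not Elasticsearch itself: the `analyze` helper is hypothetical, and I am assuming the shingle filter runs with its defaults (unigrams kept, two-word shingles joined by a space).

```python
import unicodedata


def analyze(text):
    """Approximate test_analyzer: whitespace -> lowercase -> shingle -> pattern_replace."""
    # whitespace tokenizer + lowercase filter
    words = text.lower().split()

    # shingle filter (assumed defaults): keep each word and add two-word shingles
    tokens = []
    for i, word in enumerate(words):
        tokens.append(word)
        if i + 1 < len(words):
            tokens.append(word + " " + words[i + 1])

    # pattern_replace with [^\p{L}\p{Nd}]: drop everything that is not
    # a Unicode letter (L*) or a decimal digit (Nd)
    def strip_special(token):
        return "".join(
            ch for ch in token
            if unicodedata.category(ch).startswith("L")
            or unicodedata.category(ch) == "Nd"
        )

    return [strip_special(token) for token in tokens]


for name in ["Coca-Cola", "CocaCola", "Coca Cola"]:
    print(name, "->", analyze(name))
```

If this sketch is right, every document ends up with the term "cocacola", and "Coca Cola" additionally gets "coca" and "cola". The real output can be checked with the _analyze API on the index.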
Creating documents:
PUT /testindex/_doc/1
{
  "name": "Coca-Cola"
}
PUT /testindex/_doc/2
{
  "name": "CocaCola"
}
PUT /testindex/_doc/3
{
  "name": "Coca Cola"
}
Searching:
A match query hits all documents:
GET /testindex/_search
{
  "query": {
    "match": {
      "name": "Coca-Cola"
    }
  }
}
A wildcard query hits no documents (nor does the same value without the wildcards):
GET /testindex/_search
{
  "query": {
    "wildcard": {
      "name": {
        "value": "*Coca-Cola*"
      }
    }
  }
}
Removing the hyphen makes the wildcard query hit all documents:
GET /testindex/_search
{
  "query": {
    "wildcard": {
      "name": {
        "value": "*CocaCola*"
      }
    }
  }
}
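One possible explanation, which I have not confirmed: the wildcard value seems to be lowercased but not run through the shingle and pattern_replace filters, so the hyphen stays in the pattern while the indexed terms no longer contain one. The Python sketch below imitates that. The term lists are what I expect the analyzer to index (an assumption), `wildcard_hits` is a hypothetical helper, and `fnmatchcase` only stands in for Elasticsearch's wildcard matching.

```python
from fnmatch import fnmatchcase

# Terms I expect test_analyzer to index per document (assumption, not verified).
INDEXED_TERMS = {
    1: ["cocacola"],                  # "Coca-Cola"
    2: ["cocacola"],                  # "CocaCola"
    3: ["coca", "cocacola", "cola"],  # "Coca Cola"
}


def wildcard_hits(value):
    """Return ids of documents with at least one term matching the pattern."""
    # Assumption: the pattern is only lowercased; the hyphen is not removed.
    pattern = value.lower()
    return [
        doc_id
        for doc_id, terms in INDEXED_TERMS.items()
        if any(fnmatchcase(term, pattern) for term in terms)
    ]


print(wildcard_hits("*Coca-Cola*"))  # prints []
print(wildcard_hits("*CocaCola*"))   # prints [1, 2, 3]
```

This reproduces the results I see from the two wildcard queries above, but it does not prove that this is what Elasticsearch does internally.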
Any suggestions?