I'm trying to handle the case where a user's search term contains an extra space, or is missing a space that the indexed text has. To do this, I used the shingle filter with an empty token separator, so that each pair of adjacent words is also emitted as a single token with the space between them removed. For example, if a field in the document is "some phrase", the tokens should include "some", "somephrase", and "phrase", allowing the user to search for "somephrase" without the space and still match that document. However, I think I'm misunderstanding exactly how this filter works, because I'm not seeing the behavior I expect when I use a simple_query_string query against this field.
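For reference, my understanding of the filter can be checked in isolation with the _analyze API; this is a sketch of such a request body (the inline filter definition here is mine, not copied from my index):

```json
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    { "type": "shingle", "token_separator": "" }
  ],
  "text": "some phrase"
}
```

With the shingle size left at its default of 2 and unigram output enabled (also the default), I'd expect this to produce the tokens "some", "somephrase", and "phrase".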
Here is the relevant part of the index settings and mappings (other fields, analyzers, and filters stripped out for clarity):
{
  "settings": {
    ...,
    "analysis": {
      "analyzer": {
        ...,
        "company_name_analyzer_shingled": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "company_suffix_synonym_filter",
            "company_ownership_filter",
            "company_alias_filter",
            "shingle_filter"
          ],
          "char_filter": [
            "apostrophes"
          ]
        },
        ...
      },
      "filter": {
        ...,
        "company_alias_filter": {
          "type": "synonym",
          "synonyms": {
            "$ref": "../synonyms/company_aliases.json"
          }
        },
        "company_ownership_filter": {
          "type": "synonym",
          "synonyms": {
            "$ref": "../synonyms/company_ownership.json"
          }
        },
        "company_suffix_synonym_filter": {
          "type": "synonym",
          "synonyms": {
            "$ref": "../synonyms/company_suffixes.json"
          }
        },
        ...,
        "shingle_filter": {
          "type": "shingle",
          "token_separator": ""
        }
      },
      "char_filter": {
        ...,
        "apostrophes": {
          "type": "mapping",
          "mappings": [
            "\\u2018=>",
            "\\u2019=>",
            "\\u201B=>",
            "\\u0027=>"
          ]
        },
        ...
      }
    }
  },
  "mappings": {
    "product_root": {
      "properties": {
        ...,
        "products": {
          "type": "nested",
          "properties": {
            ...,
            "manufacturer": {
              "type": "object",
              "properties": {
                ...,
                "name": {
                  "type": "text",
                  "analyzer": "company_name_analyzer",
                  "fields": {
                    ...,
                    "shingled": {
                      "type": "text",
                      "analyzer": "company_name_analyzer_shingled"
                    }
                  }
                },
                ...
              }
            },
            ...
          }
        },
        ...
      }
    }
  }
}
One of the documents I indexed looks like this:
/srv # curl http://elasticsearch-master:9200/product_roots/product_root/1330748454?pretty
{
  ...
  "_source" : {
    ...,
    "products" : [
      {
        ...,
        "manufacturer" : {
          ...,
          "name" : "Celltreat",
          ...
        },
        ...
      }
    ],
    ...
  }
}
Running the relevant analyzer on that text gives me:
/srv # curl --data '{"analyzer":"company_name_analyzer_shingled","text":"Celltreat"}' http://elasticsearch-master:9200/product_roots/_analyze?pretty
{
  "tokens" : [
    {
      "token" : "celltreat",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}
However, running this search, with an extra space in the query text, does not return that document:
{
  "query": {
    "bool": {
      "must": {
        "nested": {
          "path": "products",
          "query": {
            "bool": {
              "must": {
                "simple_query_string": {
                  "fields": [
                    "products.manufacturer.name.shingled"
                  ],
                  "default_operator": "OR",
                  "flags": "OR|AND|NOT|PHRASE|PRECEDENCE|ESCAPE|WHITESPACE|FUZZY",
                  "lenient": true,
                  "query": "Cell treat"
                }
              }
            }
          }
        }
      }
    }
  }
}
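As a point of comparison, this is a sketch of what I believe is an equivalent search expressed as a plain match query on the same field (simplified; the bool wrappers from my real query are dropped here):

```json
{
  "query": {
    "nested": {
      "path": "products",
      "query": {
        "match": {
          "products.manufacturer.name.shingled": "Cell treat"
        }
      }
    }
  }
}
```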
Analyzing the query text with the same analyzer gives these tokens:
/srv # curl --data '{"analyzer":"company_name_analyzer_shingled","text":"Cell treat"}' http://elasticsearch-master:9200/product_roots/_analyze?pretty
{
  "tokens" : [
    {
      "token" : "cell",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "celltreat",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "shingle",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "treat",
      "start_offset" : 5,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
Because both token streams contain the token "celltreat", I expected that query to match the field, but when I run it Elasticsearch returns no matching documents. I'm sure this is just something I'm misunderstanding about the shingle filter, but I'm hoping someone can point me to exactly what that is so I can fix it.
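In case it helps clarify where my reasoning goes wrong, this is my mental model of what the shingle filter does, as a toy Python sketch (my own simplification, not the actual Lucene implementation):

```python
def shingle(tokens, token_separator="", max_shingle_size=2, output_unigrams=True):
    """Toy model of a shingle token filter: emit each original token
    plus every run of up to max_shingle_size adjacent tokens, joined
    by token_separator."""
    out = []
    for i, tok in enumerate(tokens):
        if output_unigrams:
            out.append(tok)
        for n in range(2, max_shingle_size + 1):
            if i + n <= len(tokens):
                out.append(token_separator.join(tokens[i:i + n]))
    return out

# Query side: "Cell treat" tokenizes (after lowercasing) to two tokens,
# so a joined shingle "celltreat" is emitted alongside the unigrams.
print(shingle(["cell", "treat"]))  # ['cell', 'celltreat', 'treat']

# Index side: "Celltreat" is a single token, so no shingles are added.
print(shingle(["celltreat"]))      # ['celltreat']
```

Under this model both sides share the term "celltreat", which is why I expected the query to match.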
Thank you.