I have used the following settings to index content and generate bigrams.
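from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local node; adjust as needed

es.indices.create(
    index="shingles",
    body={
        "settings": {
            "analysis": {
                "analyzer": {
                    "shingle_analyzer": {
                        "tokenizer": "standard",
                        # Filters run in order: stop words are removed first,
                        # then bigrams are built, then filler shingles are blanked.
                        "filter": [
                            "lowercase",
                            "stop",
                            "shingle_filter",
                            "trim",
                            "kill_filler",
                        ],
                    }
                },
                "filter": {
                    # Emit only two-word shingles (bigrams).
                    "shingle_filter": {
                        "type": "shingle",
                        "max_shingle_size": 2,
                        "min_shingle_size": 2,
                        "output_unigrams": "false",
                        "output_unigrams_if_no_shingles": "true",
                        "enable_position_increments": "false",
                    },
                    # Blank out shingles containing the "_" filler that marks
                    # a removed stop word.
                    "kill_filler": {
                        "type": "pattern_replace",
                        "pattern": ".*_.*",
                        # The parameter is "replacement", not "replace".
                        "replacement": "",
                    },
                },
            }
        },
        "mappings": {
            "properties": {
                "my_join_field": {
                    "type": "join",
                    "relations": {"document": "page"},
                }
            }
        },
    },
)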
When I run the analyze API on the string "Word1 Word2 StopWord1 StopWord2 Word3 Word4", I get the expected four shingles:
i) Word1 Word2
ii) ""
iii) ""
iv) Word3 Word4
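For reference, the analyze call looks roughly like the sketch below. The helper names `build_analyze_body` and `extract_tokens` are illustrative only, and the commented-out call assumes an `es` client connected to a running cluster.

```python
def build_analyze_body(text, analyzer="shingle_analyzer"):
    """Build the request body for the _analyze API."""
    return {"analyzer": analyzer, "text": text}


def extract_tokens(response):
    """Pull the token strings out of an _analyze response."""
    return [t["token"] for t in response.get("tokens", [])]


body = build_analyze_body("Word1 Word2 StopWord1 StopWord2 Word3 Word4")

# With a live cluster (assumed; `es` is an Elasticsearch client instance):
# response = es.indices.analyze(index="shingles", body=body)
# print(extract_tokens(response))
```

Note that because of the `lowercase` filter, the tokens actually come back lowercased.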
What query can I use to retrieve content matching i) and/or iv)?
I have not been able to get match_phrase, or a bool query with must/match clauses, to work.
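To make the question concrete, the two query shapes I mean are sketched below. The field name `content` is a placeholder (the mapping above does not define a text field that uses `shingle_analyzer`), and the helper names are illustrative, so treat this as a rough sketch rather than my exact code.

```python
def match_phrase_query(field, phrase):
    """Body for a match_phrase query on a single field."""
    return {"query": {"match_phrase": {field: phrase}}}


def bool_must_match_query(field, phrases):
    """Body for a bool query requiring a match clause per phrase."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {field: p}} for p in phrases]
            }
        }
    }


# "content" is a hypothetical field name, not part of the mapping above.
phrase_body = match_phrase_query("content", "Word1 Word2")
bool_body = bool_must_match_query("content", ["Word1 Word2", "Word3 Word4"])

# With a live cluster (assumed; `es` is an Elasticsearch client instance):
# es.search(index="shingles", body=phrase_body)
# es.search(index="shingles", body=bool_body)
```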