Is there a way to keep only the longest token when two or more tokens cover the same positions?
For example, if I define "fox" and "quick fox" as keep words, both are returned when analyzing the sentence "The quick fox jumped over the lazy dog." Since "quick fox" is more specific and already covers "fox", I would like to keep only "quick fox" and discard "fox".
My index settings and the _analyze request I am testing with:
curl -X PUT "localhost:9200/my-index-000001?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_folded": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "custom_shingle",
            "keepMe"
          ]
        }
      },
      "filter": {
        "keepMe": {
          "type": "keep",
          "keep_words": [ "quick fox", "fox" ]
        },
        "custom_shingle": {
          "type": "shingle",
          "min_shingle_size": "2",
          "max_shingle_size": "3",
          "output_unigrams_if_no_shingles": true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_text": {
        "type": "text",
        "analyzer": "std_folded"
      }
    }
  }
}
'
curl -X POST "localhost:9200/my-index-000001/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "analyzer": "std_folded",
  "text": "The quick fox jumped over the lazy dog."
}
'
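To make the behaviour I am after concrete, here is a minimal client-side sketch in Python of the filtering I have in mind. It is not an Elasticsearch feature, just a post-processing step over the token list that _analyze returns. The function name keep_longest and the sample tokens are my own, written by hand in the shape of the _analyze response (token, start_offset, end_offset, position); the offsets are what I would expect for this sentence, not copied from real output.

```python
from typing import Dict, List

Token = Dict[str, object]


def keep_longest(tokens: List[Token]) -> List[Token]:
    """Drop every token whose character span lies inside a longer kept token.

    Hypothetical helper: assumes each token dict has integer
    "start_offset" and "end_offset" keys, as in the _analyze response.
    """
    kept: List[Token] = []
    # Visit the longest spans first so a covering token is always kept
    # before the shorter tokens it covers are examined.
    by_length = sorted(
        tokens,
        key=lambda t: t["end_offset"] - t["start_offset"],
        reverse=True,
    )
    for tok in by_length:
        covered = any(
            k["start_offset"] <= tok["start_offset"]
            and tok["end_offset"] <= k["end_offset"]
            for k in kept
        )
        if not covered:
            kept.append(tok)
    # Restore document order for the caller.
    return sorted(kept, key=lambda t: t["start_offset"])


# Hand-written sample shaped like the "tokens" array from _analyze.
sample_tokens = [
    {"token": "quick fox", "start_offset": 4, "end_offset": 13, "position": 1},
    {"token": "fox", "start_offset": 10, "end_offset": 13, "position": 2},
]

print([t["token"] for t in keep_longest(sample_tokens)])
```

With the sample above, "fox" (offsets 10 to 13) falls inside "quick fox" (offsets 4 to 13) and is dropped. Two tokens with exactly the same span are treated as duplicates and only one survives. What I would really like is to get this result inside the analyzer itself, at index time, rather than in application code.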