I have an index with a field that uses an N-gram tokenizer to provide partial-match search across documents. I have a requirement that if the length of the text to be indexed exceeds a threshold (say 300 characters), it should be indexed as a different property in the same index, that is, without the N-gram tokenizer, but still supporting full-text and phrase queries.
I want to search across both fields; an example of the kind of query I have in mind is shown after the mapping below. The problem I ran into is picking the right tokenizer to index content that is longer than the threshold. The tokenizer I am looking for should split text as follows:
"text":"The quick brown fox"
"tokens" : "The , quick, brown, fox, The quick, quick brown, brown fox, The quick brown, quick brown fox, The quick brown fox"
I have developed the following index mappings and settings and am looking for a tokenizer for the value-meta field.
{
  "mappings": {
    "message": {
      "properties": {
        "title": {
          "analyzer": "ngram-index_analyzer",
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          },
          "search_analyzer": "search_analyzer",
          "type": "text"
        },
        "value": {
          "analyzer": "ngram-index_analyzer",
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          },
          "search_analyzer": "search_analyzer",
          "type": "text"
        },
        "value-meta": {
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          },
          "type": "text"
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "ngram-index_analyzer": {
          "filter": [
            "lowercase"
          ],
          "tokenizer": "ngram-index_analyzer"
        },
        "search_analyzer": {
          "filter": [
            "lowercase"
          ],
          "tokenizer": "keyword"
        }
      },
      "tokenizer": {
        "ngram-index_analyzer": {
          "max_gram": "30",
          "min_gram": "2",
          "token_chars": [
            "whitespace",
            "digit",
            "letter",
            "punctuation",
            "symbol"
          ],
          "type": "ngram"
        }
      }
    },
    "number_of_shards": "20"
  }
}
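For context, the kind of query I expect to run across both fields would be roughly the following sketch (the query text "quick brown" is only an example):

{
  "query": {
    "multi_match": {
      "query": "quick brown",
      "type": "phrase",
      "fields": [
        "value",
        "value-meta"
      ]
    }
  }
}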