Tokenizer to get combinations of words


(Keerthika Raj) #1

I have an index with a field that uses an N-gram tokenizer to provide partial search across documents. I have a requirement: if the text to be indexed is longer than a threshold (say 300), I want to index it as a different property in the same index. That property should not use the N-gram tokenizer, but it should still support full-text and phrase queries.

I want to search across both fields. The problem I ran into is picking the right tokenizer for content that is longer than the threshold. The tokenizer I am looking for should split text as follows.

"text":"The quick brown fox"
"tokens" : "The , quick, brown, fox, The quick, quick brown, brown fox, The quick brown, quick brown fox, The quick brown fox"

I have developed the following index mappings and settings, and I am looking for a tokenizer for the value-meta field.

{
  "mappings": {
    "message": {
      "properties": {
        "title": {
          "analyzer": "ngram-index_analyzer",
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          },
          "search_analyzer": "search_analyzer",
          "type": "text"
        },
        "value": {
          "analyzer": "ngram-index_analyzer",
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          },
          "search_analyzer": "search_analyzer",
          "type": "text"
        },
        "value-meta": {
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          },
          "type": "text"
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "ngram-index_analyzer": {
          "filter": [
            "lowercase"
          ],
          "tokenizer": "ngram-index_analyzer"
        },
        "search_analyzer": {
          "filter": [
            "lowercase"
          ],
          "tokenizer": "keyword"
        }
      },
      "tokenizer": {
        "ngram-index_analyzer": {
          "max_gram": "30",
          "min_gram": "2",
          "token_chars": [
            "whitespace",
            "digit",
            "letter",
            "punctuation",
            "symbol"
          ],
          "type": "ngram"
        }
      }
    },
    "number_of_shards": "20"
  }
}
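
For reference, the token list in the example above (each word plus every run of adjacent words) corresponds to what Elasticsearch's shingle token filter emits, rather than to a tokenizer on its own. The fragment below is only a sketch of that idea: the names `word_shingles` and `shingle_analyzer` are hypothetical, and the sizes are chosen to cover the four-word example, not taken from this thread.

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "word_shingles": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 4,
          "output_unigrams": true
        }
      },
      "analyzer": {
        "shingle_analyzer": {
          "tokenizer": "whitespace",
          "filter": [
            "word_shingles"
          ]
        }
      }
    }
  }
}
```

With `output_unigrams` enabled, the single words are kept alongside the two- to four-word combinations. On long text this multiplies the number of indexed tokens, so the maximum shingle size is worth keeping small.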

(Keerthika Raj) #2

I achieved this by using the whitespace analyzer as both the index analyzer and the search analyzer for the value-meta field.
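
The final mapping was not posted, so the fragment below is an assumption about what the value-meta property might look like with this change, based on the original mapping and the built-in `whitespace` analyzer.

```json
"value-meta": {
  "analyzer": "whitespace",
  "search_analyzer": "whitespace",
  "fields": {
    "keyword": {
      "type": "keyword"
    }
  },
  "type": "text"
}
```

Phrase queries match on token positions, so the word combinations do not need to be indexed as separate tokens for "quick brown fox" to match. Note that the whitespace analyzer does not lowercase, so matching on this field is case-sensitive unless a custom analyzer with a lowercase filter is used instead.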


(system) #3

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.