Hi,
I am trying to create a tokenizer that will produce tokens
looking something like this:
"ab c dd c" would be tokenized as "ab", "abc", "abcd", "abcdd", "abcddc",
"cd", "cdd", "cddc", "dd", "ddc"
so basically I need something that does n-gram indexing starting from
the beginning of each token and extending to the end of the string. This is different
from edge n-gram, which tokenizes each token separately.
Any ideas on how to do this without coding a custom tokenizer?
You can use a mapping char filter to remove the whitespace and then an ngram tokenizer with min_gram=2 and a large enough max_gram to produce the n-grams.
(not sure if you’d like to omit “bc”, “bcd”… or not though)
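For reference, here is a minimal sketch of that analysis chain as index settings. The index name and the char filter/tokenizer/analyzer names are placeholders, and the max_gram value of 20 is an assumption (pick it large enough to cover your longest input); on recent Elasticsearch versions you also need index.max_ngram_diff raised to allow such a wide gram range:

PUT prefix_ngram_index
{
  "settings": {
    "max_ngram_diff": 18,
    "analysis": {
      "char_filter": {
        "strip_spaces": {
          "type": "mapping",
          "mappings": [ "\\u0020=>" ]
        }
      },
      "tokenizer": {
        "all_ngrams": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 20
        }
      },
      "analyzer": {
        "prefix_ngram_analyzer": {
          "char_filter": [ "strip_spaces" ],
          "tokenizer": "all_ngrams"
        }
      }
    }
  }
}

You can check the output with the _analyze API:

POST prefix_ngram_index/_analyze
{
  "analyzer": "prefix_ngram_analyzer",
  "text": "ab c dd c"
}

As noted above, this emits every substring of "abcddc" of length 2 to 20, so grams like "bc" and "bcd" will appear alongside the ones you listed.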