Hi,
I am trying to create a tokenizer that will produce tokens
looking something like this:
"ab c dd c" would be tokenized as "ab", "abc", "abcd", "abcdd", "abcddc",
"cd", "cdd", "cddc", "dd", "ddc"
so basically I need something that does n-gram indexing starting from
the beginning of each token and extending to the end of the string. This is different
from edge n-gram, which tokenizes each token separately.
Any ideas on how to do this without coding a custom tokenizer?
You can use a mapping char filter to remove the whitespace and then an ngram tokenizer with min_gram=2 and a large enough max_gram to produce the n-grams.
(not sure if you’d like to omit “bc”, “bcd”… or not though)
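For reference, here is a minimal sketch of that analysis chain as index settings. The index name and the char filter/tokenizer/analyzer names are placeholders, and the max_gram value of 20 is an assumption (pick it large enough to cover your longest input); on recent Elasticsearch versions you also need index.max_ngram_diff raised to allow such a wide gram range:

PUT prefix_ngram_index
{
  "settings": {
    "max_ngram_diff": 18,
    "analysis": {
      "char_filter": {
        "strip_spaces": {
          "type": "mapping",
          "mappings": [ "\\u0020=>" ]
        }
      },
      "tokenizer": {
        "all_ngrams": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 20
        }
      },
      "analyzer": {
        "prefix_ngram_analyzer": {
          "char_filter": [ "strip_spaces" ],
          "tokenizer": "all_ngrams"
        }
      }
    }
  }
}

You can check the output with the _analyze API:

POST prefix_ngram_index/_analyze
{
  "analyzer": "prefix_ngram_analyzer",
  "text": "ab c dd c"
}

As noted above, this emits every substring of "abcddc" of length 2 to 20, so grams like "bc" and "bcd" will appear alongside the ones you listed.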