Yes, and even more tokens. I want max_shingle_size to be effectively infinite; that is to say, the longest token should be the whole input text (with dots replaced by spaces).
But it should be possible, given this error message from the server:
In Shingle TokenFilter the difference between max_shingle_size and min_shingle_size (and +1 if outputting unigrams) must be less than or equal to: [3] but was [19]. This limit can be set by changing the [index.max_shingle_diff] index level setting.
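So the limit itself is adjustable, though there is no literal "infinite" value; you would have to pick a max_shingle_size large enough to cover your longest expected input. A minimal sketch at index creation time (the index name my_index and filter name my_shingle are hypothetical; the sizes mirror the error message, since 20 - 2 + 1 = 19 with unigrams enabled):

```json
PUT /my_index
{
  "settings": {
    "index.max_shingle_diff": 19,
    "analysis": {
      "filter": {
        "my_shingle": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 20,
          "output_unigrams": true
        }
      }
    }
  }
}
```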
Given the input text "NP 4.8.1", you won't be able to match the search input "NP 4".
With the standard tokenizer, you obtain two tokens: "NP" and "4.8.1".
On these two tokens, you apply your filters:
"NP" => unchanged => "NP"
"4.8.1" => myfilter => "4 8 1" (processed token)
shingle_filter => it shingles two adjacent tokens, so it generates the new token "NP 4 8 1".
So you have three tokens in all: "NP", "4 8 1" and "NP 4 8 1". The "NP 4" token is missing, so there are no matches!
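You can verify this with the _analyze API. A sketch, assuming the chain above is wired into an analyzer named my_analyzer on a hypothetical index my_index:

```json
POST /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "NP 4.8.1"
}
```

The response should list exactly the three tokens above ("NP", "4 8 1", "NP 4 8 1") and nothing equal to "NP 4".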
With shingle_filter, you have to operate at the tokenizer level, whatever the filter chain is. So I was using the "simple_pattern_split" tokenizer, but then I need to enumerate the word-break characters.
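A sketch of that approach (names hypothetical): the tokenizer splits on spaces and dots, so the shingle filter sees "NP", "4", "8" and "1" as separate tokens and can emit "NP 4". The pattern is a Lucene regular expression and would have to be extended with every other word-break character you need:

```json
PUT /my_index
{
  "settings": {
    "index.max_shingle_diff": 19,
    "analysis": {
      "tokenizer": {
        "word_break_split": {
          "type": "simple_pattern_split",
          "pattern": "[ .]"
        }
      },
      "filter": {
        "my_shingle": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 20,
          "output_unigrams": true
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "word_break_split",
          "filter": ["my_shingle"]
        }
      }
    }
  }
}
```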
It's complicated, because even after increasing the shingle limit you will get more tokens in addition to the four you want, and I don't know if that's what you need.
I agree with you, but certainly far fewer than with the min_gram and max_gram parameters that the NGram tokenizer relies on. Cutting a phrase into words generates far fewer tokens than cutting it into characters, for sure.
The dilemma is: how many words can the user enter in their input? The value of max_shingle_size then follows directly from that.
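As a rough worked count: with max_shingle_size = 5 and unigrams enabled, a 5-word input yields at most 5 + 4 + 3 + 2 + 1 = 15 tokens, whereas an NGram tokenizer with min_gram = 1 and max_gram = 3 over the same text (say 25 characters) yields about 25 + 24 + 23 = 72. So even a generous word limit stays much cheaper than character n-grams.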