How can I get behavior similar to the Word Delimiter Graph token filter, but without the following rules:
• Split tokens at letter case transitions. For example: PowerShot → Power, Shot
• Split tokens at letter-number transitions. For example: XL500 → XL, 500
• Remove the English possessive ('s) from the end of each token. For example: Neil's → Neil
So for the example: "Neil's-Super-Duper-XL500--42+AutoCoder"
instead of these tokens:
[ Neil, Super, Duper, XL, 500, 42, Auto, Coder ]
the analyzer needs to produce these tokens:
[Neil, s, Super, Duper, XL500, 42, AutoCoder]
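For what it's worth, the three rules listed above correspond to options on the `word_delimiter_graph` token filter: `split_on_case_change`, `split_on_numerics` and `stem_english_possessive`. Below is a minimal sketch of index settings with all three switched off, written as a Python dict so it can be dumped to JSON. The filter name `wdg_keep_case_and_numbers`, the analyzer name `custom_wdg` and the choice of the `whitespace` tokenizer are assumptions made for illustration, not something taken from the original setup.

```python
import json


def build_index_settings():
    """Return index settings with a word_delimiter_graph filter that
    keeps case transitions, letter-number transitions and possessives."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Hypothetical filter name, chosen for this example.
                    "wdg_keep_case_and_numbers": {
                        "type": "word_delimiter_graph",
                        "split_on_case_change": False,     # PowerShot stays whole
                        "split_on_numerics": False,        # XL500 stays whole
                        "stem_english_possessive": False,  # 's is not stripped
                    }
                },
                "analyzer": {
                    # Hypothetical analyzer name; the tokenizer is an assumption.
                    "custom_wdg": {
                        "type": "custom",
                        "tokenizer": "whitespace",
                        "filter": ["wdg_keep_case_and_numbers"],
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON body that could be sent when creating an index.
    print(json.dumps(build_index_settings(), indent=2))
```

With these settings the filter should still split on non-alphanumeric characters such as `-`, `+` and `'`, which is what leaves the separate `s` token in the wanted output. It is worth confirming the actual tokens with the `_analyze` API before relying on this.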
Thanks @warkolm,
my problem is with searching a text field that contains concatenated text and numbers in the format "{text}{number}", for example DOC0000000009. I don't know how to search on such values with queries like these:
DOC0000000009 : For this I tried SpanTermQuery and MatchQuery, without success
DOC000000004? : For this I tried WildcardQuery, without success
For this text field I set WordDelimiterGraphTokenFilter and a lowercase filter in both the analyzer and the search analyzer.
The queries work only with lowercase letters, like doc0000000009, doc000000004, even if I use MatchQuery with the same analyzer set.
I am only able to execute these queries with QueryStringQuery, but then I cannot build a ProximityQuery, because a proximity query cannot contain a QueryStringQuery. Therefore I can only use proximity queries with lowercased query text.
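One possible workaround, sketched under assumptions: term-level queries such as `span_term` are not analyzed, so if the indexed terms are lowercase, lowercasing the user input on the client side before building the span clauses may be enough. A wildcard can take part in a proximity query by wrapping it in `span_multi` inside `span_near`. The field name `content`, the helper names and the default `slop` below are hypothetical and only serve the example.

```python
import json


def span_clause(field, text):
    """Build one span clause, lowercasing the input to mirror a
    lowercase filter applied at index time (an assumption)."""
    value = text.lower()
    if "*" in value or "?" in value:
        # Wildcards can join a span query through span_multi.
        return {"span_multi": {"match": {"wildcard": {field: {"value": value}}}}}
    return {"span_term": {field: {"value": value}}}


def proximity_query(field, terms, slop=2, in_order=True):
    """Build a span_near query body from raw user terms."""
    return {
        "query": {
            "span_near": {
                "clauses": [span_clause(field, t) for t in terms],
                "slop": slop,
                "in_order": in_order,
            }
        }
    }


if __name__ == "__main__":
    # "content" is a hypothetical field name.
    body = proximity_query("content", ["DOC0000000009", "DOC000000004?"])
    print(json.dumps(body, indent=2))
```

This only helps if the lowercased value exists as a single term in the index. If the word delimiter filter splits DOC0000000009 into separate letter and number tokens, the span clauses would have to target those tokens instead.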