How can I remove only the leading and trailing punctuation from terms when indexing?
The standard tokenizer also splits words like "t-shirt" into "t" and "shirt", which is not what we want in our use case. We do, however, want "t-shirt." to become "t-shirt".
You could use a character filter that changes a hyphen in the middle of a word into an underscore. Underscores are not removed by the standard tokenizer. Something like this:
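A minimal sketch of such an analyzer chain, assuming a Solr schema field type. The field type name text_keep_hyphen, the regular expression, and the lowercase filter are illustrative assumptions, not a tested configuration:

```xml
<!-- Hypothetical field type; the name and the pattern are illustrative assumptions. -->
<fieldType name="text_keep_hyphen" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Turn a hyphen that sits between two word characters into an underscore,
         so "t-shirt." reaches the tokenizer as "t_shirt.". The lookahead leaves
         the following character unconsumed, so "a-b-c" becomes "a_b_c". -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(\w)-(?=\w)" replacement="$1_"/>
    <!-- The standard tokenizer keeps "t_shirt" as one token and drops the trailing "." -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a chain like this, "t-shirt." should be indexed as the single token "t_shirt". If the hyphen has to be restored afterwards, a token filter placed after the tokenizer can map the underscore back, with the caveat that underscores already present in the source text would be converted as well.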
Thanks, that fixed our problem, together with a PatternReplaceFilter as the first token filter to make sure the WordDelimiterFilter keeps working (producing the terms "t-shirt", "t", "shirt" and "tshirt").