On Sun, May 20, 2012 at 4:35 PM, Ivan Brusic ivan@brusic.com wrote:

The standard filter should remove punctuation from tokens. You can use the analysis API to view the differences between analyzers (though unfortunately not between individual tokenizers or filters). The Lucene in Action book has a summary of the different classes.
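For example, something like this (a sketch, not tested against your setup; it assumes a node on localhost:9200 and the query-string form of the analyze API from that era — the index and analyzer names are placeholders):

```sh
# Inspect how the standard analyzer tokenizes a string with embedded dots.
curl -XGET 'localhost:9200/_analyze?analyzer=standard' -d 'p.f. test'

# Or run a custom analyzer defined on a specific index.
curl -XGET 'localhost:9200/myindex/_analyze?analyzer=my_analyzer' -d 'p.f. test'
```

The response lists each emitted token with its position and offsets, which makes it easy to see whether the dots survive analysis.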
I'm still having no luck on this. I've created a more self-contained example of the behavior. In short, I'm storing a document with a field containing:

I'd like ES to provide a match if I search on "p.f.", "p.f", "pf.", or "pf", regardless of case. Currently only the first two work. My approach relies on using the synonym filter to translate all of the forms above to "pf". I'd be happy to fix this approach or, even better, to learn that there's a more general approach that doesn't require as much configuration.

Thanks! --Mike
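Roughly, the approach looks like this in the index settings (a sketch, not my exact configuration; the filter and analyzer names are placeholders):

```json
{
  "analysis": {
    "filter": {
      "pf_synonyms": {
        "type": "synonym",
        "synonyms": ["p.f. => pf", "p.f => pf", "pf. => pf"]
      }
    },
    "analyzer": {
      "synonym_analyzer": {
        "type": "custom",
        "tokenizer": "whitespace",
        "filter": ["lowercase", "pf_synonyms"]
      }
    }
  }
}
```

One catch with this approach is that the synonym filter runs after tokenization, so the tokenizer has to leave the dotted forms intact as single tokens for the rules to ever match them.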
On Wed, May 30, 2012 at 2:21 PM, Ivan Brusic ivan@brusic.com wrote:

It just occurred to me as I was testing things that the standard filter does NOT remove punctuation. My custom filter in Lucene was stripping punctuation, not the standard filter. I was able to remove punctuation by using a mapping char_filter. Mine simply removes dots ('.').
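Something along these lines (a sketch; the char_filter and analyzer names are placeholders, and the mapping rule replaces the dot with an empty string):

```json
{
  "analysis": {
    "char_filter": {
      "dot_strip": {
        "type": "mapping",
        "mappings": [". => "]
      }
    },
    "analyzer": {
      "no_dots": {
        "type": "custom",
        "char_filter": ["dot_strip"],
        "tokenizer": "standard",
        "filter": ["lowercase"]
      }
    }
  }
}
```

Because a char_filter rewrites the text before tokenization, "p.f.", "p.f", "pf.", and "pf" should all reduce to the same token "pf" after lowercasing, with no synonym rules needed.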