Excluding punctuation from fields


(Michael Sick) #1

Hi All,

What is the best way to exclude punctuation (or other specific
characters) from certain fields during analysis and searching, and what
are the tradeoffs?

For example, a document containing "P.F. Changs" should match a search
for either "P.F. Changs" or "PF Changs".

It looks like I could do this with the Synonym filter and the ICU plugin.
Are there better options? Which is best or what are the tradeoffs?

Overall, if anyone knows of any resources that compare/contrast the various
analyzers/filters/..., it would be very helpful.

Thanks for any advice/pointers,

--Mike


(Ivan Brusic) #2

The standard filter should remove punctuation from tokens.

You can use the analysis API to view the differences between analyzers
(and unfortunately not between tokenizers or filters). The Lucene in
Action book has a summary of the different classes.
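
As a rough illustration of the analysis API mentioned above, the sketch below builds (but does not send) an `_analyze` request for a given analyzer and text, so the same input can be compared across analyzers. The host `http://localhost:9200`, the index name `places`, and the helper `build_analyze_request` are assumptions made for this example. The JSON body with `analyzer` and `text` keys is the form accepted by recent Elasticsearch releases; older releases took the analyzer as a query-string parameter instead, so adjust for your version.

```python
import json
import urllib.request


def build_analyze_request(base_url, index, analyzer, text):
    """Build a POST request for <base_url>/<index>/_analyze (not sent here)."""
    url = f"{base_url.rstrip('/')}/{index}/_analyze"
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical host and index name; replace with your own.
    for analyzer in ("standard", "whitespace"):
        req = build_analyze_request(
            "http://localhost:9200", "places", analyzer, "P.F. Changs"
        )
        print(req.get_method(), req.full_url, req.data.decode("utf-8"))
        # Against a live node, urllib.request.urlopen(req) would return
        # the token list produced by that analyzer.
```

Comparing the token lists returned for "P.F. Changs" and "PF Changs" shows directly whether a given analyzer treats the two forms as the same term.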

--
Ivan

On Sat, May 19, 2012 at 8:14 AM, Michael Sick
michael.sick@serenesoftware.com wrote:



(Michael Sick) #3

Ivan,

Thanks! The Analysis API is priceless.

--Mike

On Sun, May 20, 2012 at 4:35 PM, Ivan Brusic ivan@brusic.com wrote:



(Michael Sick) #4

I'm still having no luck with this. I've created a more self-contained
example of the behavior. In short, I'm storing a document with a field
containing:

"P.F. Changs Burgers"

Create & Run Test: https://gist.github.com/2792582
Delete Artifacts: https://gist.github.com/2792590

I'd like ES to return a match if I search on "p.f.", "p.f", "pf.", or "pf",
without regard to case. Currently only the first two work. My approach
relies on using the Synonym filter to translate all of the forms above to
"pf". I'd be happy to fix this approach or, even better, to learn that
there's a general approach that does not require as much configuration.

Thanks! --Mike



(Ivan Brusic) #5

It just occurred to me as I was testing things that the standard filter
does NOT remove punctuation. My custom filter in Lucene was stripping
punctuation, not the standard filter.

I was able to remove punctuation by using a mapping char_filter. Mine
simply removes dots ('.'):

index :
    analysis :
        analyzer :
            unstemmed :
                type : custom
                filter : [unique, standard, asciifolding, lowercase]
                char_filter : [punctuation]
        char_filter :
            punctuation :
                type : mapping
                mappings : [".=>"]
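
To see why stripping dots before tokenization makes "P.F." and "PF" equivalent, here is a small pure-Python simulation of the idea. It is only a sketch, not a reimplementation of the Elasticsearch analysis chain: the `analyze` and `parse_mapping` helpers are made up for this example, whitespace splitting stands in for the tokenizer (the config as posted does not name one), and only the lowercase and unique filters are imitated, with asciifolding left out.

```python
def parse_mapping(rule):
    """Split a rule like '.=>' into (source, replacement)."""
    source, _, target = rule.partition("=>")
    source, target = source.strip(), target.strip()
    if not source:
        raise ValueError(f"mapping rule has no source text: {rule!r}")
    return source, target


def analyze(text, mappings=(".=>",)):
    """Approximate the analyzer: char mapping, split, lowercase, dedupe."""
    # Character filter stage: runs on the raw text, before tokenization.
    for rule in mappings:
        source, target = parse_mapping(rule)
        text = text.replace(source, target)
    # Tokenizer stage (assumption: plain whitespace splitting).
    tokens = text.split()
    # Token filter stage: lowercase, then drop duplicates, keeping order.
    seen, result = set(), []
    for token in (t.lower() for t in tokens):
        if token not in seen:
            seen.add(token)
            result.append(token)
    return result


if __name__ == "__main__":
    print(analyze("P.F. Changs Burgers"))
    for query in ("p.f.", "p.f", "pf.", "pf", "P.F."):
        print(repr(query), "->", analyze(query))
```

Because the dots are removed before any tokens are produced, the indexed text and all four query variants reduce to the same term, so no per-name synonym entries are needed. The tradeoff is that the mapping applies to every dot in the field, including those in numbers, URLs, or abbreviations where the dot may be meaningful.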

On Fri, May 25, 2012 at 11:41 PM, Michael Sick
michael.sick@serenesoftware.com wrote:



(Michael Sick) #6

Thanks Ivan - I'll give that a shot.

On Wed, May 30, 2012 at 2:21 PM, Ivan Brusic ivan@brusic.com wrote:


