Excluding punctuation from fields

Michael_Sick · May 19, 2012, 3:14pm

Hi All,

What's the best way (or tradeoffs) to exclude punctuation (or specific
characters) from certain fields during analysis and searching?

i.e.
In document, "P.F. Changs" would match a search for "P.F. Changs" or "PF
Changs".

It looks like I could do this with the Synonym filter and the ICU plugin.
Are there better options? Which is best or what are the tradeoffs?

Overall, if anyone knows of any resources that compare/contrast the various
analyzers/filters/..., it would be very helpful.

Thanks for any advice/pointers,

--Mike

Ivan · May 20, 2012, 8:35pm

The standard filter should remove punctuation from tokens.

You can use the analysis API to view the differences between analyzers
(and unfortunately not between tokenizers or filters). The Lucene in
Action book has a summary of the different classes.
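
For anyone wanting to try the analysis API mentioned above, here is a minimal Python sketch. It assumes a recent Elasticsearch release where `_analyze` accepts a JSON body with `analyzer` and `text` (older releases took these as query-string parameters), and a local unsecured test node at `localhost:9200`. The helper names `build_analyze_request` and `analyze` are made up for illustration and are not part of any client library.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: a local, unsecured test node


def build_analyze_request(text, analyzer="standard", index=None):
    """Build (url, body) for an _analyze call without sending it."""
    # Index-scoped calls can use analyzers defined in that index's settings.
    path = f"/{index}/_analyze" if index else "/_analyze"
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return ES_URL + path, body


def analyze(text, analyzer="standard", index=None):
    """Send the request and return the token strings from the response."""
    url, body = build_analyze_request(text, analyzer, index)
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return [t["token"] for t in json.load(resp)["tokens"]]
```

Calling `analyze("P.F. Changs", analyzer="standard")` against a running node lists the tokens that analyzer produces, which makes it easy to compare analyzers side by side.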

--
Ivan

Michael_Sick · May 22, 2012, 3:27am

Ivan,

Thanks! The Analysis API is priceless.

--Mike

Michael_Sick · May 26, 2012, 6:41am

I'm still having no luck with this. I've created a more self-contained
example of the behavior. In short, I'm storing a document with a field
containing:

"P.F. Changs Burgers"

Create & Run Test: Index / Search Document with Punctuation in ElasticSearch · GitHub
Delete Artifacts: Delete Example Alias, Index, Template · GitHub

I'd like ES to return a match if I search on "p.f.", "p.f", "pf.", or "pf",
without regard to case. Currently only the first two work. My approach relies
on using the Synonym filter to translate all of the forms above to "pf". I'd
be happy to fix this approach or, even better, to learn that there's a general
approach that doesn't require as much configuration.

Thanks! --Mike

Ivan · May 30, 2012, 6:21pm

It just occurred to me as I was testing things that the standard filter
does NOT remove punctuation. My custom filter in Lucene was stripping
punctuation, not the standard filter.

I was able to remove punctuation by using a mapping char_filter. Mine
simply removes dots '.'

index :
  analysis :
    analyzer :
      unstemmed :
        type : custom
        filter : [unique, standard, asciifolding, lowercase]
        char_filter : [punctuation]
    char_filter :
      punctuation :
        type : mapping
        mappings : [".=>"]
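
To see why stripping dots before tokenizing makes all of the query forms from earlier in the thread line up, here is a rough Python emulation. It is only a sketch of the idea and not Elasticsearch's implementation: `apply_mappings` and `normalize` are hypothetical helpers, whitespace splitting stands in for the real tokenizer, and rules are applied one after another, which is a simplification that is fine for single-character removals like `.=>`.

```python
def apply_mappings(text, mappings):
    """Emulate a mapping char_filter: each rule is 'source=>replacement'."""
    for rule in mappings:
        src, _, dst = rule.partition("=>")
        text = text.replace(src, dst)
    return text


def normalize(text, mappings=(".=>",)):
    """Strip mapped characters, split on whitespace, then lowercase."""
    # Assumption: whitespace splitting stands in for the real tokenizer.
    stripped = apply_mappings(text, mappings)
    return [token.lower() for token in stripped.split()]


indexed = normalize("P.F. Changs Burgers")
print(indexed)  # ['pf', 'changs', 'burgers']

# Every query variant reduces to the same token as the indexed text.
for query in ["p.f.", "p.f", "pf.", "pf", "P.F."]:
    print(query, normalize(query) == ["pf"])
```

Because the same analyzer runs at index time and at search time, the dots disappear from both sides and no per-term synonym entries are needed.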

Michael_Sick · June 1, 2012, 4:44pm

Thanks Ivan - I'll give that a shot.
