Removing whitespace around a delimiter in a custom analyzer


(Rick Thomas) #1

I'm having difficulty with a custom analyzer. I have a field in my
index that looks like this: [I am a token, I'm a token too, Tokenize me,This is a token,Tokenize me]

Instead of creating term facets based on spaces, I want to create term
facets based on commas. I also need to remove any whitespace around
the comma. Here is my analyzer:

"analysis":{"analyzer":{"comma":{"type":"pattern","pattern":"\s*,\s*"}}}

It works in that it tokenizes the string based on commas, but it is
including leading and trailing whitespace in the tokens. I need to
get rid of that whitespace. The regex I'm using is supposed to do
that (it should match 0 to n spaces on either side of the comma),
but it is not.
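For what it's worth, the pattern does trim the whitespace around commas in a standalone regex engine (Python here, purely as a sanity check on the expression itself):

```python
import re

text = "I am a token, I'm a token too, Tokenize me,This is a token,Tokenize me"
# Splitting on \s*,\s* consumes the comma and any surrounding spaces.
print(re.split(r"\s*,\s*", text))
# ['I am a token', "I'm a token too", 'Tokenize me', 'This is a token', 'Tokenize me']
```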

Any thoughts on how I can force the regex engine to be greedier in its
analysis?

Thanks,

Rick


(Karussell) #2

Probably have a look at the word_delimiter filter.

Peter.



(Rick Thomas) #3

That appears to do the opposite of what I need.



(Karussell) #4

That appears to do the opposite of what I need.

I think you can hack this word_delimiter filter quite a bit, e.g. by
overriding the comma character to be recognized as SUBWORD_DELIM (see type_table).

I have a field in my index that looks like this

Do you mean the field or the original data?

Any thoughts on how I can force the regex engine to be greedier in its analysis?

No idea, I avoid regexes when and where I can :)

So I would do this via a custom WhitespaceTokenizer that overrides
isTokenChar.

Peter.


(Rick Thomas) #5

The original data looks like this:
I am a token, I'm a token too, Tokenize me,This is a token,Tokenize me

Your guidance is very much appreciated.

I tried this WordDelimiter filter as part of a custom analyzer, and
all it did was tokenize based on whitespace. Is there more
information on how to use the type_table field? What tokenizer should
a custom analyzer that specifies a filter use?

"filter":{"comma_delimiter":{"type":"word_delimiter","type_table":{",":"SUBWORD_DELIM"}}}

I feel like the solution should be easier than we're making it.



(Clinton Gormley) #6

On Wed, 2012-02-08 at 09:38 -0800, Rick Thomas wrote:

The original data looks like this:
I am a token, I'm a token too, Tokenize me,This is a token,Tokenize me

Your original example works for me:

curl -XPUT 'http://127.0.0.1:9200/foo/?pretty=1' -d '
{
   "settings" : {
      "analysis" : {
         "analyzer" : {
            "comma" : {
               "pattern" : "\s*,\s*",
               "type" : "pattern"
            }
         }
      }
   }
}
'

curl -XGET 'http://127.0.0.1:9200/foo/_analyze?pretty=1&text=I+am+a+token%2C+I%27m+a+token+too%2C+Tokenize+me%2CThis+is+a+token%2CTokenize+me&analyzer=comma'

[Wed Feb 8 18:54:41 2012] Response:

{
   "tokens" : [
      {
         "end_offset" : 12,
         "position" : 1,
         "start_offset" : 0,
         "type" : "word",
         "token" : "i am a token"
      },
      {
         "end_offset" : 29,
         "position" : 2,
         "start_offset" : 14,
         "type" : "word",
         "token" : "i'm a token too"
      },
      {
         "end_offset" : 42,
         "position" : 3,
         "start_offset" : 31,
         "type" : "word",
         "token" : "tokenize me"
      },
      {
         "end_offset" : 58,
         "position" : 4,
         "start_offset" : 43,
         "type" : "word",
         "token" : "this is a token"
      },
      {
         "end_offset" : 70,
         "position" : 5,
         "start_offset" : 59,
         "type" : "word",
         "token" : "tokenize me"
      }
   ]
}

Perhaps you need to give a working example (as above) showing exactly
what you are doing, the results you are getting, and what is wrong with
those results.

clint


(Rick Thomas) #7

Everything seems to work fine with the basic analysis, but when you
introduce faceting, you get extra whitespace around the tokens that
you don't get when you call _analyze. To use real-world data:

item 1: category: "foo bar "
item 2: category: "foo bar, ding bar "

This will create 3 distinct tokens: "foo bar", "foo bar ", "ding bar "

In reality, I need 2 distinct tokens: "foo bar" and "ding bar"
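These facet terms can be reproduced with the same split pattern in a standalone regex engine (Python here, purely as an illustration): splitting only trims whitespace adjacent to a comma, not at the ends of the whole string.

```python
import re

pattern = r"\s*,\s*"  # the comma analyzer's pattern

# Whitespace around the comma is consumed by the split...
print(re.split(pattern, "foo bar, ding bar "))  # ['foo bar', 'ding bar ']
# ...but leading/trailing whitespace on the whole string survives.
print(re.split(pattern, "foo bar "))            # ['foo bar ']
```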

I've put together a gist to recreate the issue:

https://gist.github.com/1773423

Any help getting rid of the whitespace around the tokens would be much
appreciated.


(Rick Thomas) #8

Is there anything I can do to simplify recreating the problem for
those who are more knowledgeable?



(Shay Banon) #9

On my end, I'm not a regex expert, so I'm not sure why the whitespace isn't being removed by your regular expression; it would take some experimenting to get it right. One thing we could do is add an analyzer token filter that trims whitespace, which might make things simpler...
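For reference, this suggestion corresponds to what became the built-in trim token filter. A sketch of how the analyzer could then be assembled, assuming a pattern tokenizer that splits on bare commas combined with the lowercase and trim filters (illustrative settings, not tested against a specific Elasticsearch version):

```json
"analysis" : {
   "tokenizer" : {
      "comma" : { "type" : "pattern", "pattern" : "," }
   },
   "analyzer" : {
      "comma" : {
         "type" : "custom",
         "tokenizer" : "comma",
         "filter" : [ "lowercase", "trim" ]
      }
   }
}
```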



(Shay Banon) #10

I added an issue here: https://github.com/elasticsearch/elasticsearch/issues/1693.



(Clinton Gormley) #11


OK - so the comma analyzer is actually removing whitespace around the
comma. The problem is that you have whitespace at the beginning or end
of your strings, where no commas are involved - that's where the
whitespace is coming from.

This works:

curl -XPUT 'http://127.0.0.1:9200/foo/?pretty=1' -d '
{
   "settings" : {
      "analysis" : {
         "analyzer" : {
            "comma" : {
               "pattern" : "^\s+|\s*,\s*|\s+$",
               "type" : "pattern"
            }
         }
      }
   }
}
'
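The effect of anchoring the pattern can be sanity-checked in a standalone regex engine (Python here, purely as an illustration); the empty strings produced at the string edges are filtered out, as the pattern tokenizer discards empty tokens:

```python
import re

# Trim at the string edges as well as around commas.
pattern = r"^\s+|\s*,\s*|\s+$"
text = " foo bar, ding bar "
tokens = [t for t in re.split(pattern, text) if t]
print(tokens)  # ['foo bar', 'ding bar']
```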

clint


(Rick Thomas) #12

Thanks so much for the assistance!


