Removing whitespace around a delimiter in a custom analyzer


(Rick Thomas) #1

I'm having difficulty with a custom analyzer. I have a field in my
index that looks like this: [I am a token, I'm a token too, Tokenize me,This is a token,Tokenize me]

Instead of creating term facets based on spaces, I want to create term
facets based on commas. I also need to remove any whitespace around
the comma. Here is my analyzer:

"analysis":{"analyzer":{"comma":{"type":"pattern","pattern":"\s*,\s*"}}}

It works in that it tokenizes the string based on commas, but it is
including leading and trailing whitespace in the tokens. I need to
get rid of that whitespace. The regex I'm using is supposed to do
that (it should match 0 to n spaces on either side of the comma),
but it is not.
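For what it's worth, the pattern does trim the whitespace around commas in a standalone regex engine (Python here, purely as a sanity check on the expression itself):

```python
import re

text = "I am a token, I'm a token too, Tokenize me,This is a token,Tokenize me"
# Splitting on \s*,\s* consumes the comma and any surrounding spaces.
print(re.split(r"\s*,\s*", text))
# ['I am a token', "I'm a token too", 'Tokenize me', 'This is a token', 'Tokenize me']
```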

Any thoughts on how I can force the regex engine to be greedier in its
analysis?

Thanks,

Rick


(Karussell) #2

Probably have a look at the word_delimiter filter.

Peter.



(Rick Thomas) #3

That appears to do the opposite of what I need.



(Karussell) #4

That appears to do the opposite of what I need.

I think you can hack this word_delimiter filter quite a bit, e.g. by
overriding the comma character to be recognized as SUBWORD_DELIM (see type_table).

I have a field in my index that looks like this

Do you mean the field or the original data?

Any thoughts on how I can force the regex engine to be greedier in its analysis?

No idea, I avoid regexes when and where I can :)

So I would do this via a custom WhitespaceTokenizer that overrides
isTokenChar.

Peter.


(Rick Thomas) #5

The original data looks like this:
I am a token, I'm a token too, Tokenize me,This is a token,Tokenize me

Your guidance is very much appreciated.

I tried this WordDelimiter filter as part of a custom analyzer, and
all it did was tokenize based on whitespace. Is there more
information on how to use the type_table field? What tokenizer should
a custom analyzer that specifies a filter use?

"filter":{"comma_delimiter":{"type":"word_delimiter","type_table":{",":"SUBWORD_DELIM"}}}

I feel like the solution should be easier than we're making it.



(Clinton Gormley) #6

On Wed, 2012-02-08 at 09:38 -0800, Rick Thomas wrote:

The original data looks like this:
I am a token, I'm a token too, Tokenize me,This is a token,Tokenize me

Your original example works for me:

curl -XPUT 'http://127.0.0.1:9200/foo/?pretty=1' -d '
{
   "settings" : {
      "analysis" : {
         "analyzer" : {
            "comma" : {
               "pattern" : "\s*,\s*",
               "type" : "pattern"
            }
         }
      }
   }
}
'

curl -XGET 'http://127.0.0.1:9200/foo/_analyze?pretty=1&text=I+am+a+token%2C+I%27m+a+token+too%2C+Tokenize+me%2CThis+is+a+token%2CTokenize+me&analyzer=comma'

[Wed Feb 8 18:54:41 2012] Response:

{
   "tokens" : [
      {
         "end_offset" : 12,
         "position" : 1,
         "start_offset" : 0,
         "type" : "word",
         "token" : "i am a token"
      },
      {
         "end_offset" : 29,
         "position" : 2,
         "start_offset" : 14,
         "type" : "word",
         "token" : "i'm a token too"
      },
      {
         "end_offset" : 42,
         "position" : 3,
         "start_offset" : 31,
         "type" : "word",
         "token" : "tokenize me"
      },
      {
         "end_offset" : 58,
         "position" : 4,
         "start_offset" : 43,
         "type" : "word",
         "token" : "this is a token"
      },
      {
         "end_offset" : 70,
         "position" : 5,
         "start_offset" : 59,
         "type" : "word",
         "token" : "tokenize me"
      }
   ]
}

Perhaps you need to give a working example (as above) showing exactly
what you are doing, the results you are getting, and what is wrong with
those results.

clint


(Rick Thomas) #7

Everything seems to work fine with the basic analysis, but when you
introduce faceting, you get extra whitespace around the tokens that
you don't get when you call _analyze. To use real-world data:

item 1: category: "foo bar "
item 2: category: "foo bar, ding bar "

This will create 3 distinct tokens: "foo bar", "foo bar ", "ding bar "

In reality, I need 2 distinct tokens: "foo bar" and "ding bar"
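These facet terms can be reproduced with the same split pattern in a standalone regex engine (Python here, purely as an illustration): splitting only trims whitespace adjacent to a comma, not at the ends of the whole string.

```python
import re

pattern = r"\s*,\s*"  # the comma analyzer's pattern

# Whitespace around the comma is consumed by the split...
print(re.split(pattern, "foo bar, ding bar "))  # ['foo bar', 'ding bar ']
# ...but leading/trailing whitespace on the whole string survives.
print(re.split(pattern, "foo bar "))            # ['foo bar ']
```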

I've put together a gist to recreate the issue:

https://gist.github.com/1773423

Any help getting rid of the whitespace around the tokens would be much
appreciated.


(Rick Thomas) #8

Is there anything I can do to simplify recreating the problem for
those who are more knowledgeable?



(Shay Banon) #9

On my end, I'm not a regex expert, so I'm not sure why the whitespace isn't being removed by your regular expression; it would take some experimenting to get it right. One thing we could do is add an analyzer token filter that trims whitespace, which might make things simpler...
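For reference, this suggestion corresponds to what became the built-in trim token filter. A sketch of how the analyzer could then be assembled, assuming a pattern tokenizer that splits on bare commas combined with the lowercase and trim filters (illustrative settings, not tested against a specific Elasticsearch version):

```json
"analysis" : {
   "tokenizer" : {
      "comma" : { "type" : "pattern", "pattern" : "," }
   },
   "analyzer" : {
      "comma" : {
         "type" : "custom",
         "tokenizer" : "comma",
         "filter" : [ "lowercase", "trim" ]
      }
   }
}
```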



(Shay Banon) #10

I added an issue here: https://github.com/elasticsearch/elasticsearch/issues/1693.



(Clinton Gormley) #11


OK - so the comma analyzer is actually removing whitespace around the
comma. The problem is that you have whitespace at the beginning or end
of your strings, where no commas are involved - that's where the
whitespace is coming from.

This works:

curl -XPUT 'http://127.0.0.1:9200/foo/?pretty=1' -d '
{
   "settings" : {
      "analysis" : {
         "analyzer" : {
            "comma" : {
               "pattern" : "^\s+|\s*,\s*|\s+$",
               "type" : "pattern"
            }
         }
      }
   }
}
'
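The effect of anchoring the pattern can be sanity-checked in a standalone regex engine (Python here, purely as an illustration); the empty strings produced at the string edges are filtered out, as the pattern tokenizer discards empty tokens:

```python
import re

# Trim at the string edges as well as around commas.
pattern = r"^\s+|\s*,\s*|\s+$"
text = " foo bar, ding bar "
tokens = [t for t in re.split(pattern, text) if t]
print(tokens)  # ['foo bar', 'ding bar']
```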

clint


(Rick Thomas) #12

Thanks so much for the assistance!


