I'm having difficulty with a custom analyzer. I have a field in my
index that looks like this: [I am a token, I'm a token too, Tokenize
me,This is a token,Tokenize me]
Instead of creating term facets based on spaces, I want to create term
facets based on commas. I also need to remove any whitespace around
the comma. Here is my analyzer:
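(Reconstructed sketch of the settings; the index and analyzer names are placeholders, but the pattern is the regex in question.)

curl -XPUT 'localhost:9200/test' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "comma_analyzer": {
          "type": "pattern",
          "pattern": "\\s*,\\s*",
          "lowercase": false
        }
      }
    }
  }
}'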
It works in that it tokenizes the string on commas, but it is
including leading and trailing whitespace in the tokens. I need to
get rid of that whitespace. The regex I'm using is supposed to handle
that (it should match 0 to n spaces on either side of the comma),
but it is not doing so.
Any thoughts on how I can force the regex engine to be greedier in its
analysis?
The original data looks like this:
I am a token, I'm a token too, Tokenize me,This is a token,Tokenize me
Your guidance is very much appreciated.
I tried the word_delimiter filter as part of a custom analyzer, and
all it did was tokenize on whitespace. Is there more
information on how to use the type_table field? What tokenizer should
a custom analyzer that specifies a filter use?
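Token filters run after a tokenizer, so an analyzer that relies on word_delimiter to do the splitting usually pairs it with the keyword tokenizer, which passes the whole string through as a single token. An untested sketch of that idea; the type_table entry (mapping the space character to ALPHANUM so that only the commas act as split points) is my guess, not something I've verified:

curl -XPUT 'localhost:9200/test' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "comma_delimited": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["comma_split"]
        }
      },
      "filter": {
        "comma_split": {
          "type": "word_delimiter",
          "type_table": ["\\u0020 => ALPHANUM"]
        }
      }
    }
  }
}'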
Perhaps you need to give a working example (as above) showing exactly
what you are doing, the results you are getting, and what is wrong with
those results.
Everything seems to work fine with the basic analysis, but when you
introduce faceting, you get extra whitespace around the tokens that
you don't see when you call _analyze. To use real-world data:
item 1: category: "foo bar "
item 2: category: "foo bar, ding bar "
This will create three distinct tokens: "foo bar", "foo bar ", and "ding bar ".
In reality, I need two distinct tokens: "foo bar" and "ding bar".
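For reference, this is the sort of analyze call I'm comparing against (a sketch, using the older query-string form of the analyze API; it assumes an index named test with comma_analyzer registered on it):

curl -XGET 'localhost:9200/test/_analyze?analyzer=comma_analyzer&pretty' -d 'foo bar, ding bar '

Whatever tokens this returns are what should end up as the facet entries.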
I've put together a gist to recreate the issue:
Any help getting rid of the whitespace around the tokens would be much
appreciated.
On my end, not being a regex expert, I'm not sure why the whitespace isn't removed based on your regular expression; it requires some playing to get it done properly. One thing we can do is add an analyzer token filter that can trim whitespace, which might make things simpler...
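A trim token filter did make it into Elasticsearch in later releases; a sketch of how it could slot in alongside a plain-comma pattern tokenizer (index, analyzer, and tokenizer names are placeholders):

curl -XPUT 'localhost:9200/test' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "comma_analyzer": {
          "type": "custom",
          "tokenizer": "comma_tokenizer",
          "filter": ["trim"]
        }
      },
      "tokenizer": {
        "comma_tokenizer": {
          "type": "pattern",
          "pattern": ","
        }
      }
    }
  }
}'

Because trim removes leading and trailing whitespace from every token, it also covers whitespace at the start and end of the whole string, where no comma is involved.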
On Thursday, February 9, 2012 at 11:44 PM, Rick Thomas wrote:
Is there anything I can do to simplify recreating the problem for
those who are more knowledgeable?
OK - so the comma analyzer is actually removing whitespace around the
comma. The problem is that you have whitespace at the beginning or end
of your strings, where no commas are involved - that's where the
whitespace is coming from.
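One way to deal with that edge whitespace is to strip it before tokenizing (a sketch, assuming a reasonably recent Elasticsearch with the pattern_replace char filter; the names are placeholders):

curl -XPUT 'localhost:9200/test' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "comma_analyzer": {
          "type": "custom",
          "char_filter": ["strip_edges"],
          "tokenizer": "comma_tokenizer"
        }
      },
      "char_filter": {
        "strip_edges": {
          "type": "pattern_replace",
          "pattern": "^\\s+|\\s+$",
          "replacement": ""
        }
      },
      "tokenizer": {
        "comma_tokenizer": {
          "type": "pattern",
          "pattern": "\\s*,\\s*"
        }
      }
    }
  }
}'

The char filter runs over the whole field value before tokenizing, so "foo bar, ding bar " becomes "foo bar, ding bar", and the comma split then yields clean tokens.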