Allow a dot at the beginning of a token, like '.Net'


(AGuereca) #1

I have spent all day trying to find a way to have “.Net” treated as a token instead of just “Net” without the dot. I’m getting noise in my searches when the term .Net is required with the dot (it means something completely different without it).

So far I haven’t been successful. The closest result has come from the whitespace tokenizer; the problem with it is that a sentence like “This features: ” is tokenized as “this”, “features:”, and of course any search for the term “features” will miss the document because of the “:”.
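
To illustrate the trade-off, here is a minimal Python sketch that uses plain `str.split()` as a stand-in for a whitespace tokenizer. This is an assumption made for illustration; it does not call the real tokenizer, and `whitespace_tokens` is a hypothetical helper name.

```python
def whitespace_tokens(text):
    """Lowercase and split on whitespace only; punctuation stays attached."""
    return text.lower().split()


tokens = whitespace_tokens("This features: the .Net runtime")
print(tokens)  # ['this', 'features:', 'the', '.net', 'runtime']

# ".net" survives as its own token, but "features" no longer matches,
# because the indexed token is "features:" with the colon attached.
print(".net" in tokens)      # True
print("features" in tokens)  # False
```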

I’m hoping there is something else besides the “Pattern Analyzer”, which I don’t have a clear idea of how to make work properly.
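
For reference, the kind of regular expression a pattern-based approach would need can be tried out in Python first. The sketch below is only an illustration with the `re` module: the pattern and the name `pattern_tokens` are assumptions, and it matches the tokens themselves, whereas a pattern-based analyzer may expect a separator pattern instead, so it would have to be adapted and checked before use.

```python
import re

# A dot is kept only when it starts a token, i.e. when it is not
# preceded by a word character; everything else is a plain \w+ run.
TOKEN = re.compile(r"(?<!\w)\.\w+|\w+")


def pattern_tokens(text):
    """Return lowercased tokens, keeping a leading dot as in '.Net'."""
    return [t.lower() for t in TOKEN.findall(text)]


print(pattern_tokens("This features: the .Net runtime."))
# ['this', 'features', 'the', '.net', 'runtime']

# A dot in the middle of a word is still treated as a separator.
print(pattern_tokens("ASP.Net"))
# ['asp', 'net']
```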

Any ideas would be greatly appreciated, since I’m out of them after banging my head against this all day.


(Clinton Gormley) #2

On Wed, 2011-04-06 at 19:13 -0700, AGuereca wrote:

I have all day trying to find a way to make the word “.Net” being considered
as a token instead of just “Net” without the dot; Because I’m getting noise
in my searches when the term .Net is required with dot (which mean something
completely different without).

This is tricky to do in the middle of a blob of text, because there are
loads of other uses of '.' where you do want to ignore the '.' and other
punctuation.

One thing you could do, short of writing a custom tokenizer in Java, is
to preprocess both your document text and your query strings to replace
occurrences of ".Net" with something like "dotNet".

It's a bit manual, but at least this way you won't lose the benefits of
the standard analyzer.
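
A minimal sketch of that preprocessing step in Python, to be applied to both the document text and the query string before they reach the search engine. The function name `preprocess` and the exact regular expression are assumptions made for illustration; in particular, this version only rewrites ".Net" when the dot starts a token, so "example.net" or "ASP.Net" are left alone.

```python
import re

# ".net" in any casing, where the dot is not preceded by a word character
# and "net" is not followed by further word characters.
DOT_NET = re.compile(r"(?<!\w)\.net\b", re.IGNORECASE)


def preprocess(text):
    """Replace a token-initial '.Net' with the placeholder 'dotNet'."""
    return DOT_NET.sub("dotNet", text)


# Apply the same function at index time and at query time.
print(preprocess("Looking for a .Net developer"))
# Looking for a dotNet developer
print(preprocess(".NET"))
# dotNet
print(preprocess("Visit example.net"))
# Visit example.net
```

Because the same replacement runs on both sides, a query for ".Net" and a document containing ".Net" both end up with the placeholder, which the standard analyzer can then treat as an ordinary word.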

hth

clint


(Karussell) #3

I did something similar for @user and #hashtag.

For that I took the WordDelimiterFilter from Solr and then extended it.

With that extension you could specify handleAsChar = ".".

I used handleAsDigit = "@" together with the Solr setting
splitOnNumerics=true, so that @user is indexed as two terms: user and
@user.
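
The effect described above can be sketched in a few lines of Python. This is not the WordDelimiterFilter and not the extension itself, just an illustration of the resulting terms; the function `tokenize` and its parameters `prefix_chars` and `also_bare_for` are hypothetical names.

```python
import re


def tokenize(text, prefix_chars=".", also_bare_for="@#"):
    """Keep selected leading characters on tokens.

    prefix_chars:  characters kept when they start a token ('.Net' -> '.net').
    also_bare_for: characters kept as well, but the bare word is emitted
                   in addition ('@user' -> 'user' and '@user').
    """
    allowed = prefix_chars + also_bare_for
    if allowed:
        pattern = rf"(?<!\w)[{re.escape(allowed)}]\w+|\w+"
    else:
        pattern = r"\w+"
    terms = []
    for tok in re.findall(pattern, text.lower()):
        if tok[0] in also_bare_for:
            terms.append(tok[1:])  # bare word, e.g. "user"
        terms.append(tok)          # full token, e.g. "@user" or ".net"
    return terms


print(tokenize("@user likes .Net and #search"))
# ['user', '@user', 'likes', '.net', 'and', 'search', '#search']
```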

Regards,
Peter.

--
http://jetwick.com/ Personalized Twitter Search



(Gabriel) #4

Is there any way to do that without creating a Java class?

