I have spent all day trying to find a way to have “.Net” treated as a single token instead of just “Net” without the dot, because I’m getting noise in my searches when the term .Net is required with the dot (it means something completely different without it).
So far I haven’t been successful. The closest I have come is the whitespace tokenizer, but the problem is that text like “This features: ” is tokenized as “this” and “features:”, so a search for the term “features” misses the document because of the trailing “:”.
I’m hoping there is something else besides the “Pattern Analyzer”, which I don’t have a clear idea of how to make work properly.
Any ideas will be greatly appreciated, since I’ve run out of them after banging my head against this all day.
On Wed, 2011-04-06 at 19:13 -0700, AGuereca wrote:
I have all day trying to find a way to make the word “.Net” being considered
as a token instead of just “Net” without the dot; Because I’m getting noise
in my searches when the term .Net is required with dot (which mean something
completely different without).
This is tricky to do in the middle of a blob of text, because there are
loads of other uses of '.' where you do want to ignore the '.' and other
punctuation.
One thing you could do, short of writing a custom tokenizer in Java, is
to preprocess both your document text and your query strings, replacing
occurrences of ".Net" with something like "dotNet".
It's a bit manual, but at least this way you won't lose the benefits of
the standard analyzer.
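
As a rough illustration of that preprocessing step, here is a minimal sketch in plain Java. The class name DotNetPreprocessor and the method normalize are made up for this example and are not part of Lucene or Solr. It assumes you only want to rewrite a standalone ".Net" (any casing) and leave things like "example.net" or "ASP.NET" alone; adjust the pattern if that assumption does not fit your data.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not a Lucene/Solr class: rewrites ".Net" to "dotNet"
// so that a standard analyzer sees an ordinary word instead of punctuation.
final class DotNetPreprocessor {

    // Matches ".net" in any casing, but only when it is not glued to a
    // preceding or following letter/digit. That leaves "example.net",
    // "ASP.NET" and ".network" untouched (an assumption of this sketch).
    private static final Pattern DOT_NET = Pattern.compile(
            "(?<![\\p{L}\\p{N}])\\.net(?![\\p{L}\\p{N}])",
            Pattern.CASE_INSENSITIVE);

    private DotNetPreprocessor() {
        // static utility, no instances
    }

    // Apply the same call to document text before indexing and to the
    // query string before searching, so both sides agree on the term.
    static String normalize(String text) {
        return DOT_NET.matcher(text).replaceAll("dotNet");
    }
}
```

You would run DotNetPreprocessor.normalize(...) over each field value before handing it to the indexer, and over the user's query string before parsing it. If one side is skipped, the terms will no longer match.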