Hi, I know this might sound funny, but I am trying to extract football
scores from text fields.
For example, I have "Man United - Man City 1:6". I want to build a terms
facet that groups the documents by score:
1 : 1 (339 documents)
2 : 1 (564 documents)
...
....
So I want "1 : 1" to be analyzed as a single term. Does anyone have an idea
how to do this with an analyzer, tokenizer, or filters?
The regular expression for the terms would be something like \d\s:\s\d.
The problem is that the pattern analyzer expects a regular expression
for the separator, not for the term.
You can build a custom analyzer using the pattern TOKENIZER, which
lets you specify a 'group' number, so that you capture the tokens
themselves instead of matching on the separators.
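A minimal sketch of what such index settings might look like, written as a Python dict and printed as JSON. The names "score_analyzer" and "score_tokenizer" are made up for illustration, and the regex is a loosened guess (multi-digit scores, optional whitespace) rather than the one from the question; "pattern" and "group" are the pattern tokenizer's parameters.

```python
import json

# Hypothetical names: "score_tokenizer" and "score_analyzer" are illustrative only.
settings = {
    "analysis": {
        "tokenizer": {
            "score_tokenizer": {
                "type": "pattern",
                # Assumed regex: digits, optional spaces, colon, optional spaces, digits.
                "pattern": r"(\d+\s*:\s*\d+)",
                # Emit capture group 1 of each match as a token instead of splitting.
                "group": 1,
            }
        },
        "analyzer": {
            "score_analyzer": {
                "type": "custom",
                "tokenizer": "score_tokenizer",
            }
        },
    }
}

# Body that could be sent when creating the index.
print(json.dumps({"settings": settings}, indent=2))
```

Note that "1 :1" and "1 : 1" would still come out as different terms with this sketch, so some whitespace normalization may be needed before faceting.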
I think I get it, but what confuses me is the comment at the bottom,
of the link you provided:
"IMPORTANT: The regular expression should match the token separators,
not the tokens themselves."
On the other hand, if I look at the following example from Shay:
It looks like the pattern there matches the tokens themselves, not the
separators?
I think that's just a bad copy-paste from the analyzer docs.
As it says, if "group" is -1 then it acts as a 'split' on the regex (i.e.
your regex should match the token separators), but if group is >= 0 then
it returns what is matched (0 for the whole match, 1 and up for that
capture group). For example:
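The two modes can be imitated in plain Python to see the difference. This is only a rough simulation under the assumption that split mode drops empty pieces; it is not the tokenizer's actual code, and the helper name `tokenize` is made up.

```python
import re

def tokenize(text, pattern, group=-1):
    """Roughly mimic the pattern tokenizer's two modes (a simplification)."""
    if group < 0:
        # Split mode: the regex matches separators; keep the non-empty pieces.
        # (Use a pattern without capture groups here, since re.split would
        # otherwise return the captured separators as well.)
        return [piece for piece in re.split(pattern, text) if piece]
    # Match mode: the regex matches tokens; emit the chosen group of each match.
    return [m.group(group) for m in re.finditer(pattern, text) if m.group(group)]

text = "Man United - Man City 1:6"

# group = -1: the regex describes separators, so the score is broken apart.
print(tokenize(text, r"\W+"))

# group = 0: the regex describes the token, so the score survives as one term.
print(tokenize(text, r"\d+\s*:\s*\d+", group=0))
```

With group 0 only the score is emitted and the team names are discarded, which is what a facet on scores needs.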