Analyzer for football scores


(Ridvan Gyundogan) #1

Hi, I know this might sound funny, but I am trying to extract football
scores from text fields.
For example, I have "Man United - Man City 1:6". I want to make a term
facet which groups the documents by score:
1 : 1 (339 documents)
2 : 1 (564 documents)
...
....

So I want "1 : 1" to be analyzed as a single term. Does anyone have an
idea how to do this with the analyzer, tokenizer, or filters?

The regular expression for the terms would be something like \d\s*:\s*\d.
The thing is that the pattern analyzer expects a regular expression
for the separator, not for the term.


(Clinton Gormley) #2

Hi Ridvan

> So I want "1 : 1" to be analyzed as a single term. Does anyone have an
> idea how to do this with the analyzer, tokenizer, or filters?
>
> The regular expression for the terms would be something like \d\s*:\s*\d.
> The thing is that the pattern analyzer expects a regular expression
> for the separator, not for the term.

You can build a custom analyzer using the pattern TOKENIZER, which
allows you to specify a 'group' number, so that you can capture tokens
instead of matching on separators:

http://www.elasticsearch.org/guide/reference/index-modules/analysis/pattern-tokenizer.html
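
A minimal sketch of what such index settings could look like, written as a Python dict so it can be dumped to JSON. The names score_tokenizer and score_analyzer are made up for illustration, and the pattern \d+\s*:\s*\d+ is only an assumption about what a score looks like; "type", "pattern" and "group" are the pattern tokenizer settings. The last lines only approximate the tokenizer locally with Python's re module, they do not call Elasticsearch.

```python
import json
import re

# Assumed score pattern: digits, optional whitespace, colon, optional whitespace, digits.
SCORE_PATTERN = r"\d+\s*:\s*\d+"

# "score_tokenizer" and "score_analyzer" are hypothetical names chosen for this sketch.
index_settings = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "score_tokenizer": {
                    "type": "pattern",
                    "pattern": SCORE_PATTERN,
                    # group 0 = emit the whole match as the token
                    "group": 0,
                }
            },
            "analyzer": {
                "score_analyzer": {
                    "type": "custom",
                    "tokenizer": "score_tokenizer",
                }
            },
        }
    }
}

# Local approximation of the tokens the tokenizer would emit for one document.
sample = "Man United - Man City 1:6"
expected_tokens = [m.group(0) for m in re.finditer(SCORE_PATTERN, sample)]

print(json.dumps(index_settings, indent=2))
print(expected_tokens)  # ['1:6']
```

Note that with this pattern "1:6" and "1 : 6" would come out as two different terms, so the whitespace would still have to be normalized somewhere if both spellings occur in the data.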

clint


(Ridvan Gyundogan) #3

Hi Clint, thanks for the answer.

I think I get it, but what confuses me is the comment at the bottom
of the page you linked:
"IMPORTANT: The regular expression should match the token separators,
not the tokens themselves."

On the other hand, if I look at the following example from Shay:
https://github.com/elasticsearch/elasticsearch/issues/928
it looks like the pattern is exactly for the tokens, not for the
separators?

(In reply to Clinton Gormley's message of Oct 26, 5:21 pm.)


(Clinton Gormley) #4

Hi Ridvan

> I think I get it, but what confuses me is the comment at the bottom
> of the page you linked:
> "IMPORTANT: The regular expression should match the token separators,
> not the tokens themselves."

I think that's just a bad copy-paste from the analyzer docs.

> On the other hand, if I look at the following example from Shay:
> https://github.com/elasticsearch/elasticsearch/issues/928
> it looks like the pattern is exactly for the tokens, not for the
> separators?

As it says, if "group" is -1 then it acts as a 'split' on the regex (i.e.
your regex should match the token separators), but if group is >= 0 then
it returns what that capture group matched (0 being the whole match).
For example:

For text 'foobar' and regex /(o(b))a/:

group   tokens
-1      fo, r
 0      oba
 1      ob
 2      b
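
The table above can be reproduced with a rough Python emulation of the group behaviour. This is only a sketch using Python's re module, not the tokenizer's actual implementation; pattern_tokens is a hypothetical helper, and it assumes empty tokens are dropped.

```python
import re


def pattern_tokens(text, pattern, group):
    """Emulate the 'group' setting: -1 splits on matches, >= 0 returns that group."""
    regex = re.compile(pattern)
    if group == -1:
        # Split mode: the regex describes the separators.
        tokens, last = [], 0
        for match in regex.finditer(text):
            tokens.append(text[last:match.start()])
            last = match.end()
        tokens.append(text[last:])
        return [t for t in tokens if t]  # assumption: empty tokens are dropped
    # Capture mode: the regex describes the tokens themselves.
    return [m.group(group) for m in regex.finditer(text) if m.group(group)]


results = {g: pattern_tokens("foobar", r"(o(b))a", g) for g in (-1, 0, 1, 2)}
for g, tokens in results.items():
    print(g, tokens)
```

For 'foobar' this prints fo and r for group -1, oba for 0, ob for 1 and b for 2, matching the table.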

clint

