Analyzer for football scores

Ridvan_Gyundogan · October 26, 2011, 2:06pm

Hi, I know this might sound funny, but I am trying to extract football
scores from text fields.
For example, I have "Man United - Man City 1:6". I want to make a terms
facet which groups the documents by score:
1 : 1 (339 documents)
2 : 1 (564 documents)
...
....

So I want "1 : 1" to be analyzed as a single term. Does anyone have an idea
how to do this with an analyzer, tokenizer, or filters?

The regular expression for terms would be something like \d\s:\s\d.
The thing is that the pattern analyzer expects a regular expression
for the separator not for the term.
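
To make the goal concrete, here is a small sketch, outside Elasticsearch, of the grouping described above, using Python's `re` and `collections.Counter`. The sample documents are made up, and the regex is relaxed to `(\d+)\s*:\s*(\d+)` as an assumption: `\d\s:\s\d` requires a whitespace character on each side of the colon, so it would not match a score written as `1:6`.

```python
import re
from collections import Counter

# Assumed, relaxed form of the \d\s:\s\d idea: whitespace around ":" is optional.
SCORE_RE = re.compile(r"(\d+)\s*:\s*(\d+)")

# Made-up sample documents, for illustration only.
docs = [
    "Man United - Man City 1:6",
    "Team A - Team B 1 : 1",
    "Team C - Team D 1:1",
    "Match postponed",
]


def extract_score(text):
    """Return the first score in text, normalized to 'H : A', or None."""
    match = SCORE_RE.search(text)
    if match is None:
        return None
    return f"{match.group(1)} : {match.group(2)}"


# Count documents per normalized score, similar in spirit to a terms facet.
score_counts = Counter(s for s in map(extract_score, docs) if s is not None)

for score, count in sorted(score_counts.items()):
    print(f"{score} ({count} documents)")
```

Normalizing the spacing before counting matters here: without it, `1:1` and `1 : 1` would end up as two different terms.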

Clinton_Gormley · October 26, 2011, 2:21pm

Hi Ridvan

> So I want "1 : 1" to be analyzed as a single term. Does anyone have an idea
> how to do this with an analyzer, tokenizer, or filters?
>
> The regular expression for terms would be something like \d\s:\s\d.
> The thing is that the pattern analyzer expects a regular expression
> for the separator not for the term.

You can build a custom analyzer using the pattern TOKENIZER (as opposed
to the pattern analyzer), which allows you to specify a 'group' number,
so that you can capture tokens instead of matching on separators.

clint
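
As a rough sketch of what such a custom analyzer could look like, the settings below define a `pattern` tokenizer with `group` set to 0, so that each regex match becomes one token. The names `score_tokenizer` and `score_analyzer` and the relaxed regex `\d+\s*:\s*\d+` are assumptions for illustration, not taken from the thread. The body is written as a Python dict that would be serialized to JSON when creating the index.

```python
import json
import re

# Assumed regex: digits, optional whitespace, a colon, optional whitespace,
# digits. Matches both "1:6" and "1 : 1".
SCORE_PATTERN = r"\d+\s*:\s*\d+"

# Hypothetical names: score_tokenizer and score_analyzer are illustrative only.
index_settings = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "score_tokenizer": {
                    "type": "pattern",
                    "pattern": SCORE_PATTERN,
                    "group": 0,  # 0 = emit the whole match as the token
                }
            },
            "analyzer": {
                "score_analyzer": {
                    "type": "custom",
                    "tokenizer": "score_tokenizer",
                }
            },
        }
    }
}

# Local sanity check of the regex with Python's re module. This only
# approximates what the Java-regex-based tokenizer would emit.
sample = "Man United - Man City 1:6"
local_tokens = [m.group(0) for m in re.finditer(SCORE_PATTERN, sample)]

print(json.dumps(index_settings, indent=2))
print(local_tokens)
```

With this setup, `1:6` and `1 : 6` would still be indexed as distinct terms. Normalizing the whitespace, for instance with a `pattern_replace` token filter, would be a separate step.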

Ridvan_Gyundogan · October 26, 2011, 3:16pm

Hi Clint, thanks for the answer.

I think I get it, but what confuses me is the comment at the bottom,
of the link you provided:
"IMPORTANT: The regular expression should match the token separators,
not the tokens themselves."

On the other hand, if I look at the following example from Shay:

github.com/elastic/elasticsearch

Analysis: Pattern Tokenizer

opened 10:19PM - 12 May 11 UTC

closed 10:22PM - 12 May 11 UTC

kimchy

>enhancement v0.17.0 v0.16.2

Pattern tokenizer allows you to define a tokenizer that uses a regex to break text into tokens. The `pattern` parameter accepts the regex expression (and `flags` the common ES level regex flags). It also accepts `group` (defaults to -1). From the docs:

group=-1 (the default) is equivalent to "split". In this case, the tokens will be equivalent to the output from String#split(java.lang.String) (without empty tokens).

Using group >= 0 selects the matching group as the token. For example, if you have:

```
pattern = \'([^\']+)\'
group = 0
input = aaa 'bbb' 'ccc'
```

the output will be two tokens: 'bbb' and 'ccc' (including the ' marks). With the same input but using group=1, the output would be: bbb and ccc (no ' marks).

It looks like the pattern is exactly for the tokens, not for the
separators?


Clinton_Gormley · October 26, 2011, 3:25pm

Hi Ridvan

> I think I get it, but what confuses me is the comment at the bottom,
> of the link you provided:
> "IMPORTANT: The regular expression should match the token separators,
> not the tokens themselves."

I think that's just a bad copy-paste from the analyzer docs.

> On the other hand, if I look at the following example from Shay:
> Analysis: Pattern Tokenizer · Issue #928 · elastic/elasticsearch · GitHub
> It looks like the pattern is exactly for the tokens, not for the
> separators?

As it says, if "group" is -1 then it acts as a 'split' on the regex (i.e.
your regex should match the token separators), but if group is >= 0 then
it returns what is matched by that group. For example:

For text 'foobar' and regex /(o(b))a/

group   tokens
-1      fo, r
 0      oba
 1      ob
 2      b

clint
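
The `group` behaviour in the table above can be approximated locally with Python's `re` module. This is only a sketch: `pattern_tokenize` is a hypothetical helper, not an Elasticsearch API, and Python's regex engine differs from Java's in some details.

```python
import re


def pattern_tokenize(text, pattern, group=-1):
    """Approximate the pattern tokenizer's `group` semantics with Python's re."""
    if group < 0:
        # Split mode: the regex matches separators; empty tokens are dropped.
        # Match spans are used instead of re.split, because re.split would
        # also return the text of any capturing groups.
        tokens, last = [], 0
        for m in re.finditer(pattern, text):
            tokens.append(text[last:m.start()])
            last = m.end()
        tokens.append(text[last:])
        return [t for t in tokens if t]
    # Capture mode: each match contributes the requested group as a token.
    return [m.group(group) for m in re.finditer(pattern, text) if m.group(group)]


# Reproduce the 'foobar' table for each group value.
for g in (-1, 0, 1, 2):
    print(g, pattern_tokenize("foobar", r"(o(b))a", g))
```

For the football scores, the same helper with a score regex and `group=0` would return each score as a single token, which is the behaviour the original question asks for.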
