Differences with standard analyzer in 16.2 vs 14.2

ppearcy · June 23, 2011, 12:05am

Hey,
Probably documented in the release notes, but I wanted to point out
a change with the standard analyzer between 14.2 and 16.2 that doesn't
give great results. This is definitely an edge case and I have ~8
other examples where the results are better, so a win overall.

The old behavior of the standard analyzer correctly recognizes AT&T as a single token:

curl -XGET 'localhost:9200/index19/_analyze?analyzer=standard' -d 'AT&T'
{"tokens":[{"token":"at&t","start_offset":0,"end_offset":4,"type":"","position":1}]}

The new behavior splits on the ampersand and then ends up removing AT as a stop word, so only "t" is left:

curl -XGET 'localhost:9200/index19/_analyze?analyzer=standard' -d 'AT&T'
{"tokens":[{"token":"t","start_offset":3,"end_offset":4,"type":"","position":2}]}

Just wanted to point this out.

Thanks!
Paul

Igor_Motov · June 23, 2011, 3:53am

Yes, the standard tokenizer was changed in Lucene 3.1 (LUCENE-2167, "Implement StandardTokenizer with the UAX#29 Standard"). I think you can bring
the old behavior back by specifying the analyzer version in the config file:

index.analysis.analyzer.standard.type: standard
index.analysis.analyzer.standard.version: 3.0
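
To illustrate why the tokenizer change leaves only "t" at position 2, here is a rough Python sketch of the tokenize, lowercase, stop-filter chain. It is a toy model, not Lucene's code: the `analyze` function, the two regex patterns, and the shortened stop-word set are all assumptions made for illustration.

```python
import re

# Assumption: a shortened stop-word set that includes "at"; the real default list is longer.
STOP_WORDS = {"a", "an", "and", "at", "the", "to"}

def analyze(text, split_on_ampersand):
    """Toy model of tokenize -> lowercase -> stop filter (hypothetical, not Lucene)."""
    if split_on_ampersand:
        # New-style: '&' is a word boundary, so "AT&T" becomes "AT" and "T".
        pattern = r"[A-Za-z0-9]+"
    else:
        # Old-style: '&' between alphanumeric runs stays inside the token.
        pattern = r"[A-Za-z0-9]+(?:&[A-Za-z0-9]+)*"
    tokens = []
    position = 0
    for match in re.finditer(pattern, text):
        position += 1  # every raw token takes a position, even if it is dropped below
        term = match.group().lower()
        if term in STOP_WORDS:
            continue  # stop word removed, which leaves a gap in the positions
        tokens.append({
            "token": term,
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens

old_style = analyze("AT&T", split_on_ampersand=False)
new_style = analyze("AT&T", split_on_ampersand=True)
print(old_style)
print(new_style)
```

In this sketch the old-style pattern keeps "at&t" whole at position 1, while the new-style pattern yields "at" (dropped as a stop word) followed by "t" at position 2, which matches the offsets and positions in the two outputs above.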
