Did ES / Lucene start tokenizing fields differently in 0.17.0?

tsuna · May 11, 2011, 5:03pm

Hi all,
I'm testing my app on elasticsearch-0.17.0-SNAPSHOT (built from
563ad625c0f69f3ff0f4c39f46421b1dc2c91b6f) and in my app I'm doing a
term facet on a field. I noticed a difference in behavior. If the
field contains "foo_bar", in 0.16 it would be tokenized as 2 tokens
["foo", "bar"], but in 0.17 it remains a single token ["foo_bar"]. I
have absolutely zero configuration change on my ES instance, it's a
complete vanilla install from the commit above. My mapping is created
dynamically without me specifying anything about it.

Hence my question: Did ES / Lucene start tokenizing fields differently?

--
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com

kimchy · May 11, 2011, 5:11pm

Are you sure that in 0.16 it gets tokenized into 2 tokens? I ran the following on 0.15.2, 0.16.0 (where some analysis behavior changed with the Lucene upgrade), and master, and all of them produce a single token (using the default standard analyzer).

curl -XPUT localhost:9200/test
curl -XPOST localhost:9200/test/_analyze -d 'foo_bar'
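
For anyone comparing the two outcomes discussed in this thread, here is a toy sketch of what "one token" versus "two tokens" means for "foo_bar". It is only an illustration in Python using regular expressions. It is not Lucene's standard analyzer, and the function names are made up for this example. A term facet counts the indexed tokens, so whichever split the analyzer produces is what shows up as facet entries.

```python
import re

def tokens_keep_underscore(text):
    # \w matches letters, digits and "_", so "foo_bar" stays one token.
    return re.findall(r"\w+", text.lower())

def tokens_split_underscore(text):
    # Only letters and digits count as token characters here,
    # so "_" acts as a separator.
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokens_keep_underscore("foo_bar"))   # ['foo_bar']
print(tokens_split_underscore("foo_bar"))  # ['foo', 'bar']
```

Running the _analyze call above against each version remains the reliable way to see what a given install actually does.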
