Difference in analyzer between 1.3.4 and 0.20.2

Ben_George · November 6, 2014, 12:31pm

I am in the process of upgrading ES from 0.20.2 to 1.3.4. Below are two
requests to test an analyzer / filter, and although the mapping files are
semantically the same, the results are slightly different.

Can anyone provide some insight as to why they differ (the start_offset,
end_offset and position)? Also, does it matter? The reason I noticed
this is that I'm trying to debug some unexpected behaviour with a query
where the result set for "a" is the same as for "aa" or even "axxxxxxxxxx".

The filter config is:

            "filter_edge_ngram_front": {
                "type": "edgeNGram",
                "max_gram": "20",
                "min_gram": "1",
                "side": "front"
            }
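
For readers unfamiliar with the filter, here is a minimal Python sketch of what a front edge n-gram filter with these settings emits for a single term. It is an illustration only, not Elasticsearch or Lucene code; the function name `edge_ngrams_front` is made up for this example.

```python
def edge_ngrams_front(term, min_gram=1, max_gram=20):
    """Return the prefixes of `term` whose lengths run from min_gram to max_gram.

    Hypothetical helper that mimics the *terms* produced by an edgeNGram
    filter with side=front; it ignores offsets and positions entirely.
    """
    longest = min(len(term), max_gram)
    return [term[:n] for n in range(min_gram, longest + 1)]


print(edge_ngrams_front("aa"))  # ['a', 'aa']
print(edge_ngrams_front("b"))   # ['b']
```

With `min_gram` set to 1, every term contributes its first letter as a token, which is why "a" shows up in both outputs below.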

v0.20.2/_analyze?text=aa+b&filters=filter_edge_ngram_front&tokenizer=standard

    {
      "tokens": [
        { "token": "a",  "start_offset": 0, "end_offset": 1, "type": "word", "position": 1 },
        { "token": "aa", "start_offset": 0, "end_offset": 2, "type": "word", "position": 2 },
        { "token": "b",  "start_offset": 3, "end_offset": 4, "type": "word", "position": 3 }
      ]
    }

v1.3.4/_analyze?text=aa+b&filters=filter_edge_ngram_front&tokenizer=standard

    {
      "tokens": [
        { "token": "a",  "start_offset": 0, "end_offset": 2, "type": "word", "position": 1 },
        { "token": "aa", "start_offset": 0, "end_offset": 2, "type": "word", "position": 1 },
        { "token": "b",  "start_offset": 3, "end_offset": 4, "type": "word", "position": 2 }
      ]
    }

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/2e58239a-0091-4d8b-872a-e5b5414b72ad%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

simonw_2 · November 6, 2014, 7:18pm

We fixed the EdgeNGram tokenizer / filter in the 1.x series, but don't ask me
when exactly; I think it was Lucene 4.4 or so. Those offsets are now correct,
whereas they were broken before.
Not sure if this helps you debug your problem.
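
To make the difference concrete, the sketch below reproduces the two `_analyze` outputs from the question in plain Python. It is a simulation written to match the posted results for `aa b`, not the actual Lucene implementation; the names `simple_tokens` and `analyze` and the `legacy` flag are invented for this example, and a regex stands in for the standard tokenizer.

```python
import re


def simple_tokens(text):
    """Stand-in for the standard tokenizer: (term, start, end) per word."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]


def analyze(text, min_gram=1, max_gram=20, legacy=False):
    """Simulate a front edge n-gram filter applied after tokenizing.

    legacy=True  mimics the 0.20.2 output: each gram gets its own end
                 offset and advances the position.
    legacy=False mimics the 1.3.4 output: all grams of one word share
                 that word's offsets and position.
    """
    out = []
    position = 0
    for term, start, end in simple_tokens(text):
        if not legacy:
            position += 1  # one position per source word
        for n in range(min_gram, min(len(term), max_gram) + 1):
            if legacy:
                position += 1  # one position per gram
                gram_end = start + n
            else:
                gram_end = end
            out.append({
                "token": term[:n],
                "start_offset": start,
                "end_offset": gram_end,
                "type": "word",
                "position": position,
            })
    return out


for label, legacy in (("0.20.2-like", True), ("1.3.4-like", False)):
    print(label)
    for tok in analyze("aa b", legacy=legacy):
        print(" ", tok)
```

As for whether it matters: offsets are what highlighting uses to mark up the original text, and positions are what phrase-style queries rely on, so those are the features where the two versions could behave differently. The emitted terms themselves ("a", "aa", "b") are identical in both outputs.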

