Difference in analyzer between 1.3.4 and 0.20.2

I am in the process of upgrading ES from 0.20.2 to 1.3.4. Below are two
requests to test an analyzer / filter, and although the mapping files are
semantically the same, the results are slightly different.

Can anyone provide some insight as to why they differ (in start_offset,
end_offset and position)? Also, does it matter? The reason I noticed
this is that I'm trying to debug some unexpected behaviour with a query
where the result set for "a" is the same as for "aa" or even "axxxxxxxxxx".

The filter config is:

            "filter_edge_ngram_front": {
                "type": "edgeNGram",
                "max_gram": "20",
                "min_gram": "1",
                "side": "front"
            }

v0.20.2/_analyze?text=aa+b&filters=filter_edge_ngram_front&tokenizer=standard

    {
      "tokens": [
        {
          "token": "a",
          "start_offset": 0,
          "end_offset": 1,
          "type": "word",
          "position": 1
        },
        {
          "token": "aa",
          "start_offset": 0,
          "end_offset": 2,
          "type": "word",
          "position": 2
        },
        {
          "token": "b",
          "start_offset": 3,
          "end_offset": 4,
          "type": "word",
          "position": 3
        }
      ]
    }

v1.3.4/_analyze?text=aa+b&filters=filter_edge_ngram_front&tokenizer=standard

    {
      "tokens": [
        {
          "token": "a",
          "start_offset": 0,
          "end_offset": 2,
          "type": "word",
          "position": 1
        },
        {
          "token": "aa",
          "start_offset": 0,
          "end_offset": 2,
          "type": "word",
          "position": 1
        },
        {
          "token": "b",
          "start_offset": 3,
          "end_offset": 4,
          "type": "word",
          "position": 2
        }
      ]
    }
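
The difference between the two outputs can be reproduced with a small simulation. This is only a sketch in Python, not Lucene's actual code: the function names `edge_ngrams_old` and `edge_ngrams_new` are made up for illustration, and the rules they encode are inferred purely from the two responses above. In the 0.20.2 output every gram gets its own position and an end offset trimmed to the gram length; in the 1.3.4 output all grams of one source token share that token's position and its full offsets.

```python
def edge_ngrams_old(tokens, min_gram=1, max_gram=20):
    """Mimic the 0.20.2 output: one position per gram, end offset cut to gram length."""
    out, position = [], 0
    for text, start, _end in tokens:
        for n in range(min_gram, min(max_gram, len(text)) + 1):
            position += 1  # every gram advances the position
            out.append({"token": text[:n], "start_offset": start,
                        "end_offset": start + n, "position": position})
    return out


def edge_ngrams_new(tokens, min_gram=1, max_gram=20):
    """Mimic the 1.3.4 output: grams keep the source token's position and offsets."""
    out, position = [], 0
    for text, start, end in tokens:
        position += 1  # only a new source token advances the position
        for n in range(min_gram, min(max_gram, len(text)) + 1):
            out.append({"token": text[:n], "start_offset": start,
                        "end_offset": end, "position": position})
    return out


# What the standard tokenizer would hand to the filter for "aa b":
# (text, start_offset, end_offset)
source_tokens = [("aa", 0, 2), ("b", 3, 4)]

for label, fn in (("0.20.2-like", edge_ngrams_old), ("1.3.4-like", edge_ngrams_new)):
    print(label)
    for tok in fn(source_tokens):
        print("  ", tok)
```

Note that the token texts ("a", "aa", "b") are identical in both versions; only the offsets and positions change, which mainly affects things like highlighting and phrase or proximity matching.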

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/2e58239a-0091-4d8b-872a-e5b5414b72ad%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

We fixed the EdgeNGram tokenizer / filter in the 1.x series, but don't ask me
exactly when; I think it was Lucene 4.4 or so. Those offsets are now correct,
whereas they were broken before.
Not sure if this helps you debug your problem.
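
As for the query where "a", "aa" and "axxxxxxxxxx" return the same results: the mapping is not shown in this thread, so the following is only a guess at one possible cause. If the edge n-gram analyzer is also applied to the query text at search time, and the query matches on any term, then every query starting with "a" produces the gram "a" and matches every document that has a word starting with "a". The Python sketch below is a toy inverted index, not Elasticsearch; the documents and the helper names (`edge_ngrams`, `analyze`, `search`) are hypothetical.

```python
def edge_ngrams(word, min_gram=1, max_gram=20):
    """Front edge n-grams of a single word, as the filter config above would produce."""
    return [word[:n] for n in range(min_gram, min(max_gram, len(word)) + 1)]


def analyze(text, use_ngrams):
    """Whitespace split (a simplification of the standard tokenizer), optional n-grams."""
    terms = []
    for word in text.split():
        terms.extend(edge_ngrams(word) if use_ngrams else [word])
    return terms


# Hypothetical documents, indexed with the edge n-gram analyzer.
docs = {1: "apple", 2: "aardvark", 3: "banana"}
index = {}
for doc_id, text in docs.items():
    for term in analyze(text, use_ngrams=True):
        index.setdefault(term, set()).add(doc_id)


def search(query, ngram_query):
    """Return ids of documents matching ANY analyzed query term (OR semantics)."""
    hits = set()
    for term in analyze(query, use_ngrams=ngram_query):
        hits |= index.get(term, set())
    return sorted(hits)


for q in ("a", "aa", "axxxxxxxxxx"):
    print(q, "| n-grams at query time:", search(q, True),
          "| plain query terms:", search(q, False))
```

If this turns out to be the cause, the usual remedy is to keep the edge n-gram analyzer for indexing only and configure a separate analyzer without the n-gram filter for search.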

On Thursday, November 6, 2014 1:31:22 PM UTC+1, Ben George wrote:

