Exceptions during Highlighting: InvalidTokenOffsetsException

Hello,

I'm currently working on autocompletion for a large dataset
(GeoNames).
My schema: https://gist.github.com/424ce0205a9a16e7afe1

I've imported a few countries to run some tests. Most queries are
successful, but sometimes the following exception is raised:

[...]
Fetch Failed [Failed to highlight field [name.partial]]]; nested:
InvalidTokenOffsetsException[Token dussvitz exceeds length of provided
text sized 7];
[...]

The query looks like this (see my comment in the gist):
https://gist.github.com/424ce0205a9a16e7afe1#comments

The exception is raised ONLY when highlighting fields like
"name.partial", "name.partial_non_ascii", "alternateNames.partial", etc.

I'm out of ideas and hope that someone can help me out.

I found it :). It seems to be a bug in the "edgeNGram" filter. I switched
the order of my filters from:

index.analysis.analyzer.partial.filter.3: name_ngrams
index.analysis.analyzer.partial.filter.2: asciifolding
index.analysis.analyzer.partial.filter.1: lowercase
index.analysis.analyzer.partial.filter.0: standard

to

index.analysis.analyzer.partial.filter.3: asciifolding
index.analysis.analyzer.partial.filter.2: name_ngrams
index.analysis.analyzer.partial.filter.1: lowercase
index.analysis.analyzer.partial.filter.0: standard

and it works. Could it be this problem:

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201112.mbox/<CAOdYfZU_pe3P-xspsACOhuJwNYTj+=K47uE6a4LYa7=jabB+2A@mail.gmail.com>

https://issues.apache.org/jira/browse/LUCENE-1500

Maybe I should file a bug report.
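For completeness, the full working analyzer might look like this in the index settings. This is only a sketch: the edgeNGram parameter values (min_gram, max_gram, side) are assumptions for illustration; the actual definition is in the gist.

```
index.analysis.analyzer.partial.type: custom
index.analysis.analyzer.partial.tokenizer: standard
index.analysis.analyzer.partial.filter.0: standard
index.analysis.analyzer.partial.filter.1: lowercase
index.analysis.analyzer.partial.filter.2: name_ngrams
index.analysis.analyzer.partial.filter.3: asciifolding
index.analysis.filter.name_ngrams.type: edgeNGram
index.analysis.filter.name_ngrams.min_gram: 2    # assumed value
index.analysis.filter.name_ngrams.max_gram: 15   # assumed value
index.analysis.filter.name_ngrams.side: front    # assumed value
```

The key point is the filter order: the edge n-grams are produced before ASCII folding is applied.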

On 13 Dec., 18:06, BowlingX heidrich.da...@googlemail.com wrote:


Nice catch!

On Wed, Dec 14, 2011 at 12:12 PM, BowlingX heidrich.david@googlemail.com wrote:


So, is it a bug? Or am I doing anything wrong?

2011/12/16 Shay Banon kimchy@gmail.com


I don't know; I need to check in Lucene.

On Sat, Dec 17, 2011 at 1:23 AM, David Heidrich heidrich.david@googlemail.com wrote:


The bug is caused by a wrong calculation of the tokens' offsets. Some filters generate additional tokens whose text length is longer than that of the original token (the one before re-analysis).
This leads to wrongly calculated start and end offsets, which makes highlighting go badly wrong.

The highlighter tries to insert highlighting tags before the start offset and after the end offset of each token. If a token's offsets are wrongly calculated and the token sits at the end of the field value, the highlighter will very likely try to write the tags beyond the field length, raising the exception.
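The failure mode can be sketched in a few lines. This is not the actual Lucene highlighter, just a hypothetical minimal one that wraps each token in tags using its offsets; it shows how a folded token ("Düßvitz" becoming the 8-character "dussvitz") can claim an end offset past the 7-character stored text, reproducing the error message from the thread:

```python
def highlight(text, tokens):
    """Wrap each (term, start, end) token in <em> tags, mimicking how a
    highlighter applies token offsets to the stored field value."""
    out = []
    pos = 0
    for term, start, end in tokens:
        if end > len(text):
            # The situation behind InvalidTokenOffsetsException: the token
            # claims more characters than the stored field contains.
            raise ValueError(
                "Token %s exceeds length of provided text sized %d"
                % (term, len(text))
            )
        out.append(text[pos:start])
        out.append("<em>" + text[start:end] + "</em>")
        pos = end
    out.append(text[pos:])
    return "".join(out)

stored = "Düßvitz"  # 7 characters as stored in the field

# Correct offsets fit the stored text:
print(highlight(stored, [("düßvitz", 0, 7)]))  # <em>Düßvitz</em>

try:
    # After ASCII folding, "dussvitz" has 8 characters; a filter that
    # reports the folded length as the end offset overruns the text:
    highlight(stored, [("dussvitz", 0, 8)])
except ValueError as e:
    print(e)  # Token dussvitz exceeds length of provided text sized 7
```

Swapping the filter order so the n-grams are cut before folding means every emitted token's offsets still fit inside the original text.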

I'm having the same problem with an additional plugin for decompounding German words.

I actually have no idea how to solve it (though I wonder whether other filters, like the stemmers, also generate longer tokens, and how they manage to keep the offsets correct).

It's an old post, but maybe it can be useful to someone...