Stop filter problem: enablePositionIncrements=false is not supported anymore as of Lucene 4.4 as it can create broken token streams

I set up a stop filter as follows:

                "filter_stop": {
                    "type": "stop",
                    "enable_position_increments":"false"
                }

However, when I try to run my application, which then indexes my data, I get
the following error:

Caused by IllegalArgumentException: enablePositionIncrements=false is not supported anymore
as of Lucene 4.4 as it can create broken token streams
->> 40 | checkPositionIncrement in org.apache.lucene.analysis.util.FilteringTokenFilter

Indeed, checking the 4.4 API docs, setEnablePositionIncrements() is
deprecated, so my question is:

How do I get rid of underscores '_' in a shingle filter without this
setting?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


Drop enable_position_increments parameter or set it to true.

In shingle filters, you should set min_shingle_size to 2.
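As a rough sketch of what that advice could look like in index settings, written here as a Python dict so it can be dumped to JSON: only "filter_stop" is a name from this thread; "filter_shingle", "shingle_analyzer", the tokenizer choice and max_shingle_size are assumptions for illustration. This avoids the exception, but it does not by itself change how the gaps left by removed stop words show up in the shingles.

```python
import json

# Hypothetical analysis settings; only "filter_stop" is a name from the thread.
settings = {
    "analysis": {
        "filter": {
            # enable_position_increments is simply omitted (the default is true)
            "filter_stop": {"type": "stop"},
            # "filter_shingle" is an assumed name for the shingle filter
            "filter_shingle": {
                "type": "shingle",
                "min_shingle_size": 2,
                "max_shingle_size": 2,  # assumed value, not from the thread
            },
        },
        "analyzer": {
            # "shingle_analyzer" and the filter order are assumptions
            "shingle_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "filter_stop", "filter_shingle"],
            }
        },
    }
}

print(json.dumps(settings, indent=2))
```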

Jörg


Hi Jörg,

The problem is that the default is set to true, and with it set to true, my
shingle filter results include underscores because of the stop filter in
use, which I don't want. Traditionally the way to get rid of this was to
set enablePositionIncrements to false in the stop filter. This is no longer
possible, hence my predicament. :frowning:

(In reply to Jörg Prante's message of Wednesday, 4 September 2013 14:49:44 UTC+2.)

Hi Jondow,

Is there any progress on the issue?

(In reply to Jondow's message of Wednesday, 4 September 2013, 17:16:47 UTC+3.)

the old disabling of position increments was bogus.
for example a stop filter could remove a token and "move" a synonym
from one word to another.

so this option conflated two unrelated things: whether or not a "gap"
should be introduced when a word is removed, and whether any existing
positions (e.g. from synonyms) should be respected.

in my opinion (but i have not thought it over in a while, look at the
issue age) it's possible to prevent the introduction of gaps while
still respecting existing ones:
https://issues.apache.org/jira/browse/LUCENE-4065
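To make the "gap" idea concrete, here is a simplified Python model, not Lucene's actual implementation. Tokens are (term, position increment) pairs; stop_filter and shingles are hypothetical helpers, and keep_gaps=False only approximates the old enablePositionIncrements=false behaviour. It shows why a removed stop word turns into a filler "_" in the shingles when the gap is kept.

```python
from typing import List, Set, Tuple

Token = Tuple[str, int]  # (term, position increment)


def stop_filter(tokens: List[Token], stop_words: Set[str],
                keep_gaps: bool = True) -> List[Token]:
    """Drop stop words. With keep_gaps, the increments of removed tokens
    are added to the next surviving token, which leaves a position gap."""
    out: List[Token] = []
    pending = 0
    for term, inc in tokens:
        if term in stop_words:
            pending += inc
        else:
            out.append((term, inc + pending if keep_gaps else inc))
            pending = 0
    return out


def shingles(tokens: List[Token], size: int = 2, filler: str = "_") -> List[str]:
    """Build word n-grams, putting `filler` into every empty position."""
    slots: List[str] = []
    for term, inc in tokens:
        slots.extend([filler] * (inc - 1))  # one filler per skipped position
        slots.append(term)
    result = []
    for i in range(len(slots) - size + 1):
        window = slots[i:i + size]
        if all(t == filler for t in window):
            continue  # skip windows made only of fillers
        result.append(" ".join(window))
    return result


tokens = [("quick", 1), ("the", 1), ("fox", 1)]
with_gap = stop_filter(tokens, {"the"})
without_gap = stop_filter(tokens, {"the"}, keep_gaps=False)
print(shingles(with_gap))
print(shingles(without_gap))
```

In this model the gap-preserving filter yields shingles containing the filler, while dropping the gap yields a plain "quick fox"; the real filters handle more cases (synonyms at increment 0, unigram output, offsets) than this sketch does.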

(In reply to Michael Cheremuhin's message of Wed, Dec 18, 2013 at 11:54 PM.)

Any updates from the Elastic team on this? It seems that there's no reliable way to remove "_" so far...

There is a 'filler_token' parameter that is configurable in the shingle token filter. Its default is "_". I believe we can change it to "" so that an empty string is inserted wherever a stop word was removed. Not sure if this is the best practice though.
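A minimal sketch of that suggestion, again as a Python dict that can be serialised to JSON for the index settings. The filter name "filter_shingle" and the shingle sizes are assumptions for illustration; the filler_token parameter is the shingle filter option referred to above. Whether the resulting shingles still carry leftover separator whitespace should be checked with the analyze API before relying on this.

```python
import json

# Hypothetical shingle filter definition; "filter_shingle" is an assumed name.
shingle_settings = {
    "analysis": {
        "filter": {
            "filter_stop": {"type": "stop"},
            "filter_shingle": {
                "type": "shingle",
                "min_shingle_size": 2,   # assumed value
                "max_shingle_size": 2,   # assumed value
                "filler_token": "",      # replaces the default "_"
            },
        }
    }
}

print(json.dumps(shingle_settings, indent=2))
```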
