Stopword functionality suppression


(John Ohno) #1

We are trying to make Elasticsearch work under constraints inherited from an
older system that didn't support stopwords. We tried setting the stopword
list in JSON to a null string both at index creation time and at search
time, but we have discovered that this doesn't allow one to search the
index for English stopwords. Is there a straightforward way to turn off
stopwords globally, aside from diking out the code that handles them?
Alternatively, if we keep setting the stopword list to a null string, is
there a third place where the stopword list must be set in order to
suppress this behavior?


(David Pilato) #2

You can define your own mapping and use a standard analyzer with an empty
stopwords collection.

http://www.elasticsearch.org/guide/reference/index-modules/analysis/standard-analyzer.html

I suppose it could work. I never tested an empty stopword list.

HTH
David

--
David Pilato
http://www.scrutmydocs.org/
http://dev.david.pilato.fr/
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs
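As a rough illustration of David's suggestion, the sketch below only builds the JSON body one might send when creating an index, defining an analyzer of type "standard" with an empty stopwords collection. The analyzer name `no_stop_standard` and the helper function are hypothetical, and, as David says, whether a given Elasticsearch version honours an empty list is untested here.

```python
import json


def build_index_settings(analyzer_name="no_stop_standard"):
    """Build an index-creation body with a stopword-free standard analyzer.

    analyzer_name is a hypothetical name chosen for this sketch.
    """
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    analyzer_name: {
                        "type": "standard",
                        # Empty collection: the intent is that nothing is
                        # treated as a stopword (assumption, untested).
                        "stopwords": [],
                    }
                }
            }
        }
    }


body = build_index_settings()
print(json.dumps(body, indent=2))
```

The body would still have to be sent to the cluster, and fields would still have to reference the analyzer by name in the mapping.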


(John Ohno) #3

I finally got that to work, but it turned out that, to do so, you have
to specify the custom analyzer for the index and for every field in the
mapping individually, then add a blank stopwords list to the query at
search time.

While this is feasible for me (I generate the searches, the mappings, and
the index requests), it seems a bit strange that specifying it at the top
level doesn't trickle down. For instance, the analyzer attribute for the field
mapping does not appear to cascade to the index_analyzer attribute for the
same field's mapping, and even if you specify a stopword-less
search_analyzer for every field, stopwords are not excluded unless the
query passes a blank stopword list.

Surely there must be other people who want to remove stopwords or change
them globally?
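For anyone generating mappings programmatically, as described above, a minimal sketch of the per-field workaround might look like this. It only manipulates a plain dictionary; the field names, the "string" field type, and the analyzer name are assumptions made for illustration, while `index_analyzer` and `search_analyzer` are the attribute names mentioned in the post.

```python
import copy


def apply_analyzer(properties, analyzer_name, text_types=("string",)):
    """Return a copy of a mapping's properties with the analyzer set per field.

    text_types is an assumption about which field types are analyzed.
    """
    result = copy.deepcopy(properties)
    for spec in result.values():
        if spec.get("type") in text_types:
            # Set both attributes explicitly, since one does not appear
            # to cascade to the other.
            spec["index_analyzer"] = analyzer_name
            spec["search_analyzer"] = analyzer_name
    return result


# Hypothetical fields, for illustration only.
fields = {"name": {"type": "string"}, "age": {"type": "integer"}}
mapped = apply_analyzer(fields, "no_stop_standard")
print(mapped)
```

The original dictionary is left untouched, so the same field list can be reused with a different analyzer name.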


(Igor Motov) #4

You can override the default analyzer globally by adding these lines to
elasticsearch.yml:

index.analysis.analyzer.default:
    type: custom
    tokenizer: standard
    filter: standard, lowercase

This will create a custom analyzer with the Standard tokenizer and two
filters: Standard and Lowercase. This way your custom analyzer will be
identical to the standard analyzer, except that it will not use the stopword
filter. Because it's named "default", Elasticsearch will use it wherever an
analyzer is not explicitly set.

Igor
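To make the effect of dropping the stopword filter concrete, here is a toy model of an analyzer chain in plain Python. It is not Elasticsearch or Lucene code: the regex tokenizer is a crude stand-in for the standard tokenizer, and the stopword set is a small illustrative subset rather than the real built-in list.

```python
import re

# Small illustrative subset, NOT the actual built-in English list.
ILLUSTRATIVE_STOPWORDS = frozenset(
    {"a", "an", "and", "is", "of", "the", "this", "to"}
)


def analyze(text, use_stop_filter):
    """Tokenize, lowercase, and optionally drop stopwords."""
    tokens = re.findall(r"\w+", text)        # stand-in tokenizer
    tokens = [t.lower() for t in tokens]     # lowercase filter
    if use_stop_filter:
        tokens = [t for t in tokens if t not in ILLUSTRATIVE_STOPWORDS]
    return tokens


print(analyze("This is a Test", use_stop_filter=True))   # ['test']
print(analyze("This is a Test", use_stop_filter=False))  # ['this', 'is', 'a', 'test']
```

Leaving the stop filter out of the chain is what the custom "default" analyzer above does, so every token survives to be indexed and searched.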


(John Ohno) #5

This looks like exactly what I'm looking for. Thanks!


(Schmurfy) #6

Hi, I was looking for exactly the same thing, but I have no idea what to put
in the config file to keep the default behavior with an empty stopwords
list.
Could you post what you ended up adding to your config file, please?

Thanks.


(Igor Motov) #7

Hi Julien,

The fragment in my post is exactly what you need to add to the
elasticsearch.yml file:

index.analysis.analyzer.default:
    type: custom
    tokenizer: standard
    filter: standard, lowercase

Igor


(Schmurfy) #8

Thanks!


(Brian Yoder) #9

I added these lines to my elasticsearch.yml configuration file, and
restarted ElasticSearch.

But still no joy. When I try to use the default (or standard) analyzer, it
still throws away the built-in English stop words. For example:

curl -XGET 'localhost:9200/_analyze?analyzer=default&pretty=true' -d 'This is a Test' && echo
{
  "tokens" : [ {
    "token" : "test",
    "start_offset" : 10,
    "end_offset" : 14,
    "type" : "",
    "position" : 4
  } ]
}

So close, and yet so far. The database I'm creating consists entirely of
names (people, businesses, places) and stop words just get in the way. A J
FOYT's first initial should not be considered a stop word:

curl -XGET 'localhost:9200/_analyze?analyzer=default&pretty=true' -d 'A J Foyt' && echo
{
  "tokens" : [ {
    "token" : "j",
    "start_offset" : 2,
    "end_offset" : 3,
    "type" : "",
    "position" : 2
  }, {
    "token" : "foyt",
    "start_offset" : 4,
    "end_offset" : 8,
    "type" : "",
    "position" : 3
  } ]
}


(Igor Motov) #10

If you don't specify an index name in the analyze API request, it can only
work with built-in analyzers. So when you ask for the "default" analyzer,
it returns the built-in default analyzer instead of the one that you
configured in the config file. Try using _analyze with an index:

curl -XGET 'localhost:9200/my_index/_analyze?analyzer=default&pretty=true' -d 'This is a Test' && echo
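The only difference between the failing and the working request is the index name in the path. A small sketch that builds both URLs side by side may make that easier to see. The helper name is hypothetical, the host and index name are taken from the commands in this thread, and nothing is actually sent to a server.

```python
from urllib.parse import urlencode


def analyze_url(host, analyzer, index=None):
    """Build an _analyze URL, scoped to an index when one is given."""
    path = f"/{index}/_analyze" if index else "/_analyze"
    query = urlencode({"analyzer": analyzer, "pretty": "true"})
    return f"http://{host}{path}?{query}"


# Without an index: only built-in analyzers are visible.
print(analyze_url("localhost:9200", "default"))
# With an index: the analyzer configured as "default" should be used.
print(analyze_url("localhost:9200", "default", index="my_index"))
```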


(Brian Yoder) #11

Thanks, Igor! That was exactly what I needed to do! Adding the index name
caused the configured default to be used and no more stop words!

Then deleting and loading the data re-created the index (it fails, of
course, if the specified index doesn't exist yet), and now the queries
(which I've directed to analyzer("default"), just in case it needs some
help!) also find those stop words as separate terms (and not just within
phrases). Very cool!

