Stop words filter not working

Hi,

I have the following problem: I have a list of company names but want to
exclude the "form of the organization" (like Limited, LLC etc.) by using my
own stopwords filter. So far so good, but I can't get it to work. It does
the indexing, everything is searchable, but when searching for "LLC" etc I
am still getting matches. Here is my config (I am using PHP syntax here,
but I guess the values are obvious):

'analysis' => array(
    'analyzer' => array(
        'name_analyzer' => array(
            'type' => 'custom',
            'tokenizer' => 'standard',
            'filter' => array('my_stopwords', 'lowercase', 'icu_normalizer', 'ngram')
        ),
        'address_analyzer' => array(
            'type' => 'custom',
            'tokenizer' => 'standard',
            'filter' => array('standard', 'ngram')
        ),
        'country_analyzer' => array(
            'type' => 'custom',
            'tokenizer' => 'lowercase',
            'filter' => array('country_synonyms')
        ),
    ),
    'filter' => array(
        'ngram' => array(
            'type' => 'nGram',
            'min_gram' => 1,
            'max_gram' => 5,
        ),
        'country_synonyms' => array(
            'type' => 'synonym',
            'synonyms' => array('some synonyms that work perfectly')
        ),
        'my_stopwords' => array(
            'type' => 'stop',
            'stopwords' => array('llc', 'gmbh' /* etc. */),
            'ignore_case' => true
        )
    )
)

And here is my mapping:

'names' => array(
    'type' => 'string',
    'analyzer' => 'name_analyzer',
    'index_analyzer' => 'name_analyzer',
    'search_analyzer' => 'name_analyzer',
    'include_in_all' => true
),
'addresses' => array(
    'dynamic' => false,
    'analyzer' => 'address_analyzer',
    'index_analyzer' => 'address_analyzer',
    'search_analyzer' => 'address_analyzer',
    'properties' => array(
        'street' => array(
            'type' => 'string',
            'analyzer' => 'address_analyzer',
            'index_analyzer' => 'address_analyzer',
            'search_analyzer' => 'address_analyzer',
            'include_in_all' => true
        ),
        'city' => array(
            'type' => 'string',
            'analyzer' => 'address_analyzer',
            'index_analyzer' => 'address_analyzer',
            'search_analyzer' => 'address_analyzer',
            'include_in_all' => true
        ),
        'state' => array(
            'type' => 'string',
            'analyzer' => 'address_analyzer',
            'index_analyzer' => 'address_analyzer',
            'search_analyzer' => 'address_analyzer',
            'include_in_all' => true
        ),
        'country' => array(
            'type' => 'string',
            'analyzer' => 'country_analyzer',
            'index_analyzer' => 'country_analyzer',
            'search_analyzer' => 'country_analyzer',
            'include_in_all' => true
        )
    )
)

A GET request to myindex/_settings shows that the values are set correctly.
Example:

"index.analysis.filter.my_stopwords.stopwords.90":"llc"

I feel pretty lost here. So any help would be really appreciated!

Thanks in advance,

Hannes

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group, send email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hi Hannes,

What does the query look like?

Martijn

--
Kind regards,

Martijn van Groningen

Hi,

I am using a combination of a query string (user searches via a "Google
Search" like textbox) and a fuzzy query to be able to find misspelled names
etc. Maybe the fuzzy search makes the stopwords useless? And if so, would
there be a way around that?

{"bool": {
  "should": [
    {"query_string": {"query": "gmbh", "default_operator": "AND"}},
    {"fuzzy_like_this": {"boost": 1, "like_text": "gmbh", "min_similarity": 0.5,
                         "prefix_length": 0, "max_query_terms": 25}}
  ]
}}

Thanks,

Hannes

Hi Hannes,

You're not specifying a field for either the query_string or the
fuzzy_like_this query, so both fall back to the _all field, which by
default is not analyzed with your name_analyzer.
I think it should work if you specify the names field: for the
query_string query use the default_field option, and for
fuzzy_like_this use the fields option.

Martijn

Thanks for your help, but it still doesn't work :( Here is the new query:

{"query": {"bool": {
  "should": [
    {"query_string": {"query": "test gmbh", "fuzzy_prefix_length": 3,
                      "default_operator": "AND", "default_field": "names"}},
    {"fuzzy_like_this": {"fields": ["names"], "boost": 1, "like_text": "test gmbh",
                         "min_similarity": 0.2, "prefix_length": 0, "max_query_terms": 25}}
  ]
}}}

Here's an example: When searching for a company that was indexed as
"TestCompany" a query like "TestComp Limited" (notice the "misspelled"
company name) will find a lot of "Limited" companies while "TestCompany" is
the string I am really interested in. I simply want "TestCompany" (or a
similar name via fuzzy search) to be first in the list, effectively
ignoring all legal forms defined in my stopwords list.

I think the reason you see other companies that contain these stopwords
is not a match on a stopword, but the low min_similarity of the
fuzzy_like_this query. I suggest that you increase it to something
like 0.7. You will have to test and play around with it a bit.

If you really think that you have a hit on a stopword (which I don't
think is happening), you can verify this with the explain option:
http://www.elasticsearch.org/guide/reference/api/search/explain.html
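
As a rough sketch of what such a request could look like: the search body below sets "explain" to true next to the query, so that each hit comes back with an _explanation of how its score was computed. It is written as a Python dict only so that it can be printed as JSON. The helper name build_explain_request and the sample search text are made up for illustration; the field name names comes from your mapping, and 0.7 is just the value suggested above.

```python
import json


def build_explain_request(text, field="names", min_similarity=0.7):
    """Build a search body that asks Elasticsearch to explain each hit's score.

    Hypothetical helper for illustration; only the JSON it produces matters.
    """
    return {
        # With "explain": true, every hit carries an "_explanation" tree.
        "explain": True,
        "query": {
            "fuzzy_like_this": {
                "fields": [field],
                "like_text": text,
                "min_similarity": min_similarity,
            }
        },
    }


if __name__ == "__main__":
    # Print the body that would be sent to the _search endpoint.
    print(json.dumps(build_explain_request("gmbh"), indent=2))
```

If a stopword really is being matched, the explanation of a hit will name the term, for example something of the form names:gmbh.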

Martijn

I've played around with min_similarity quite a bit and it is currently
set to 0.7, so I suppose this isn't the reason. But thanks for the hint
regarding the explanation. For the query "gmbh" (which is a stopword) it
returns the following:

{"value": 0.625, "description": "fieldWeight(names:gmbh in 4993), product of:",
 "details": [
   {"value": 1, "description": "tf(termFreq(names:gmbh)=1)"},
   {"value": 1, "description": "idf(docFreq=40, maxDocs=7428)"},
   {"value": 0.625, "description": "fieldNorm(field=names, doc=4993)"}
 ]}

I don't really get the meaning of this, but it seems that there ARE hits on
gmbh as if it weren't a stopword, although the _settings show that gmbh is
properly configured:

"index.analysis.filter.my_stopwords.stopwords.19" : "gmbh"

I am going nuts here :)

Ok, so after fiddling around a bit I seem to have found the reason for this
behaviour:

curl -XGET 'http://127.0.0.1:9200/sanctionlists/_analyze?pretty=1&text=gmbh&analyzer=name_analyzer'

results in:

{
"tokens" : [ ]
}

whereas

curl -XGET 'http://127.0.0.1:9200/sanctionlists/_analyze?pretty=1&text=Gmbh&analyzer=name_analyzer'

results in:

{
"tokens" : [ {
"token" : "g",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 1
}, {
"token" : "m",
"start_offset" : 1,
"end_offset" : 2,
"type" : "word",
"position" : 2
}, {
"token" : "b",
"start_offset" : 2,
"end_offset" : 3,
"type" : "word",
"position" : 3
}, {
"token" : "h",
"start_offset" : 3,
"end_offset" : 4,
"type" : "word",
"position" : 4
}, {
"token" : "gm",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 5
}, {
"token" : "mb",
"start_offset" : 1,
"end_offset" : 3,
"type" : "word",
"position" : 6
}, {
"token" : "bh",
"start_offset" : 2,
"end_offset" : 4,
"type" : "word",
"position" : 7
}, {
"token" : "gmb",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 8
}, {
"token" : "mbh",
"start_offset" : 1,
"end_offset" : 4,
"type" : "word",
"position" : 9
}, {
"token" : "gmbh",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 10
} ]
}

So the analyzer does not seem to ignore the case of the stopword even
though ignore_case is set to true. I put a lowercase filter in FRONT of the
stopword filter, and this did the trick for the examples above. The next
thing was that the fuzzy_like_this query does not seem to analyze the query
text (splitting it into tokens). So I tried a match query on the names
field with the proper fuzziness (0.5), setting the analyzer to
"name_analyzer", and this actually worked!

I am not sure whether this is a bug or intended behaviour when using FLT
(fuzzy_like_this) queries. Any thoughts?
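
The match query mentioned above might look roughly like the following sketch. It is built as a Python dict and printed as JSON purely for illustration; the helper name build_fuzzy_match_query and the sample search text are made up, while the field (names), the fuzziness (0.5) and the analyzer (name_analyzer) are the ones described in this post.

```python
import json


def build_fuzzy_match_query(text, field="names", fuzziness=0.5,
                            analyzer="name_analyzer"):
    """Build a fuzzy match query on one field with an explicit analyzer.

    Hypothetical helper for illustration; only the JSON it produces matters.
    """
    return {
        "query": {
            "match": {
                field: {
                    "query": text,
                    "fuzziness": fuzziness,
                    # Explicit here, although match would otherwise fall back
                    # to the analyzer configured for the field.
                    "analyzer": analyzer,
                }
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_fuzzy_match_query("TestComp Limited"), indent=2))
```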

Thanks,

Hannes

> So the analyzer seems not to ignore the case of the stopword even if
> ignore_case is set to true. So I set a lowercase filter in FRONT of the
> stopword filter and this did the trick for the examples above.

Of course! Too bad I missed that one...
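
To illustrate why the order matters, here is a toy model of the filter chain in plain Python. It is not how Elasticsearch or Lucene implement token filters: lowercase_filter, stop_filter and analyze are hypothetical stand-ins, the ngram step is left out, and the stop filter is assumed to compare tokens case-sensitively, which is what the _analyze output for "Gmbh" suggested.

```python
# Toy stop list; the real one is the my_stopwords filter from the settings.
STOPWORDS = {"llc", "gmbh"}


def lowercase_filter(tokens):
    """Lowercase every token."""
    return [t.lower() for t in tokens]


def stop_filter(tokens):
    """Drop stop words, comparing case-sensitively (an assumption)."""
    return [t for t in tokens if t not in STOPWORDS]


def analyze(text, filters):
    """Split on whitespace, then run the token filters in the given order."""
    tokens = text.split()
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


# Order from the original settings: stop words first, then lowercase.
original_order = [stop_filter, lowercase_filter]
# Order after the fix: lowercase first, then stop words.
fixed_order = [lowercase_filter, stop_filter]

print(analyze("TestCompany Gmbh", original_order))  # ['testcompany', 'gmbh']
print(analyze("TestCompany Gmbh", fixed_order))     # ['testcompany']
```

In the real analyzer this corresponds to listing lowercase before my_stopwords in the filter array of name_analyzer.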

> The next thing was that the fuzzy_like_this query does not analyze a query
> (splitting it into tokens) as it seems. So I tried a match query on the
> names field with the proper fuzziness (0.5), setting the analyzer to
> "name_analyzer" and this actually worked!
>
> I am not sure whether this is a bug or intended when using FLT queries. Any
> thoughts?

This is not a bug, but just how the fuzzy_like_this query works. It
does analyze the query, but uses a default analyzer (unless you
override it via the analyzer option). The match query by default uses
the analyzer of the specified field. In your case it may be better to
use the fuzzy_like_this_field query instead of the fuzzy_like_this
query; it takes the field as a required parameter and by default uses
the search analyzer of that field.
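
A minimal sketch of such a fuzzy_like_this_field query, again as a Python dict printed as JSON. The helper name build_flt_field_query and the sample search text are made up for illustration; the parameter values are simply carried over from the queries earlier in this thread and would need tuning.

```python
import json


def build_flt_field_query(text, field="names", min_similarity=0.7):
    """Build a fuzzy_like_this_field query for a single field.

    Hypothetical helper for illustration; only the JSON it produces matters.
    """
    return {
        "query": {
            "fuzzy_like_this_field": {
                # The field name is the key, so no "fields" list is needed.
                field: {
                    "like_text": text,
                    "min_similarity": min_similarity,
                    "prefix_length": 0,
                    "max_query_terms": 25,
                }
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_flt_field_query("TestComp Limited"), indent=2))
```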

Martijn
