Ignore small words in queries

Hi,
I'd like to know how I could remove small words from search results. For
your information I am using a match query.
Thanks

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

On Fri, 2013-02-01 at 01:57 -0800, benkunz wrote:

> Hi,
> I'd like to know how I could remove small words from search results.
> For your information I am using a match query.

Not sure what you want to achieve exactly. Could you explain more?

clint

No problem, I'll explain.

I run the following command:

curl -XGET "http://localhost:9200/jdbc/article/_search?pretty=true" -d "{ """query""" : {"""match""" : {"""designation""" : {"""query""":"""the economy rocks""","""fuzziness""":"""0.5"""} }}}"

It returns the following results:

the economy sucks
the economy rocks
the business

The third result is returned because "the" is found in the search string
"the economy rocks". What I want is to tell Elasticsearch to ignore words
smaller than 3 characters, so that my query would only return:

the economy sucks
the economy rocks

> curl -XGET "http://localhost:9200/jdbc/article/_search?pretty=true" -d "{ """query""" : {"""match""" : {"""designation""" : {"""query""":"""the economy rocks""","""fuzziness""":"""0.5"""} }}}"

btw, using single quotes makes the above more readable:

curl -XGET 'http://localhost:9200/jdbc/article/_search?pretty=true' -d '
{
  "query" : {
    "match" : {
      "designation" : {
        "query" : "the economy rocks",
        "fuzziness" : "0.5"
      }
    }
  }
}'

The problem here is not small words, but the fact that your query is
looking for "the OR economy OR rocks".

You could change that by setting the "operator" to "and", in which case
it will require all of the words.

Or you could set "minimum_should_match" to (e.g.) "70%", in which case
it'd require at least two of the three words in the query.

Also, words like 'the' are usually considered irrelevant to the query,
and are removed with the stopwords token filter by default. You have
obviously used a different analyzer from the default "standard"
analyzer, otherwise 'the' would have been removed.
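
For illustration, here is a minimal sketch of the request bodies for the
two options above ("operator" and "minimum_should_match"). It is written
in Python rather than curl so the bodies are easy to build and inspect.
The helper name match_query is made up for this example, and fuzziness
is left out to keep it short.

```python
import json


def match_query(text, field="designation", **options):
    """Build a match query body; extra keyword arguments become match options.

    match_query is a hypothetical helper for this example, not a client API.
    """
    params = {"query": text}
    params.update(options)
    return {"query": {"match": {field: params}}}


# Option 1: require every query term to be present in the field.
all_terms = match_query("the economy rocks", operator="and")

# Option 2: require at least 70% of the terms (two of the three here).
most_terms = match_query("the economy rocks", minimum_should_match="70%")

# Print the JSON that would be passed to curl with -d.
print(json.dumps(all_terms, indent=2))
print(json.dumps(most_terms, indent=2))
```

Either printed body can be sent to the same _search URL used earlier in
the thread.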

clint

Thanks for the advice.
How do I check what analyzer I am using? I don't think I have modified it;
I just started playing with Elasticsearch and haven't done much
configuration.
Besides the request I showed you, I only created a JDBC river that reads my
MySQL table and feeds the index.

Clinton, I know you've given me workarounds, but do you know if there is a
way to tell Elasticsearch to ignore words smaller than n characters?

On Fri, 2013-02-01 at 03:47 -0800, benkunz wrote:

> Thanks for the advice.
> How do I check what analyzer I am using? I don't think I have
> modified it; I just started playing with Elasticsearch and haven't
> done much configuration.
> Besides the request I showed you, I only created a JDBC river that
> reads my MySQL table and feeds the index.

You can use the index-settings API to retrieve info about what analyzers
you have configured, and the mappings API to retrieve info about how
your fields are mapped/configured.
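
As a rough sketch of those two calls, assuming the node at localhost:9200
and the "jdbc" index from the search URL earlier in the thread: the
function names below are invented for the example, and fetch() only works
with a node actually running.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://localhost:9200"  # assumption: local node, as in the thread
INDEX = "jdbc"                      # index name taken from the search URL


def settings_url(base=BASE_URL, index=INDEX):
    """URL of the index settings endpoint (shows configured analyzers)."""
    return f"{base}/{index}/_settings?pretty=true"


def mapping_url(base=BASE_URL, index=INDEX):
    """URL of the mapping endpoint (shows how each field is mapped)."""
    return f"{base}/{index}/_mapping?pretty=true"


def fetch(url):
    """GET a URL and decode the JSON response; needs a running node."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


print(settings_url())
print(mapping_url())
# With a node running, for example:
# print(json.dumps(fetch(mapping_url()), indent=2))
```

The same URLs can be passed straight to curl -XGET.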

clint

On Fri, 2013-02-01 at 04:41 -0800, benkunz wrote:

> Clinton, I know you've given me workarounds, but do you know if there
> is a way to tell Elasticsearch to ignore words smaller than n
> characters?

You'd have to filter them out at index time or query time, or both.

Probably the easiest in this case would be to just remove them from the
query string in your application.
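
A minimal sketch of that application-side cleanup, assuming words are
separated by whitespace. The function name strip_short_words is made up
for the example. Note that "the" has three characters, so the minimum
length has to be 4 for it to be dropped.

```python
def strip_short_words(text, min_length=4):
    """Drop whitespace-separated words shorter than min_length characters."""
    return " ".join(word for word in text.split() if len(word) >= min_length)


cleaned = strip_short_words("the economy rocks")
print(cleaned)  # economy rocks

# The cleaned string then goes into the same match query as before.
body = {
    "query": {
        "match": {
            "designation": {"query": cleaned, "fuzziness": "0.5"}
        }
    }
}
```

If every word is shorter than the minimum, the result is an empty string,
so the application should decide what to do in that case before sending
the query.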

Alternatively, look at using stopwords, or creating a custom analyzer
which removes any tokens shorter than $length.

clint

There is a Length Token Filter:

If you filter out words at index time, you must also filter them out at
query time. You can get away with only using the filter at query time,
depending on other factors.
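
Here is a sketch of what index settings using the length token filter
might look like, built in Python so the JSON can be printed and checked.
The names drop_short_words and no_short_words are invented for the
example, and the minimum of 4 is an assumption chosen so that "the"
(three characters) is removed. The field mapping would still have to
reference the analyzer for it to take effect.

```python
import json

MIN_LENGTH = 4  # "the" has three characters, so the minimum must be 4 to drop it

index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                # Length token filter: keeps only tokens of at least `min` characters.
                "drop_short_words": {"type": "length", "min": MIN_LENGTH}
            },
            "analyzer": {
                # Custom analyzer (hypothetical name) that applies the filter above.
                "no_short_words": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "drop_short_words"],
                }
            },
        }
    }
}

# Print the JSON body that would be supplied when creating the index.
print(json.dumps(index_settings, indent=2))
```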

--
Ivan
