My use case relates to the percolator function in ES, but I imagine it's
just as valid for traditional document indexing.
If I set up a percolator for the query "empire", i.e. empire with
quotation marks around it, I get matches back for documents that have the
word 'empired'. For queries without quotation marks I need matches returned
for the plural forms, so I can't remove the stemmer altogether.
At the moment the only way I can see to achieve what I want is to set up
the percolators using different analyzers depending on whether I want to
match plurals or not, identified by the presence of quotation marks in the
query. I would then need to percolate two copies of every document, one
using a stemmer and one without. This will halve the performance and also
doesn't allow for queries like '"empire" AND fight', which should match
only the singular for empire but plural forms for fight. Is there a nicer
way to achieve the desired result? Thanks.
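To make the workaround concrete, here is a minimal Python sketch of the routing step only, i.e. deciding which set of percolators a query gets registered in. The set names "exact" and "stemmed" and the function name are hypothetical, and nothing here calls Elasticsearch.

```python
import re

# Matches any double-quoted run of characters, e.g. "empire".
QUOTED = re.compile(r'"[^"]+"')


def pick_percolator_set(query: str) -> str:
    """Return the (hypothetical) percolator set a query would be registered in.

    Queries containing a quoted term go to the non-stemmed set,
    everything else goes to the stemmed set.
    """
    return "exact" if QUOTED.search(query) else "stemmed"


print(pick_percolator_set('"empire"'))            # quoted: non-stemmed set
print(pick_percolator_set('empire'))              # unquoted: stemmed set
print(pick_percolator_set('"empire" AND fight'))  # mixed: whole query lands in one set
```

The last call shows the limitation: a mixed query is routed as a whole, so 'fight' loses stemming along with 'empire'.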
I set up a fresh index with the snowball stemmer. I then create a percolator
for the term "empire" (with the quotes). I then percolate a document with
the text 'empire', which correctly matches. I then percolate another
document with the text 'empires' and again the percolator matches. This
second example is matching a stemmed version of the original percolator,
however I was hoping that it wouldn't match since the percolator had the
search term in quotations, indicating the need for an exact match.
If you search Google for 'car' you will get matches for 'cars', however if
you search for "car" (with quotes) you will only get matches for 'car', not
the plural form. I was hoping to get this natural language functionality
out of the box with ES. I'm pretty sure Lucene doesn't natively support
this so it's a pretty tall order. As I said previously I can create two
sets of percolators, one with stemming and one without. Then I can
register queries that use quotes with the non-stemmed and all others with
the stemmed, then percolate each document against both sets. This is good
enough for the moment but it would be really great to handle mixed queries,
e.g. '"car" AND fight' matching 'car ... fights', whereby stemming has been
applied to the fight term but not the car term.
I'm just wondering if there is a cleaner way to achieve what I want with
the existing codebase, rather than filing a feature request.
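One possible direction for the mixed case, sketched here as an assumption rather than something confirmed in this thread: analyze the same text into two fields, one stemmed and one not, and rewrite each query so that quoted terms target the non-stemmed field. The field names `body` and `body.exact` are hypothetical, and the rewriter below only handles flat queries with AND/OR/NOT (no parentheses).

```python
import re

# Either a double-quoted phrase (group 1) or a bare whitespace-delimited token (group 2).
TOKEN = re.compile(r'"([^"]+)"|(\S+)')
OPERATORS = {"AND", "OR", "NOT"}


def rewrite_query(query, stemmed_field="body", exact_field="body.exact"):
    """Prefix each term with a field name: quoted terms go to the
    non-stemmed field, bare terms to the stemmed field.
    Both field names are illustrative assumptions."""
    parts = []
    for quoted, bare in TOKEN.findall(query):
        if quoted:
            parts.append(f'{exact_field}:"{quoted}"')
        elif bare in OPERATORS:
            parts.append(bare)  # boolean operators pass through unchanged
        else:
            parts.append(f"{stemmed_field}:{bare}")
    return " ".join(parts)


print(rewrite_query('"car" AND fight'))
# body.exact:"car" AND body:fight
```

With this shape a document is percolated once, and the per-term choice of field carries the stemmed/exact distinction.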
Note though, there is a caveat here. Remember that the analyzer is also
applied when indexing data, so 'empires' will be indexed as 'empire' (with
stemming). If you then don't apply any stemming when searching, a search for
"empires" (with a non-stemming analyzer) will not find anything.
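The caveat can be illustrated with a toy model. The `toy_stem` function below is a stand-in that only strips a trailing "s"; it is not the Snowball stemmer, and the "index" is just a Python set of terms.

```python
def toy_stem(token):
    # Toy stand-in for a real stemmer: strip one trailing "s" from longer words.
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def analyze(text, stem):
    # Lowercase and split on whitespace, optionally stemming each token.
    tokens = text.lower().split()
    return [toy_stem(t) for t in tokens] if stem else tokens


# Index time: the document text 'empires' is stored as the stemmed term.
indexed_terms = set(analyze("empires", stem=True))


def matches(query, stem):
    # A query matches if every analyzed query term is present in the index.
    return all(term in indexed_terms for term in analyze(query, stem))


print(indexed_terms)                   # {'empire'}
print(matches("empires", stem=True))   # True: both sides stemmed
print(matches("empires", stem=False))  # False: index holds 'empire', not 'empires'
print(matches("empire", stem=False))   # True: unstemmed query still hits the stemmed term
```

The last line is the other half of the problem: once only stemmed terms are stored, an exact search for 'empire' cannot tell whether the document originally said 'empire' or 'empires'.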