Extracting fuzzy match terms


(Graham Turner) #1

Hi,

I'm working on a proof-of-concept for a client, replacing an existing
legacy search system with an Elasticsearch-based alternative. One of the
requirements that comes from the existing system is that, when performing a
fuzzy or wildcard search, the user can view all the matching terms, and
include/exclude them manually from the subsequent search.

Thus, if a fuzzy search for 'graham' is submitted (or a wildcard like
'grm'), it might match grayam, graeme, grahum, grahem, etc. The users
want to be able to see this list of matched terms, then, for instance,
exclude 'grayam' from the expanded terms list, so that all the other
expansions are used, but not the specifically excluded one.
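To make the second step concrete, here is a minimal sketch of how the follow-up search could be built once the list of expansions is known. The field name `name` and the helper `build_followup_query` are made up for illustration; the only Elasticsearch feature assumed is the standard `terms` query, which matches documents containing any of the listed indexed terms.

```python
def build_followup_query(field, expanded_terms, excluded_terms):
    """Build a `terms` query body from the expansions the user kept.

    `field` is a hypothetical field name; `expanded_terms` is the list
    shown to the user and `excluded_terms` the ones they unticked.
    """
    excluded = set(excluded_terms)
    kept = [t for t in expanded_terms if t not in excluded]
    return {"query": {"terms": {field: kept}}}


body = build_followup_query(
    "name",
    ["graham", "grayam", "graeme", "grahum", "grahem"],
    ["grayam"],
)
print(body)
```

The resulting dictionary would be sent as the body of an ordinary search request.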

I’m struggling to retrieve this list of terms in the first place. Ideally
I’d like to submit a simple query for a fuzzy or wildcard term, and have it
return just the possible matching terms (up to a given limit).

I’ve had reasonable success using the term suggester for fuzzy-type
responses, but can’t use this for wildcard expansions.

Is there a good way to do this using 'out-of-the-box' Elasticsearch
functionality?

Any advice / hints gratefully accepted!

Thanks

Graham

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/4b41c54b-f749-4cf1-902a-f3d0ce145d29%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Mark Harwood) #2

Hi Graham,
If you were to use the highlighter functionality, you would essentially "see
what the search engine saw".
With some client-side coding you could parse out the expanded search terms,
because they would be surrounded by tags in matching docs.
Of course, this wouldn't provide a de-duped list of terms, and it would be an
inefficient way to return an exhaustive list of all expansions used, but it
may be an approach to investigate.
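A rough sketch of that client-side parsing, assuming the default `<em>` highlight tags and the usual response layout (`hits.hits[].highlight.<field>`). The field name `name` and the sample response are invented for illustration. Note that the highlighter returns the original text rather than the indexed term, so the lower-casing here only approximates what an analyzer would do.

```python
import re
from collections import Counter

# Default highlighter tags; adjust if pre_tags/post_tags are customised.
EM_TAG = re.compile(r"<em>(.*?)</em>")


def highlighted_terms(response, field):
    """Collect a de-duplicated count of highlighted words for one field."""
    counts = Counter()
    for hit in response.get("hits", {}).get("hits", []):
        for fragment in hit.get("highlight", {}).get(field, []):
            for match in EM_TAG.findall(fragment):
                counts[match.lower()] += 1
    return counts


# Hand-written sample shaped like a search response (not real output).
sample_response = {
    "hits": {"hits": [
        {"_id": "1", "highlight": {"name": ["<em>Graeme</em> Smith"]}},
        {"_id": "2", "highlight": {"name": [
            "<em>grahem</em> jones and <em>graeme</em> lee"]}},
    ]}
}

terms = highlighted_terms(sample_response, "name")
print(sorted(terms))
```

This only sees expansions that occur in the hits actually returned, so the list is complete only if every matching document is fetched.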

Cheers
Mark



(Graham Turner) #3

Thanks Mark.

I did wonder about the highlighter, but using it would mean potentially
retrieving every hit and parsing it, which feels pretty impractical for
large searches.

Presumably the fuzzy query has to identify a full list of matching terms
internally - is there any way we could somehow hook into this, or retrieve
the list separately to the query results? A mechanism similar to the
suggester, just accepting a single fuzzy term or a wildcard term would be
perfect. I appreciate this probably isn't a common request, but I'm sure
it would have other use cases. Something to consider for a future release
perhaps? :)

Cheers

Graham



(Mark Harwood) #4

All Lucene queries implement extractTerms [1], and this API is used by
highlighter implementations to get the expanded set of terms for
wildcard, fuzzy and similar queries.
This set of terms isn't exposed directly in Elasticsearch today, but you may
be able to hack something together using scripts or a custom Java plugin -
look at SearchContext.current().query().extractTerms().
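As a conceptual illustration only (plain Python, not Lucene or Elasticsearch code), the kind of expansion being discussed can be sketched as a scan over the indexed terms for a field. Everything here is an assumption made for the example: the term set, the pattern `gr*m`, the function names, and the use of a simple dynamic-programming edit distance. Lucene's own implementation is more sophisticated.

```python
from fnmatch import fnmatchcase


def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def expand_fuzzy(term, dictionary, max_edits=2, limit=50):
    """Return indexed terms within `max_edits` of `term`, up to `limit`."""
    matches = [t for t in sorted(dictionary)
               if edit_distance(term, t) <= max_edits]
    return matches[:limit]


def expand_wildcard(pattern, dictionary, limit=50):
    """Return indexed terms matching a shell-style pattern, up to `limit`."""
    matches = [t for t in sorted(dictionary) if fnmatchcase(t, pattern)]
    return matches[:limit]


# Hypothetical set of indexed terms for one field.
indexed_terms = {"graham", "grayam", "graeme", "grahum", "grahem", "gordon"}

fuzzy = expand_fuzzy("graham", indexed_terms)
wild = expand_wildcard("gr*m", indexed_terms)
print(fuzzy)
print(wild)
```

In this sketch 'graeme' is three plain edits away from 'graham', so it is not returned with `max_edits=2`. A plugin built on extractTerms would return whatever terms the rewritten query actually contains.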

Cheers
Mark

[1] http://lucene.apache.org/core/5_1_0/core/org/apache/lucene/search/Query.html#extractTerms(java.util.Set)



(Graham Turner) #5

That sounds interesting - I'll have a look and see if we can pull something
together.

Cheers!


