Identify similar/related keywords

Hello everyone.

First of all, ElasticSearch is amazing, thanks for that!

Getting down to business, I'm trying to use ElasticSearch to identify
similar and related keywords. The objective is to be able to tell with 100%
certainty that no similar keyword exists.

So far, I have 28000 keywords uploaded to my local ElasticSearch setup to
check against.

The idea is to be able to tell automatically whether a given keyword
already has a similar form indexed.

My keywords are composed of one or more words, and I created two indexes
with them, one being a copy of the other, differing only in how they
are mapped.
(It is redundant indeed, but I could not find another way to use two
mappings. If there is an easier way to do that, please let me know!)

So, the first index is called 'keywords' and the second 'tokens', and
each has only one type, 'pt_BR', containing the keyword (only working with
Brazilian Portuguese for now).

The mapping for both can be seen here https://gist.github.com/3170803,
and also the analyzer for tokens index, 'token_analyzer'.

As you can see, the keywords index uses a keyword analyzer.
I'm using the fuzzy query to get the similar keywords.
It works reasonably well when the indexed keywords have their words in the
same order as the search term. Example: https://gist.github.com/3170824.
Still, I would like to improve it so that the most similar terms, 'hotel são
paulo' and 'hotéis são paulo', which were listed in 3rd and 10th positions
respectively, are listed first.

However, it does not work in all cases. I have a problem when the words of
the keyword come in a different order, as in this example: https://gist.github.com/3170862.
(Matches will show up in the next example.)

For this reason, I created the tokens index, analyzing the keywords using
a stemmer and removing stop words.
I tried both fuzzy_like_this_field and text queries, and they are quite
good for the previous example. Look here https://gist.github.com/3170949.
The second search, with operator: and, is even better, but it still does
not match all entries.

If I use the former to search for 'hotel são paulo', it matches 'hotel são
paulo' but does not match 'hotéis em são paulo'.
If instead I search 'hotéis são paulo' it matches only 'hotéis em são
paulo', not matching 'hotel são paulo'. (Given that I'm using fuzziness =
0.6)

I traced this to the stemmer, which stems the word hotéis to hot but does
not stem the word hotel. (I forgot to mention: hotéis is the plural of
hotel, Portuguese for hotel.)

Does anyone have a suggestion on how to improve the mapping/search process so
that I can be totally sure that when I get no results from a query, there
really is not a single related keyword already inserted?

If anything is not clear enough, just ask, I'll try harder.

Thanks,
Leandro

Hey,

so this is a tricky problem and might only be really solvable up to a
certain precision. I doubt you can really get 100% here. You have
multiple problems; here is a small list:

  • normalization (domain specific, stemmer might not do well) -> "*hotéis"
    | "*hotel" -> "hotel"
  • character edit distance ("hotl" -> "hotel") == LD1
  • term edit distance ("hotel brasilia", "brasilia hotel")

I'd try to build a normalization dictionary based on your domain using
synonym filters etc. For character edit distance you might want to use
fuzzy query, but that can be very, very slow. For the term edit distance you
can use MultiPhraseQuery with a certain max slop to get the best of both
worlds. There is no explicit support for this in Elasticsearch yet, but I'd
certainly be interested in adding it.
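
As a rough illustration of the normalization dictionary idea, here is a minimal Python sketch. The dictionary entries, the stop word list, and the function names are assumptions made up for this example; in Elasticsearch the same mapping would live in the analysis chain (for instance in a synonym filter) rather than in client code. Sorting the tokens is an extra assumption that makes the comparison ignore term order:

```python
import unicodedata

# Hypothetical, hand-maintained domain dictionary: variant -> canonical form.
NORMALIZATION = {"hoteis": "hotel"}

# Hypothetical stop word list; a real one would be much longer.
STOP_WORDS = {"em", "de", "do", "da"}


def strip_accents(text):
    """Remove combining marks, e.g. 'hotéis' -> 'hoteis'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")


def normalize(keyword):
    """Reduce a keyword to an order-independent canonical key."""
    tokens = strip_accents(keyword.lower()).split()
    tokens = [NORMALIZATION.get(t, t) for t in tokens if t not in STOP_WORDS]
    return " ".join(sorted(tokens))


# Both variants from the original question collapse to the same key.
print(normalize("hotéis em são paulo") == normalize("hotel são paulo"))
```

The dictionary only covers the variants you list, so it improves recall for known forms but cannot by itself guarantee that nothing related was missed.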

I added this issue to track
this: Add Explicit Multi & PhraseQuery support to REST & Java API · Issue #2118 · elastic/elasticsearch · GitHub

simon

Hello Simon,

On Wed, Jul 25, 2012 at 12:40 PM, simonw
simon.willnauer@elasticsearch.com wrote:

Hey,

so this is a tricky problem and might only be really solvable up to a
certain precision. I doubt you can really get 100% here. You have
multiple problems; here is a small list:

  • normalization (domain specific, stemmer might not do well) -> "*hotéis"
    | "*hotel" -> "hotel"
  • character edit distance ("hotl" -> "hotel") == LD1
  • term edit distance ("hotel brasilia", "brasilia hotel")

I'd try to build a normalization dictionary based on your domain using
synonym filters etc.

That's tough work, but it can be done. However, I'll be working with a huge
range of words, so it might be too great an effort.
But what about using a different stemmer? There is a very good one for
Portuguese (RSLP http://www.inf.ufrgs.br/~viviane/rslp/index.htm), and as
I intend to work with more languages in the future, each case might suit a
different algorithm.

for Character edit distance you might want to use fuzzy query but that can
be very very slow.

Speed is not a concern in my case.

For the term edit distance you can use MultiPhraseQuery with a certain max
slop to get the best of both worlds. There is no explicit support for this
in Elasticsearch yet but I'd be certainly interested to add this.

By this, do you mean combining both character edit and term edit distance in
the same query, or something else?

If nothing like that is implemented as of now, would you suggest
another kind of chained queries? BTW, is it possible?

I added this issue to track this:
Add Explicit Multi & PhraseQuery support to REST & Java API · Issue #2118 · elastic/elasticsearch · GitHub

simon

Thanks a lot!

Leandro