Hello everyone.
First of all, ElasticSearch is amazing, thanks for that!
Getting down to business, I'm trying to use ElasticSearch to identify
similar and related keywords. The objective is to be able to tell with 100%
certainty that no similar keyword exists.
So far, I have 28000 keywords uploaded to my local ElasticSearch setup to
check against.
The idea is to be able to tell automatically whether a given keyword
already has a similar form in the index.
My keywords are composed of one or more words, and I created two indexes
with them, one being a copy of the other, differing only in how they
are mapped.
(It is redundant indeed, but I could not find another way to use two
mappings. If there is an easier way to do that, please let me know!)
So, the first index is called 'keywords' and the second 'tokens', and
each has only one type, 'pt_BR', containing the keyword (I'm only working
with Brazilian Portuguese for now).
The mappings for both, along with the analyzer for the tokens index,
'token_analyzer', can be seen here: https://gist.github.com/3170803
As you can see, the keywords index uses a keyword analyzer.
I'm using the fuzzy query to get the similar keywords.
It works fairly well when the indexed keywords have their words in the same
order as the search term. Example: https://gist.github.com/3170824.
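
For readers who do not want to open the gist, here is a minimal sketch of
what a fuzzy query body of this kind can look like, built in Python. The
field name 'keyword' and the min_similarity value of 0.6 are assumptions
made for illustration; the real field name is whatever the mapping in the
gist defines.

```python
import json

def build_fuzzy_query(field, term, min_similarity=0.6):
    """Build a request body for a fuzzy query on a single field.

    'field' and 'min_similarity' are illustrative placeholders,
    not values taken from the actual mapping.
    """
    return {
        "query": {
            "fuzzy": {
                field: {
                    "value": term,
                    "min_similarity": min_similarity,
                }
            }
        }
    }

# Hypothetical field name; the whole keyword is one term in the
# 'keywords' index because it uses the keyword analyzer.
body = build_fuzzy_query("keyword", "hotel são paulo")
print(json.dumps(body, ensure_ascii=False, indent=2))
```

Because the keyword analyzer keeps the whole string as a single term, the
edit distance is computed over the full keyword, which is why word order
matters so much in this index.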
Still, I would like to improve it so that the most similar terms, 'hotel são
paulo' and 'hotéis são paulo', which were listed in 3rd and 10th positions
respectively, are listed first.
It also does not work in all cases: I have a problem when the words of the
keyword come in a different order, as in this example:
https://gist.github.com/3170862.
(Matches will show up in the next example)
For this reason, I created the tokens index, analyzing the keywords using
a stemmer and removing stop words.
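
To illustrate why token-level analysis helps with word order, here is a
rough Python sketch. It is not the actual 'token_analyzer': the stop-word
list is a small illustrative subset, stemming is left out, and the
function name 'analyze' is made up for this example.

```python
# Illustrative subset of Portuguese stop words (assumption, not the
# list used by the real analyzer).
STOP_WORDS = {"em", "de", "da", "do", "para"}

def analyze(keyword):
    """Lowercase, split on whitespace, drop stop words.

    Returns a frozenset so that word order is ignored when comparing.
    """
    tokens = keyword.lower().split()
    return frozenset(t for t in tokens if t not in STOP_WORDS)

a = analyze("hotel em são paulo")
b = analyze("são paulo hotel")
print(a == b)
```

Once a keyword is broken into tokens, each token can match on its own, so
the same words in a different order still hit the same indexed terms.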
I tried both fuzzy_like_this_field and text queries, and they work quite
well for the previous example. See https://gist.github.com/3170949.
The second search, with operator: and, is even better, but it also does not
match all entries.
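
As a sketch of the second search, a text query body with operator: and
could be built like this. The field name 'keyword' is a placeholder for
whatever the tokens mapping defines, and the helper name is hypothetical.

```python
def build_text_query(field, text, operator="or", fuzziness=None):
    """Build a request body for a text query on one field.

    With operator "and", every analyzed token of 'text' must match;
    with "or", any single token is enough.
    """
    params = {"query": text, "operator": operator}
    if fuzziness is not None:
        params["fuzziness"] = fuzziness
    return {"query": {"text": {field: params}}}

# Hypothetical field name in the 'tokens' index.
text_body = build_text_query("keyword", "hotel são paulo",
                             operator="and", fuzziness=0.6)
print(text_body)
```

Requiring all tokens cuts down on loosely related hits, but it also means a
single token that fails to match (after stemming) drops the whole keyword
from the results.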
If I use the former to search for 'hotel são paulo', it matches 'hotel são
paulo' but does not match 'hotéis em são paulo'.
If instead I search 'hotéis são paulo' it matches only 'hotéis em são
paulo', not matching 'hotel são paulo'. (Given that I'm using fuzziness =
0.6)
The reason for this, as far as I can tell, is the stemmer, which stems the
word 'hotéis' to 'hot' but does not stem the word 'hotel'. (I forgot to
mention: 'hotéis' is the plural of 'hotel', Portuguese for hotel.)
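
A rough way to see why the stems 'hot' and 'hotel' fail to match each other
at 0.6 is to compute an edit-distance similarity between them. The sketch
below assumes the similarity is 1 - distance / min(len(a), len(b)), which
is my understanding of how older Lucene fuzzy matching scores terms; the
real scoring may differ, so treat the number as indicative only.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Assumed scoring: 1 - distance / length of the shorter string."""
    return 1.0 - levenshtein(a, b) / min(len(a), len(b))

score = similarity("hot", "hotel")
print(score, score >= 0.6)
```

Under this assumption the two stems are two edits apart on a three-letter
term, which falls well below 0.6, so fuzziness alone cannot bridge the gap
the stemmer creates.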
Does anyone have any suggestions on how to improve the mapping/search
process so I can be totally sure that when I get no results from a query,
there really is not a single related keyword already inserted?
If anything is not clear enough, just ask, I'll try harder.
Thanks,
Leandro