Mapping for finding keywords in domains

(Hacker 21) #1

hi there,

I recently searched for "toronto dental" on Google and noticed that it even matches keywords inside domain names:

image

Each domain is a single unbroken string, yet Google is able to find the two words (in any order) inside the 2 domain results above.

I was just wondering if there is a similar mapping or querying approach that can accomplish this in Elasticsearch? Find keywords inside of a string with no spaces, and also find the words in any order as long as they are in the string.

Thanks and hope you guys are having a good week,

(Hacker 21) #2

to clarify, if the search query is toronto dental:

is there a mapping which can find these 2 keywords inside a "domain" field with values like:
torontobeachdental.com or citydentaltoronto.com

(Mark Harwood) #3

One approach is to use n-grams: you choose to index substrings of a given length within a string rather than just the whole thing.

It doesn't identify particular words, but word boundaries can be ambiguous anyway, as the registered domain holders of the pen store PenIsland.com and the actor-agent search service WhoRepresents.com found out.
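To make the n-gram idea concrete, here is a minimal pure-Python sketch of what a character n-gram tokenizer roughly emits. The helper name `char_ngrams` is hypothetical, and this only imitates the idea; it is not Elasticsearch's actual implementation.

```python
def char_ngrams(text, min_gram=3, max_gram=3):
    """Return the set of character n-grams of `text` (lowercased).

    Hypothetical helper that only imitates what an n-gram tokenizer
    produces; it is not Elasticsearch code.
    """
    text = text.lower()
    grams = set()
    for size in range(min_gram, max_gram + 1):
        for start in range(len(text) - size + 1):
            grams.add(text[start:start + size])
    return grams


domain_grams = char_ngrams("torontobeachdental.com")
query_grams = char_ngrams("dental")

print(sorted(query_grams))          # ['den', 'ent', 'nta', 'tal']
print(query_grams <= domain_grams)  # True: every trigram of "dental" is in the domain
```

Because every trigram of "dental" also occurs in the domain string, a search over the n-gram tokens can find the word even though the domain contains no spaces.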

(Hacker 21) #4

haha thanks for sharing Mark.

I now have min_gram set to 1 and max_gram set to 1 as well on my new_domain field.

However, when I go into Kibana, I noticed the following:
image
this matches any doc where the new_domain field has the letters D, E, N, T, A, or L ... which is almost my entire dataset

image
this matches the "expected" way, looking across the new_domain field values for docs that contain the sequence D-E-N-T-A-L, and returns the correct number of results.

The problem is that when I use Elasticsearch DSL (Python) to run a search, looking up "dental" or even "'dental'" as the query produces results like the first Kibana query. Instead of looking for the sequence of letters, it matches any document containing those letters.

Is there anything I can do? How can I query elasticsearch to look for the sequence in the new_domain field? Thank you!
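One likely explanation for the difference, sketched here in plain Python rather than against a real cluster: with min_gram and max_gram both set to 1, the query text is itself split into single letters, and a match query by default ORs its tokens together, whereas a phrase query requires the tokens to appear consecutively and in order. The function names below are hypothetical and only imitate those two behaviours.

```python
def unigram_tokens(text):
    """Split text into single-character tokens, as 1-grams would."""
    return list(text.lower())


def matches_any_token(query, field_value):
    """OR semantics: true if any single query token occurs in the field."""
    field_tokens = set(unigram_tokens(field_value))
    return any(tok in field_tokens for tok in unigram_tokens(query))


def matches_as_phrase(query, field_value):
    """Phrase semantics: query tokens must appear consecutively, in order."""
    q = unigram_tokens(query)
    f = unigram_tokens(field_value)
    return any(f[i:i + len(q)] == q for i in range(len(f) - len(q) + 1))


print(matches_any_token("dental", "skyline.com"))            # True (shares l, n, e)
print(matches_as_phrase("dental", "skyline.com"))            # False
print(matches_as_phrase("dental", "citydentaltoronto.com"))  # True
```

Under these assumptions, the first behaviour matches almost every domain, which is consistent with nearly the whole dataset coming back.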

(Mark Harwood) #5

Normally you'd use n-grams of 3 or more characters.
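As a rough illustration of what such a setup could look like, the sketch below builds an index body as a plain Python dict. The names `build_index_body`, `domain_ngram` and `domain_ngram_analyzer` are made up for this example, the typeless mapping layout assumes a 7.x-or-later cluster, and the body has not been run against one. Note that recent Elasticsearch versions limit the gap between min_gram and max_gram through the `index.max_ngram_diff` setting (default 1), so a wider range has to raise it.

```python
def build_index_body(min_gram=3, max_gram=3, field="new_domain"):
    """Build a settings/mappings body for an n-gram analyzed text field.

    Hypothetical helper; the analyzer and tokenizer names are illustrative.
    """
    return {
        "settings": {
            # The allowed min/max gap defaults to 1, so widen it when needed.
            "index": {"max_ngram_diff": max(1, max_gram - min_gram)},
            "analysis": {
                "tokenizer": {
                    "domain_ngram": {
                        "type": "ngram",
                        "min_gram": min_gram,
                        "max_gram": max_gram,
                    }
                },
                "analyzer": {
                    "domain_ngram_analyzer": {
                        "type": "custom",
                        "tokenizer": "domain_ngram",
                        "filter": ["lowercase"],
                    }
                },
            },
        },
        "mappings": {
            "properties": {
                field: {"type": "text", "analyzer": "domain_ngram_analyzer"}
            }
        },
    }


index_body = build_index_body(min_gram=3, max_gram=7)
print(index_body["settings"]["index"])  # {'max_ngram_diff': 4}
```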

(Hacker 21) #6

But even in the above example, say with an n-gram size of 4, wouldn't it match "dent" with:
Dentureclinic.com
Cardentwizard.com

Or "tist" with:
Scientist.com
Cardentscientist.com

As well?

(Mark Harwood) #7

Right, but if the search is “Toronto dental” then those URLs that contain more of the n-grams than others will rank highest in results.
The search engine rewards matches that contain more of the search terms than others and also ranks rarer terms highly (e.g. the ‘oro’ and ‘nto’ from Toronto).
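The ranking effect described above can be imitated with a toy scorer. This is a deliberately simplified sketch and not Elasticsearch's real scoring formula: each shared trigram adds to a domain's score, and trigrams that occur in fewer domains add more. The function names and the small domain list are assumptions made for the example.

```python
import math
from collections import Counter


def trigrams(text):
    """Set of lowercase character trigrams in `text`."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}


def rank(query, domains):
    """Rank domains by a toy rarity-weighted trigram overlap with the query."""
    doc_grams = {d: trigrams(d) for d in domains}
    # Number of domains each trigram appears in.
    doc_freq = Counter(g for grams in doc_grams.values() for g in grams)
    n_docs = len(domains)
    query_grams = set()
    for word in query.split():
        query_grams |= trigrams(word)
    scores = {
        domain: sum(math.log(1 + n_docs / doc_freq[g]) for g in query_grams & grams)
        for domain, grams in doc_grams.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


domains = [
    "dentureclinic.com",
    "scientist.com",
    "torontobeachdental.com",
    "citydentaltoronto.com",
]
for domain, score in rank("toronto dental", domains):
    print(f"{score:6.2f}  {domain}")
```

In this toy data the two domains containing both words share all nine query trigrams and come out on top, while dentureclinic.com and scientist.com only pick up a small score from common fragments such as "ent".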

(Hacker 21) #8

Thank you for clarifying Mark, I see what you're saying now. It has a "bigger picture" and multiple factors which end up creating the score, not just the domain.

What method of querying do you recommend? I have set minimum n-gram size to 3 and max to 7. Unfortunately, match_phrase for Toronto Dentist did not contribute to results (same overall score, same # of results) ... and a simple match is behaving oddly as well for the same query.

(Hacker 21) #9

Just re-visited the various options available in the ES docs ...

Instead of a match or match_phrase, do you think a term query would be more suitable in this scenario?
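For comparison, here is a hedged sketch of the two query bodies as plain Python dicts. A term query is not analyzed, so it looks for the literal token "toronto dental"; with n-grams of 3 to 7 characters no token that long exists in the index, so it would be expected to return nothing. A match query analyzes the text into n-grams first, and its `minimum_should_match` parameter can require that a share of them be present. The helper names and the 75% threshold are illustrative assumptions, not recommendations from this thread.

```python
def build_match_query(text, field="new_domain", min_match="75%"):
    """Analyzed query: the text is split into n-grams before matching."""
    return {
        "query": {
            "match": {
                field: {
                    "query": text,
                    # Require most of the query n-grams, not just any one of them.
                    "minimum_should_match": min_match,
                }
            }
        }
    }


def build_term_query(text, field="new_domain"):
    """Non-analyzed query: looks up `text` as one exact token."""
    return {"query": {"term": {field: {"value": text}}}}


match_body = build_match_query("toronto dental")
term_body = build_term_query("toronto dental")
print(match_body)
print(term_body)
```

Either dict could then be passed as the search body by whichever client is in use; how well the threshold works would need checking against the real data.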