Mapping for finding keywords in domains

hacker_21 · May 1, 2019, 11:28pm

hi there,

I recently searched for "toronto dental" on Google and noticed that it even matches keywords inside domain names:

Each domain is a single word, yet Google is able to find the two words (in any order) inside the 2 domain results above.

I was just wondering if there is a similar mapping or querying approach that can accomplish this in Elasticsearch? That is, find keywords inside a string with no spaces, in any order, as long as they appear somewhere in the string.

Thanks and hope you guys are having a good week,

hacker_21 · May 3, 2019, 2:48pm

to clarify, if the search query is toronto dental:

is there a mapping which can find these 2 keywords inside a "domain" field with values like:
torontobeachdental.com or citydentaltoronto.com

Mark_Harwood · May 3, 2019, 3:12pm

One approach is to use n-grams: that's choosing to index substrings of a configurable length within a string rather than just the whole thing.

It doesn't identify particular words, but that can be ambiguous anyway, as the registered domain holders of the pen store PenIsland.com and the actor-agent search service WhoRepresents.com found out.
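
To make the idea concrete, here is a minimal pure-Python sketch of what n-gram indexing does to a domain name. The `ngrams` helper is hypothetical and only imitates a tokenizer; it is not Elasticsearch code. It checks that every trigram of "toronto" and "dental" occurs inside `torontobeachdental`, regardless of word order.

```python
def ngrams(text, min_gram, max_gram):
    """Return every substring of text with length between min_gram and max_gram."""
    text = text.lower()
    grams = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams

# Trigrams of the stored value and of each search word.
domain_grams = set(ngrams("torontobeachdental", 3, 3))
query_grams = set(ngrams("toronto", 3, 3)) | set(ngrams("dental", 3, 3))

# True when every query trigram is present somewhere in the domain.
all_found = query_grams <= domain_grams
```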

hacker_21 · May 11, 2019, 5:36pm

haha thanks for sharing Mark.

I now have min_gram set to 1 and max_gram set to 1 as well on my new_domain field.

However, when I go into Kibana, I noticed the following:

this matches any doc where the new_domain field has the letters D, E, N, T, A, or L ... which is almost my entire dataset

this matches the "expected" way of looking across the new_domain field values for docs where there is a sequence of D-E-N-T-A-L producing a correct amount.

The problem is that when I use Elasticsearch DSL (Python) to run a search, looking up "dental" or even "'dental'" produces results like the first case in Kibana. Instead of looking for the sequence of letters, it matches any document containing those letters.

Is there anything I can do? How can I query elasticsearch to look for the sequence in the new_domain field? Thank you!
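
One way to see the difference between the two behaviours is a small simulation, assuming the field really is indexed as single-character grams (min_gram = max_gram = 1). A plain `match` query ORs the analyzed tokens by default, whereas `match_phrase` requires them at consecutive positions, in order. The helper names below are made up for illustration, and the two dicts are only a sketch of the query bodies, not a tested setup.

```python
def or_match(query, value):
    """Mimic a default match query on 1-grams: any shared character is a hit."""
    return any(ch in value for ch in query)

def phrase_match(query, value):
    """Mimic match_phrase on 1-grams: the characters must appear consecutively, in order."""
    return query in value

domains = ["torontobeachdental", "scientist", "quiz"]
loose = [d for d in domains if or_match("dental", d)]
strict = [d for d in domains if phrase_match("dental", d)]

# Sketch of the corresponding query bodies (field name taken from the thread).
match_body = {"query": {"match": {"new_domain": "dental"}}}
phrase_body = {"query": {"match_phrase": {"new_domain": "dental"}}}
```

Here `loose` picks up `scientist` because it shares the letters e, n and t, while `strict` keeps only the value that contains the full sequence.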

Mark_Harwood · May 12, 2019, 8:07am

Normally you’d use n-grams of length 3 or more.
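
A minimal sketch of index settings for trigrams, written as a Python dict so it can be passed to whatever client is in use. The tokenizer and analyzer names (`domain_trigrams`, `domain_ngram`) are hypothetical, and the typeless `mappings` layout assumes Elasticsearch 7.x; only the `new_domain` field name comes from the thread.

```python
index_body = {
    "settings": {
        "analysis": {
            "tokenizer": {
                # Hypothetical name; emits every 3-character substring.
                "domain_trigrams": {
                    "type": "ngram",
                    "min_gram": 3,
                    "max_gram": 3,
                    # Split on anything that is not a letter or digit, e.g. the dot.
                    "token_chars": ["letter", "digit"],
                }
            },
            "analyzer": {
                # Hypothetical name; lowercases the grams after tokenizing.
                "domain_ngram": {
                    "type": "custom",
                    "tokenizer": "domain_trigrams",
                    "filter": ["lowercase"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "new_domain": {"type": "text", "analyzer": "domain_ngram"}
        }
    },
}
```

If max_gram is set more than one above min_gram, 7.x versions also expect the `index.max_ngram_diff` setting to be raised accordingly.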

hacker_21 · May 13, 2019, 7:03pm

But even in the above example, say with an n-gram size of 4, wouldn't it match "dent" with:
Dentureclinic.com
Cardentwizard.com

Or "tist" with:
Scientist.com
Cardentscientist.com

As well?

Mark_Harwood · May 13, 2019, 8:34pm

Right, but if the search is “Toronto dental” then those URLs that contain more of the n-grams than others will rank highest in results.
The search engine rewards matches that contain more of the search terms than others and also ranks rarer terms highly (e.g. the ‘oro’ and ‘nto’ from Toronto).
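
The ranking effect described here can be illustrated with a toy scorer. This is a simplified sketch, not Elasticsearch's actual scoring: it uses trigrams only and a made-up rarity weight of log(1 + N/df), and the domain list is assembled from examples mentioned in this thread.

```python
import math

def trigrams(word):
    """Set of all 3-character substrings of a lowercased word."""
    word = word.lower()
    return {word[i:i + 3] for i in range(len(word) - 2)}

def query_grams(query):
    """Trigrams of each search word, combined into one set."""
    grams = set()
    for word in query.split():
        grams |= trigrams(word)
    return grams

domains = [
    "torontobeachdental.com",
    "citydentaltoronto.com",
    "dentureclinic.com",
    "cardentwizard.com",
    "scientist.com",
]
# Toy index: the trigrams of the part before the first dot.
index = {d: trigrams(d.split(".")[0]) for d in domains}

def score(query, domain):
    """Sum a rarity weight for every query trigram the domain contains."""
    total = 0.0
    for gram in query_grams(query):
        if gram in index[domain]:
            df = sum(1 for grams in index.values() if gram in grams)
            total += math.log(1 + len(index) / df)
    return total

ranked = sorted(domains, key=lambda d: score("toronto dental", d), reverse=True)
```

The two domains containing both words share every query trigram and come out on top; domains that only share "den" or "ent" still match, but with much lower scores.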

hacker_21 · May 15, 2019, 12:28am

Thank you for clarifying Mark, I see what you're saying now. It has a "bigger picture" and multiple factors which end up creating the score, not just the domain.

What method of querying do you recommend? I have set minimum n-gram size to 3 and max to 7. Unfortunately, match_phrase for Toronto Dentist did not contribute to results (same overall score, same # of results) ... and a simple match is behaving oddly as well for the same query.

hacker_21 · May 15, 2019, 3:40pm

Just re-visited the various options available in the ES docs ...

Instead of a match or match_phrase, do you think a term query would be more suitable in this scenario?
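
For comparison, a `term` query is not analyzed, so it looks up the query text as one exact token. Assuming the 3 to 7 gram setup mentioned above and lowercased grams, each whole search word of 3 to 7 letters is itself an indexed gram, so one `term` clause per word could work, while a single `term` for the full phrase cannot. The sketch below simulates this in plain Python; the helper names are hypothetical and the `bool` body is untested.

```python
def ngrams(text, min_gram, max_gram):
    """Return every substring of text with length between min_gram and max_gram."""
    return [
        text[start:start + size]
        for start in range(len(text))
        for size in range(min_gram, max_gram + 1)
        if start + size <= len(text)
    ]

def indexed_terms(domain):
    """Toy index for one value: 3 to 7 grams of the part before the first dot."""
    name = domain.lower().split(".")[0]
    return set(ngrams(name, 3, 7))

def all_terms_match(query, domain):
    """Mimic one term clause per search word, all required."""
    terms = indexed_terms(domain)
    return all(word in terms for word in query.lower().split())

# Sketch of an equivalent query body: one exact-token lookup per word.
term_body = {
    "query": {
        "bool": {
            "must": [
                {"term": {"new_domain": "toronto"}},
                {"term": {"new_domain": "dental"}},
            ]
        }
    }
}
```

One limitation of this approach: a search word longer than max_gram (for example "dentistry", 9 letters) is never indexed as a single gram, so its `term` clause would find nothing.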

system · June 12, 2019, 3:40pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
