the domain is a single word, yet they are able to search for the two words (in any order) inside of the 2 domain results above.
I was just wondering if there is a similar mapping or querying approach that can accomplish this in Elasticsearch? I'd like to find keywords inside a string with no spaces, and also find the words in any order as long as they are in the string.
One approach is to use n-grams: you index substrings of various lengths within a string rather than just the whole thing.
It doesn't identify particular words, but word boundaries can be ambiguous anyway, as the registered domain holders of the pen store PenIsland.com and the actor-agent search service WhoRepresents.com found out.
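To make the idea concrete, here is a minimal sketch in plain Python of what indexing n-grams amounts to. It is only an illustration: the `ngrams` helper and the 3-to-7 length range are assumptions made for the example, not Elasticsearch's own tokenizer code.

```python
def ngrams(text, min_gram, max_gram):
    """Return every substring of text with a length from min_gram to max_gram."""
    text = text.lower()
    grams = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams


# Hypothetical domain with no spaces between the words.
domain_grams = set(ngrams("torontodental", 3, 7))

print("dental" in domain_grams)   # True
print("toronto" in domain_grams)  # True
print("dentist" in domain_grams)  # False
```

Because every substring in the length range is indexed, either word can be found regardless of where it sits inside the domain.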
I now have min_gram set to 1 and max_gram set to 1 as well on my new_domain field.
However, when I go into Kibana, I noticed the following:
This matches any doc where the new_domain field has any of the letters D, E, N, T, A, or L, which is almost my entire dataset.
This matches the "expected" way: it looks across the new_domain field values for docs that contain the sequence D-E-N-T-A-L, and produces a correct number of results.
The problem is that when I use Elasticsearch DSL (Python) to make a search, looking up "dental" or even "'dental'" as the query produces results like the first query in Kibana. Instead of looking for the sequence of letters, it matches any document containing those letters.
Is there anything I can do? How can I query elasticsearch to look for the sequence in the new_domain field? Thank you!
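The difference between the two behaviours can be sketched in plain Python. This is a simulation under assumptions, not Elasticsearch itself: `unigrams`, `match_any` and `match_phrase` are hypothetical helpers that mimic a 1-gram tokenizer, a match query with its default OR operator, and a match_phrase query. The two request bodies at the end use the standard query DSL shape with the new_domain field from this thread.

```python
def unigrams(text):
    # Stand-in for an ngram tokenizer with min_gram=1 and max_gram=1.
    return list(text.lower())


def match_any(query, field_value):
    # Like a match query with the default OR operator:
    # a single shared token is enough for a hit.
    return bool(set(unigrams(query)) & set(unigrams(field_value)))


def match_phrase(query, field_value):
    # Like match_phrase: the query tokens must appear
    # consecutively and in the same order.
    q, d = unigrams(query), unigrams(field_value)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


domains = ["torontodental", "landscaping", "xyz"]  # made-up sample values
print([d for d in domains if match_any("dental", d)])
# ['torontodental', 'landscaping']
print([d for d in domains if match_phrase("dental", d)])
# ['torontodental']

# Request bodies for the two behaviours.
loose_query = {"query": {"match": {"new_domain": "dental"}}}
strict_query = {"query": {"match_phrase": {"new_domain": "dental"}}}
```

With the Python DSL library, something like `Search().query("match_phrase", new_domain="dental")` should produce the second body. Treat that as a sketch to verify against your client version.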
Right, but if the search is “Toronto dental” then those URLs that contain more of the n-grams than others will rank highest in results.
The search engine rewards matches that contain more of the search terms than others and also ranks rarer terms highly (e.g. the ‘oro’ and ‘nto’ from Toronto).
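As a rough illustration of that ranking behaviour, here is a toy scorer in Python. It is a deliberately simplified stand-in (a sum of inverse-document-frequency weights over shared trigrams), not the actual scoring formula Elasticsearch uses, and the corpus is made up. It assumes every scored document is part of the corpus.

```python
import math


def trigrams(text):
    # Split on whitespace first so no gram spans two query words.
    grams = set()
    for word in text.lower().split():
        grams |= {word[i:i + 3] for i in range(len(word) - 2)}
    return grams


def score(query, doc, corpus):
    # Sum a rarity weight for each trigram shared by query and doc.
    # More shared grams and rarer grams both raise the score.
    total = 0.0
    for gram in trigrams(query) & trigrams(doc):
        doc_freq = sum(1 for other in corpus if gram in trigrams(other))
        total += math.log(1 + len(corpus) / doc_freq)
    return total


# Hypothetical domains; "dental" grams are more common than "toronto" grams.
corpus = ["torontodental", "dentalcare", "dentalworks", "torontopizza"]
ranking = sorted(corpus, key=lambda d: score("toronto dental", d, corpus), reverse=True)
print(ranking[0])  # torontodental
```

The domain containing both words shares the most trigrams and so comes first, while a domain that only shares the rarer "toronto" trigrams still outranks the ones that only share the more common "dental" trigrams.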
Thank you for clarifying, Mark. I see what you're saying now. There is a "bigger picture": multiple factors end up creating the score, not just the domain.
What method of querying do you recommend? I have set minimum n-gram size to 3 and max to 7. Unfortunately, match_phrase for Toronto Dentist did not contribute to results (same overall score, same # of results) ... and a simple match is behaving oddly as well for the same query.
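For reference, index settings along the lines described might look roughly like the Python dict below. This is a sketch under assumptions: the tokenizer and analyzer names (`domain_ngrams`, `domain_ngram_analyzer`) are made up, and the `token_chars` and `lowercase` choices are illustrative rather than taken from the actual index. One detail worth checking: in recent Elasticsearch versions the allowed gap between min_gram and max_gram is limited by `index.max_ngram_diff` (default 1), so a 3-to-7 range needs that setting raised.

```python
MIN_GRAM = 3
MAX_GRAM = 7

index_body = {
    "settings": {
        "index": {
            # Must be at least max_gram - min_gram, or index creation is rejected.
            "max_ngram_diff": MAX_GRAM - MIN_GRAM,
        },
        "analysis": {
            "tokenizer": {
                "domain_ngrams": {  # hypothetical name
                    "type": "ngram",
                    "min_gram": MIN_GRAM,
                    "max_gram": MAX_GRAM,
                    # Assumption: keep only letters and digits inside grams.
                    "token_chars": ["letter", "digit"],
                }
            },
            "analyzer": {
                "domain_ngram_analyzer": {  # hypothetical name
                    "type": "custom",
                    "tokenizer": "domain_ngrams",
                    "filter": ["lowercase"],
                }
            },
        },
    },
    "mappings": {
        "properties": {
            "new_domain": {"type": "text", "analyzer": "domain_ngram_analyzer"}
        }
    },
}
```

A body like this would be passed when creating the index. The field name new_domain is the one from this thread, and everything else is a placeholder to adapt.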