ES partial matching (ngram) use case


(mehmet) #1

Hi everybody,
I have an index of book records, such as:
ElasticSearch Cookbook
ElasticSearch Server
Mastering ElasticSearch
ElasticSearch

I have more than 2M records.
Search cases:

search term --- expected result --- (case)
elastic cook ---
search cook --- ElasticSearch Cookbook --- (partial match)
ElasticSearhCookBook --- ElasticSearch Cookbook --- (no space)
ekasticsearch --- ElasticSearch --- (typo)

etc.

I tried to write the whole problem here, but there is a character limit for topics.
To check the analyzer, mapping, and query for this problem, please see the following link:

whole problem definition

So, am I doing something wrong, or is this normal?


(Nik Everett) #2

You are better off linking to some gists.

In the link you ask about response times. Fuzzy matching is my guess for what is taking the time. Phrase matching on ngrams is also expensive because you end up with lots and lots of tokens. So I'm not surprised it's slow.
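To give a rough feel for the "lots and lots of tokens" point, here is a small Python sketch that counts character n-grams for one title. The gram sizes (2 to 5) and the per-word splitting are assumptions for illustration only; the analyzer settings actually used in the linked problem definition are not shown in this thread.

```python
def ngrams(text, min_gram, max_gram):
    """Return all character n-grams of `text` with lengths min_gram..max_gram."""
    text = text.lower()
    grams = []
    for n in range(min_gram, max_gram + 1):
        for start in range(len(text) - n + 1):
            grams.append(text[start:start + n])
    return grams


title = "ElasticSearch Cookbook"

# Split on whitespace first (roughly what a whitespace tokenizer would do),
# then n-gram each word. min_gram=2 and max_gram=5 are assumed values.
tokens = [g for word in title.split() for g in ngrams(word, 2, 5)]

print(len(title.split()), "words ->", len(tokens), "n-gram tokens")
# prints: 2 words -> 64 n-gram tokens
```

A phrase query has to check positions across all of those tokens, which is one plausible reason it gets slow on 2M records.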

There are lots of things you can do about it - don't use fuzzy matching at all, for one. You could try looking into the term, phrase, or completion suggester for spelling correction. Or you could get spelling results by going from a phrase query on those ngrams to a terms query. I'm not sure how you'd word it to Elasticsearch, but if you were to ngram the input and require only one of the terms to match, you'd still find the books even with spelling errors. But you'd find too many books. Hopefully scoring would make the one you wanted come back higher.
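One possible way to word the "ngram the input and require only one of the terms to match" idea is a `bool` query with one `term` clause per n-gram and `minimum_should_match` set to 1. The Python sketch below only builds the query body; it assumes trigrams, assumes the target field is indexed with a matching lowercase trigram analyzer, and uses `name` as the field only because that is the field in this thread. The helper names are made up for illustration.

```python
import json


def trigrams(word):
    """Return the set of lowercase 3-character grams of a single word."""
    word = word.lower()
    return {word[i:i + 3] for i in range(len(word) - 2)}


def ngram_terms_query(field, user_input, min_match=1):
    """Build a bool query where any `min_match` trigram of the input may match.

    Assumption: `field` is analyzed into lowercase trigrams at index time.
    """
    grams = sorted(set().union(*(trigrams(w) for w in user_input.split())))
    return {
        "bool": {
            "should": [{"term": {field: g}} for g in grams],
            "minimum_should_match": min_match,
        }
    }


typo, correct = "ekasticsearch", "elasticsearch"
shared = trigrams(typo) & trigrams(correct)
print(len(shared), "of", len(trigrams(typo)), "trigrams still match")
# prints: 9 of 11 trigrams still match

print(json.dumps(ngram_terms_query("name", typo), indent=2))
```

Because most trigrams of the misspelled input still occur in the correct title, the right book should score well, but as noted above, many unrelated books will match too. Raising `min_match` is one knob for trading recall against noise.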

Another option is to index the books with a title but also with common misspellings. When you search, you search both of those fields.
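As a sketch of the two-field idea, the Python below shows a document with an extra field and a `multi_match` query over both fields. The `misspellings` field name, the sample misspellings, and the `^3` boost on `name` are all assumptions made up for illustration; nothing in this thread defines them.

```python
import json

# Hypothetical document shape: "misspellings" is an assumed field name and
# the listed variants are invented examples, not data from the real index.
book = {
    "name": "ElasticSearch Cookbook",
    "misspellings": ["ElasticSearh Cookbook", "Elastic Serach Cookbook"],
}


def title_or_misspelling_query(user_input):
    """Search the real title and the misspellings field together.

    The title is boosted (an assumed factor of 3) so that exact titles
    rank above documents that only match through a known misspelling.
    """
    return {
        "multi_match": {
            "query": user_input,
            "fields": ["name^3", "misspellings"],
        }
    }


print(json.dumps(title_or_misspelling_query("elasticsearh cookbook"), indent=2))
```

The cost moves from query time to index time: someone has to maintain the list of misspellings, but the search itself stays a plain, cheap match.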


(mehmet) #3

Thanks a lot.
After removing fuzziness and phrase matching from the must query (which I was applying on the ngram field), response time became much better: for a 14-character search term, the max response time is 100 ms.
New query:


{
  "bool": {
    "must": {
      "match": {
        "name": {
          "query": "elastic cook",
          "type": "boolean",
          "operator": "OR",
          "minimum_should_match": "1",
          "cutoff_frequency": 0.01
        }
      }
    },
    "should": [
      {
        "match": {
          "name.exact": {
            "query": "elastic cook",
            "type": "phrase",
            "boost": 4
          }
        }
      },
      {
        "match": {
          "name.token": {
            "query": "elastic cook",
            "type": "phrase"
          }
        }
      },
      {
        "match": {
          "name.edgeNGnoSplit": {
            "query": "elastic cook",
            "type": "phrase",
            "fuzziness": "1",
            "max_expansions": 8
          }
        }
      },
      {
        "match": {
          "name.edgeNG": {
            "query": "elastic cook",
            "type": "phrase",
            "fuzziness": "1",
            "max_expansions": 4
          }
        }
      }
    ]
  }
}
