Search_analyzer

Heya,

So I am trying to follow the guide on this page in relation to setting a different analyser for index and search time.

However, I want to use the standard analyser at index time and an edge n-gram analyser at search time. The below does not work (i.e. the last query should, logically in my mind, get a hit). It's basically the worked example from the Elasticsearch documentation, just with the index time and search time analysers swapped around.

DELETE my_index

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_analyzer",
          "filter": [
            "lowercase"
          ]
        }
      },
      "tokenizer": {
        "my_analyzer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 30,
          "token_chars": [
            "letter"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "standard",
          "search_analyzer": "my_analyzer"
        }
      }
    }
  }
}

PUT my_index/_doc/1
{
  "title": "Quick F" 
}

POST my_index/_refresh

GET my_index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "Quick Foxe",
        "operator": "and",
        "analyzer": "my_analyzer"
      }
    }
  }
}

I am expecting a search for "Quick Foxe" to be "analysed" to "Quick F" and therefore match the data stored in the index. Is this not possible, or have I just got the syntax wrong?

The problem is this part:

     "token_chars": [
        "letter"
      ]

as it will disregard space characters and Quick Foxe will be first tokenized into two tokens Quick and Foxe and only then ngramed.

If you remove token_chars from your tokenizer and operator: and from your query, it will work.
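
Something like this, roughly (just a sketch, untested): re-create the index with token_chars removed from the edge_ngram tokenizer, then run the match query without the operator:

GET my_index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "Quick Foxe",
        "analyzer": "my_analyzer"
      }
    }
  }
}

With the default or operator, the quick edge n-gram alone is enough to match the quick token that the standard analyzer produced at index time.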

Thanks Val, I have removed the token_chars.

However, I want to be able to keep the AND operator, as there is no benefit in using the n-gram analyser if "Quick Foxe" matches "Quick F" just because Quick and Quick match.

I'm starting to see my problem now. If you ngram the search, it's going to explode the search terms, and if I use the AND operator it's going to require all of those search terms to be present. Maybe I need to play around with "minimum should match"; I wonder if that applies to the input search terms or the exploded-out search terms.

If I put a minimum should match on, it actually still does not work, so there must be something else I am doing wrong.

GET my_index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "Quick Foxe",
        "analyzer": "my_analyzer",
        "minimum_should_match": 2
      }
    }
  }
}

Usually, we do it the other way around, i.e. you use ngram at indexing time (i.e. analyzer) and the standard analyzer at search time. Using ngram at search time can produce all kinds of undesired results. This is exactly what the example you linked to does, but you seem to want to do it the other way around; I'm not sure why.

Also, it is weird to index Quick F. Is it faithful to the real documents you'll index?

If we get into the details:

  • At indexing time, Quick F will get indexed to quick and f
  • At search time, Quick Foxe will get analyzed to q, qu, qui, quic, quick, quick (with a trailing space), quick f, quick fo, quick fox, quick foxe

As you can see, given your settings, the only correspondence is the token quick.
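
You can double check both analyses with the _analyze API (against your index as it currently stands, i.e. with token_chars removed):

GET my_index/_analyze
{
  "analyzer": "standard",
  "text": "Quick F"
}

GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Quick Foxe"
}

The first request returns quick and f, the second one the edge n-grams listed above.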

The operator and means that each of the search time tokens should match, which you can quickly see is not possible.

That's why I think you have the problem upside-down...
You should index Quick Foxe with my_analyzer and search Quick F with the standard one, and then you'll get a match (and in that case, token_chars: letter probably made sense).
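
That would look roughly like this (same analysis settings as yours with token_chars: letter, just the mapping flipped; a sketch, not tested):

DELETE my_index

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_analyzer",
          "filter": [
            "lowercase"
          ]
        }
      },
      "tokenizer": {
        "my_analyzer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 30,
          "token_chars": [
            "letter"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "my_analyzer",
          "search_analyzer": "standard"
        }
      }
    }
  }
}

PUT my_index/_doc/1
{
  "title": "Quick Foxe"
}

POST my_index/_refresh

GET my_index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "Quick F",
        "operator": "and"
      }
    }
  }
}

Here both search tokens (quick and f) exist among the indexed edge n-grams of Quick Foxe, so even operator: and matches.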

Yep, I am trying to do it the other way around, and the reason is that, like everybody, we are trying to keep our storage costs down. Searching for ten extra name variations has no real impact at search time; storing millions of ngrammed names has a financial cost.

Going back to my question: I actually want the terms split on whitespace, so I have added token_chars back in (because the order of terms is not important in my scenario). I pretty much understand how it works now.

What would be ideal is if you could search for term A (and its edge n-grams) AND term B (and its edge n-grams). Is this possible?

Ok, fair enough. What I would do then is keep everything you have now, but simply replace operator: and with minimum_should_match: x, where x is the number of tokens in your search, in this case 2 (=> quick and f). With token_chars: letter back in place, Quick Foxe analyzes to the edge n-grams of quick and foxe, and both quick and f exist in the index, so that way it will work.

But be advised that you'll get a lot of unexpected false positives once you index more documents.

Thanks Val, you know your stuff. I've learnt a fair bit in the last hour.

I agree this could easily go off the rails depending on the search terms (particularly as the number of search terms increases).

I'll keep playing around with it; I'm thinking I might just do a must boolean search on each of the terms (and their edge n-grams) individually, something like the sketch below. I wonder if I can get a search template to split the terms for me to create each of the must conditions.
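
Roughly this (untested sketch, with the terms split client-side for now):

GET my_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": {
              "query": "Quick",
              "analyzer": "my_analyzer"
            }
          }
        },
        {
          "match": {
            "title": {
              "query": "Foxe",
              "analyzer": "my_analyzer"
            }
          }
        }
      ]
    }
  }
}

Each must clause then only needs one of its own edge n-grams to match, so Quick Foxe would still find the document containing Quick F.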

Thanks again.

Happy to help!!

That's an interesting point you bring up here. In 6.1, they introduced a soft limit between min_gram and max_gram and decided that by default the difference should not be bigger than 1, to prevent an explosion in the number of tokens being indexed.

But as you can see in the linked issue, that was only for ngram, and they explicitly decided not to set any soft limit for edge-ngram, as it was deemed useful to leave a bigger min/max range there.
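
For reference, if you ever do hit that limit with a plain ngram tokenizer, the index-level setting involved is index.max_ngram_diff (from memory, so double check the docs for your version); the index and tokenizer names below are just examples:

PUT my_ngram_index
{
  "settings": {
    "index.max_ngram_diff": 29,
    "analysis": {
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 30
        }
      }
    }
  }
}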
