I'm trying to convert a field that currently uses the edge_ngram tokenizer to use the edge_ngram token filter instead, but tokens created by the edge_ngram filter are not scoring the same as identical tokens created by the edge_ngram tokenizer. Here's what I'm seeing. First, an example index with just two fields, one using the edge_ngram tokenizer and one using the edge_ngram filter:
DELETE /test1
PUT /test1
{
  "settings": {
    "analysis": {
      "analyzer": {
        "edge_tokenizer": {
          "tokenizer": "my_edge_ngram_tokenizer"
        },
        "edge_filter": {
          "tokenizer": "standard",
          "filter": ["my_edge_ngram_filter"]
        }
      },
      "filter": {
        "my_edge_ngram_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10
        }
      },
      "tokenizer": {
        "my_edge_ngram_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "tok": {
        "type": "text",
        "similarity": "boolean",
        "analyzer": "edge_tokenizer"
      },
      "filt": {
        "type": "text",
        "similarity": "boolean",
        "analyzer": "edge_filter"
      }
    }
  }
}
PUT /test1/_doc/1
{
  "tok": "foobar",
  "filt": "foobar"
}
PUT /test1/_doc/2
{
  "tok": "flub",
  "filt": "flub"
}
Now if I run a search against the "tok" field, I see the results I expect: the doc where all four query tokens match ("flub") scores 4.0, and the doc where only the one-character token "f" matches ("foobar") scores 1.0:
GET /test1/_search
{
  "query": { "match": { "tok": "flub" }}
}
Response:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 4.0,
    "hits" : [
      {
        "_index" : "test1",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 4.0,
        "_source" : {
          "tok" : "flub",
          "filt" : "flub"
        }
      },
      {
        "_index" : "test1",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "tok" : "foobar",
          "filt" : "foobar"
        }
      }
    ]
  }
}
However, if I run the same search against the "filt" field, both documents score 1.0:
GET /test1/_search
{
  "query": { "match": { "filt": "flub" }}
}
Response:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test1",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "tok" : "foobar",
          "filt" : "foobar"
        }
      },
      {
        "_index" : "test1",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "tok" : "flub",
          "filt" : "flub"
        }
      }
    ]
  }
}
which I don't understand at all. Both analyzers generate the same tokens from the same input, just with different types and positions:
GET /test1/_analyze
{
  "analyzer": "edge_tokenizer",
  "text": "flub"
}
Response:
{
  "tokens" : [
    {
      "token" : "f",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "fl",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "flu",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "flub",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 3
    }
  ]
}
GET /test1/_analyze
{
  "analyzer": "edge_filter",
  "text": "flub"
}
Response:
{
  "tokens" : [
    {
      "token" : "f",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "fl",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "flu",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "flub",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}
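For completeness, I know _analyze also accepts an "explain": true flag which, if I'm reading the docs right, reports the lower-level token attributes (such as positionLength) at each step of the analysis chain. I'm not sure which of those attributes matter for scoring, but this is the request I'd use to dig deeper:
GET /test1/_analyze
{
  "analyzer": "edge_filter",
  "text": "flub",
  "explain": true
}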
I don't really understand how the type and position affect the scoring. I thought scoring just took each token from the query and looked for a matching token in the index, adding 1 to the score for each match (with boolean similarity). Every Google hit I can find says the edge_ngram tokenizer and filter do the same thing, just at different points in the analysis pipeline.
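If it helps with diagnosing this, I believe the explain endpoint (GET /<index>/_explain/<id>) will show exactly which term queries contribute to a document's score; this is the request I'd run against the "filt" field:
GET /test1/_explain/2
{
  "query": { "match": { "filt": "flub" }}
}
Can someone explain to me what's going on here? Thanks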