Edge n-gram search for terms with optional spaces

kedomingo · April 14, 2023, 1:48pm

Short version: I have "Pentium 3" and "Pentium4" in the index. I want to be able to search for "Pentium 4" and get the record for "Pentium4", and to search for "Pentium3" and get the record for "Pentium 3".

I want to use edge-ngram because this index will be used to auto-complete search terms. Right now, both records are being returned for both searches.

I used the Edge n-gram tokenizer docs page to set up the index as follows:

curl -X PUT "localhost:9200/searches?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete",
          "filter": [
            "lowercase"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "lowercase"
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10,
          "token_chars": []
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "term": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}
'

Then I added both documents:

curl -X PUT "localhost:9200/searches/_doc/1?pretty" -H 'Content-Type: application/json' -d'
{
  "term": "Pentium 4" 
}
'

curl -X PUT "localhost:9200/searches/_doc/2?pretty" -H 'Content-Type: application/json' -d'
{
  "term": "Pentium3" 
}
'

And here is the query with its output:

curl -X GET "localhost:9200/searches/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "term": {
        "query": "pentium 4",
        "operator": "and"
      }
    }
  }
}
'
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.18681718,
    "hits" : [
      {
        "_index" : "searches",
        "_id" : "2",
        "_score" : 0.18681718,
        "_source" : {
          "term" : "Pentium3"
        }
      },
      {
        "_index" : "searches",
        "_id" : "1",
        "_score" : 0.17803724,
        "_source" : {
          "term" : "Pentium 4"
        }
      }
    ]
  }
}

I also tried this with min_gram = 2 and I was getting the same result: both docs are being returned.

I also tried keeping two copies of the terms: one for the original (term), and a second one (term_no_space) with the spaces removed. This way, incoming searches also have their spaces removed. But both records still come back:

curl -X GET "localhost:9200/searches/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "term_no_space": {
        "query": "pentium4",
        "operator": "and"
      }
    }
  }
}
'
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.18232156,
    "hits" : [
      {
        "_index" : "searches",
        "_id" : "1",
        "_score" : 0.18232156,
        "_source" : {
          "term" : "Pentium 4",
          "term_no_space" : "pentium4"
        }
      },
      {
        "_index" : "searches",
        "_id" : "2",
        "_score" : 0.18232156,
        "_source" : {
          "term" : "Pentium 3",
          "term_no_space" : "pentium3"
        }
      }
    ]
  }
}

Any hints, or links to existing discussions are welcome! Unfortunately I don't know the exact term for this scenario so I can't find them.
Thank you!

dadoonet · April 14, 2023, 2:56pm

I'd do something like this:

PUT /searches
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete",
          "filter": [
            "lowercase"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "lowercase"
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10,
          "token_chars": []
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "term": {
        "type": "text",
        "fields": {
          "autocomplete": {
            "type": "text",
            "analyzer": "autocomplete",
            "search_analyzer": "autocomplete_search"
          }
        }
      }
    }
  }
}

PUT /searches/_doc/1
{
  "term": "Pentium 4" 
}

PUT /searches/_doc/2
{
  "term": "Pentium3" 
}

GET /searches/_search
{
  "query": {
    "multi_match" : {
      "query":    "pentium 4", 
      "fields": [ "term^3", "term.autocomplete" ] 
    }
  }
}

Not tested, but that should give you both results, with "Pentium 4" at the top of the list.

I have a more complete example at:

https://gist.github.com/dadoonet/5179ee72ecbf08f12f53d4bda1b76bab

kedomingo · April 14, 2023, 3:44pm

Ah right, boosting the fields made a difference to the scores. The scores from my original attempts, which only consider the autocomplete field, are equal. Taking the original field into account as well is a good idea.

kedomingo · April 14, 2023, 4:08pm

I got another insight from another thread here: "Search by digits doesn't work with edge_ngram" (reply #2, by dadoonet).

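I tried the _analyze API and the search analyzer is removing the numbers from my query: the "lowercase" tokenizer splits on every non-letter character, so "pentium 4" is reduced to just "pentium", which matches both documents. I replaced it with the "whitespace" tokenizer followed by a "lowercase" filter instead.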
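
A rough way to see the effect without a cluster is to imitate the two analyzers in plain Python. This is only a sketch under stated assumptions: the helper names are made up for illustration, the regex merely approximates the "lowercase" tokenizer (split on non-letters, then lowercase), and edge_ngrams treats the whole input as a single token, which is what an empty token_chars list does.

```python
import re

def lowercase_tokenizer(text):
    # Approximation of the `lowercase` tokenizer: keep runs of letters only,
    # so digits and spaces act as separators and are discarded.
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]

def whitespace_then_lowercase(text):
    # `whitespace` tokenizer followed by a `lowercase` filter: digits survive.
    return [t.lower() for t in text.split()]

def edge_ngrams(text, min_gram=1, max_gram=10):
    # Index-time terms for the `autocomplete` analyzer, assuming the whole
    # input is one token (empty token_chars) and is lowercased afterwards.
    text = text.lower()
    return {text[:n] for n in range(min_gram, min(max_gram, len(text)) + 1)}

def matches_all(query_terms, indexed_terms):
    # Mimics "operator": "and": every query term must be an indexed term.
    return bool(query_terms) and all(t in indexed_terms for t in query_terms)

print(lowercase_tokenizer("pentium 4"))        # the "4" is dropped
print(whitespace_then_lowercase("pentium 4"))  # the "4" is kept
for doc in ("Pentium 4", "Pentium3"):
    # With the original search analyzer, both documents match.
    print(doc, matches_all(lowercase_tokenizer("pentium 4"), edge_ngrams(doc)))
```

Note that switching the search analyzer alone is not enough for the single-field mapping: the query term "4" is not an edge n-gram of "pentium 4" (every n-gram starts at "p"), which is why a second, space-free field is still needed.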

I also decided to have two fields: original (the text as entered) and term (the same text lowercased with the spaces removed), keeping term.autocomplete as in the example. The query searches term.autocomplete and original with no boosting, and it seems to work as expected now.

PUT /searches
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete",
          "filter": [
            "lowercase"
          ]
        },
        "autocomplete_search": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase"
          ]
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10,
          "token_chars": []
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "original": {
        "type": "text"
      },
      "term": {
        "type": "text",
        "fields": {
          "autocomplete": {
            "type": "text",
            "analyzer": "autocomplete",
            "search_analyzer": "autocomplete_search"
          }
        }
      }
    }
  }
}

PUT /searches/_doc/1
{
  "original": "Pentium 4",
  "term": "pentium4"
}

PUT /searches/_doc/2
{
  "original": "Pentium 3 Intel Processor",
  "term": "pentium3intelprocessor"
}


GET /searches/_search
{
  "query": {
    "multi_match": {
      "query": "pentium3",
      "fields": ["term.autocomplete", "original"],
      "operator": "and"
    }
  }
}
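
For anyone who wants to sanity-check this setup offline, here is a minimal Python sketch of how the final mapping and query behave. It rests on several assumptions: make_doc is a hypothetical helper (the documents above were written by hand), standard_tokens is only a rough stand-in for the default standard analyzer on simple ASCII text, and search ignores scoring and just reports which documents match all terms in at least one field.

```python
import re

MAX_GRAM = 10  # same value as "max_gram" in the tokenizer settings

def make_doc(original: str) -> dict:
    # Hypothetical helper: derive `term` by lowercasing and removing whitespace.
    return {"original": original, "term": "".join(original.split()).lower()}

def edge_ngrams(text: str, min_gram: int = 1, max_gram: int = MAX_GRAM) -> set:
    # Index-time terms of term.autocomplete (whole input treated as one token).
    text = text.lower()
    return {text[:n] for n in range(min_gram, min(max_gram, len(text)) + 1)}

def standard_tokens(text: str) -> set:
    # Rough approximation of the `standard` analyzer for plain ASCII input.
    return {t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)}

def search(query: str, docs: list) -> list:
    # A doc is a hit if ALL query terms match within at least one field.
    hits = []
    for doc in docs:
        # term.autocomplete: whitespace tokenizer + lowercase filter at search time
        auto_terms = [t.lower() for t in query.split()]
        grams = edge_ngrams(doc["term"])
        auto_hit = bool(auto_terms) and all(t in grams for t in auto_terms)
        # original: analyzed the same way at index and search time
        std_terms = standard_tokens(query)
        orig_hit = bool(std_terms) and std_terms <= standard_tokens(doc["original"])
        if auto_hit or orig_hit:
            hits.append(doc["original"])
    return hits

docs = [make_doc("Pentium 4"), make_doc("Pentium 3 Intel Processor")]
for q in ("pentium3", "pentium 3", "pentium4", "pentium 4", "pent", "pentium3intel"):
    print(repr(q), "->", search(q, docs))
```

In this sketch the spaced and unspaced queries each return only the intended document, and a prefix like "pent" returns both. The last query returns nothing: a space-free query longer than max_gram (10 characters here) has no matching n-gram, so raising max_gram may be worth considering if long terms need to autocomplete.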

system · May 12, 2023, 4:09pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
