Performance issue when using edge ngram

I am implementing autocomplete for the agency name, civil service title, and fiscal year fields. I have no issue with civil_service_title, but I see very slow performance with agency and fiscal year.
I also plan to have multiple sub-fields on agency, as I use it for different purposes (autocomplete, keyword, and text).
I want to know if my mapping needs to be changed. Also, how can I make sure my autocomplete functionality works for special characters like #, ~, and &? It does not give me results when I type any name that includes special characters, for example: (Department of education #10)

PUT indexName
{
  "settings": {
    "analysis": {
      "filter": {
        "gramFilter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 30,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      },
      "analyzer": {
        "index_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "trim"
          ]
        },
        "search_string_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase"
          ]
        },
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "trim",
            "gramFilter",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "Type": {
      "properties": {
        "civil_service_title": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          },
          "analyzer": "autocomplete",
          "search_analyzer": "standard"
        },
        "agency_name": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          },
          "analyzer": "autocomplete",
          "search_analyzer": "standard"
        },
        "fiscal_year": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          },
          "analyzer": "autocomplete",
          "search_analyzer": "standard"
        }
      }
    }
  }
}

Isn't that a bit excessive?

"min_gram" : 1,
"max_gram" : 30,

Use case wise, I'd expect the user to type at least 2 or 3 characters before proposing anything. Going up to 30 letters means that after typing the first 10 letters you still need to "complete"? That sounds like too many terms to me.

BTW, do you really want to use a keyword tokenizer to propose the completion?
I.e., for Department of education #10 you can only type depar, and nothing is suggested for educ?
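You can check this with the _analyze API (a sketch; indexName is whatever you called your index). The keyword tokenizer emits the whole input as a single token, so edge ngrams can only grow from the start of the full string:

POST indexName/_analyze
{
  "tokenizer": "keyword",
  "text": "Department of education #10"
}

This returns one token, "Department of education #10". With "tokenizer": "standard" instead, each word becomes its own token, so ngrams of education are also produced and educ can match.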

About

special characters like #,~,&

You defined:

"token_chars": [
  "letter",
  "digit"
]

Take a look at the list of the options there: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenizer.html
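Note that token_chars is documented as a parameter of the edge_ngram tokenizer rather than the token filter. A sketch of keeping # and & in the tokens, assuming a hypothetical gram_tokenizer name and that "punctuation" and "symbol" cover the characters you care about:

"tokenizer": {
  "gram_tokenizer": {
    "type": "edge_ngram",
    "min_gram": 2,
    "max_gram": 10,
    "token_chars": [
      "letter",
      "digit",
      "punctuation",
      "symbol"
    ]
  }
}

With only "letter" and "digit" listed, characters like #, ~, and & act as token separators and are dropped from the tokens.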


My use case is something like this: I have terms like department of education, department of sanitation, department of social services, and so on. If I don't want to stop at 3 typed characters, will I have to raise max_gram up to 10? Earlier, in Solr, we implemented this using a regular expression, and the tokenizer was keyword.
Also, I should be able to type education to find the result department of education. Can you please let me know the best analyzer that suits this scenario?
Also, my question is: would adding multiple fields with analyzers reduce the speed?
https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html
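With multi-fields, each sub-field is indexed separately and you query whichever one fits; the extra fields mainly cost index size and indexing time. A sketch of querying the two variants defined in the mapping above (field names taken from that mapping):

GET indexName/_search
{
  "query": {
    "match": {
      "agency_name": "educ"
    }
  }
}

GET indexName/_search
{
  "query": {
    "term": {
      "agency_name.keyword": "Department of education #10"
    }
  }
}

The first hits the ngram-analyzed text field for autocomplete; the second does an exact match against the untouched keyword sub-field.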

Give it a try. And see how it fits your use case.

The _analyze API will help a lot to see how your text will be indexed.
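For example, to see exactly which tokens your custom analyzer produces at index time (run against your own index, since the analyzer is defined there):

POST indexName/_analyze
{
  "analyzer": "autocomplete",
  "text": "Department of education #10"
}

The response lists every generated token, which makes it easy to spot whether the special characters survive and how many ngrams each value expands into.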

@dadoonet thank you.

@dadoonet I am planning to use a regexp query with index_analyzer for my use case. Please let me know if this is optimal.

Regexp, wildcard, and prefix queries are not optimal IMHO.

If you are looking for the optimal autocomplete solution, look at this API: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters-completion.html
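A minimal sketch of the completion suggester, assuming a hypothetical agency_name_suggest field (the mapping level mirrors the "Type" used earlier in this thread):

PUT indexName
{
  "mappings": {
    "Type": {
      "properties": {
        "agency_name_suggest": {
          "type": "completion"
        }
      }
    }
  }
}

POST indexName/_search
{
  "suggest": {
    "agency-suggest": {
      "prefix": "depar",
      "completion": {
        "field": "agency_name_suggest"
      }
    }
  }
}

Completion fields are backed by an in-memory FST, which is why this API is typically much faster for autocomplete than ngram-based text queries.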

Otherwise edge ngram seems to me a good thing to do.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.