Custom analyzer on match_phrase

Hi,

I am puzzled by how a custom analyzer behaves on one of my search fields. Here are the index settings and mapping:

{
  "settings": {
    "analysis": {
      "filter": {
        "filter_shingle": {
          "type": "shingle",
          "max_shingle_size": 3,
          "min_shingle_size": 2,
          "output_unigrams": "true"
        },
        "tf_eng_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "tf_title_stop": {
          "type": "stop",
          "stopwords": ["intern", "internship", "senior", "Sr.", "Sr"]
        },
        "tf_synonym": {
          "type": "synonym",
          "synonyms_path": "synonyms.txt"
        }
      },
      "analyzer": {
        "tf_synonym_analyzer": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "tf_eng_stop",
            "tf_synonym"
          ]
        },
        "tf_title_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "tf_title_stop",
            "standard",
            "filter_shingle"
          ]
        },
        "tf_synonym_analyzer_keyword_only": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "tf_eng_stop",
            "tf_synonym"
          ]
        }
      }
    }
  },
  "mappings": {
    "job": {
      "properties": {
        "name": {
          "type": "text"
        },
        "keywords": {
          "type": "text",
          "analyzer": "tf_synonym_analyzer"
        },
        "alias": {
          "type": "text"
        },
        "color": {
          "type": "text"
        },
        "id": {
          "type": "long"
        }
      }
    }
  }
}

Here is the query:

_search

{  
   "query":{  
      "match_phrase":{  
         "alias":{  
            "query":"senior staff engineer/ manager",
            "analyzer":"tf_title_analyzer",
            "boost":1.5
         }
      }
   },
   "_source":{  
      "includes":[  
         "name",
         "color"
      ]
   },
   "highlight":{  
      "fields":{  
         "alias":{  

         }
      }
   }
}

I noticed that if the query is "senior staff engineer", nothing comes up, but if I use "staff engineer", it returns a result. I am not sure why, since the analyzer I specified for the query already includes a stop token filter that should remove "senior". Can someone help?

Thanks a lot!

I suspect shingles are confusing the query parser a bit. Can you share the output of the validate API on your query with rewrite equal to true? https://www.elastic.co/guide/en/elasticsearch/reference/current/search-validate.html
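
For reference, a minimal sketch of that request, assuming the index is called jobs (the name that appears in the validate output further down this thread) and reusing the query from the original post. It would be sent to jobs/_validate/query?rewrite=true with only the query part of the search body:

```json
{
  "query": {
    "match_phrase": {
      "alias": {
        "query": "senior staff engineer/ manager",
        "analyzer": "tf_title_analyzer",
        "boost": 1.5
      }
    }
  }
}
```

The explanation field in the response shows the query after analysis and rewriting.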

Thank you very much for your quick response.

Here is the explanation from the validate API:

"explanations": [
      {
        "index": "jobs",
        "valid": true,
        "explanation": "(alias:\"(_ staff _ staff engineer _ staff engineer manager) (staff staff engineer staff engineer manager) (engineer engineer manager) manager\")^1.5"
      }
    ]

It looks like the shingle filter is doing something odd. Why would it break the words down like that?

I expected "Senior Staff Engineer Manager" to produce something like

staff engineer, engineer manager.....

One more thing: what is the relationship among the terms inside each pair of parentheses, for example (staff staff engineer staff engineer manager)? Are they combined with OR or AND, or is this one long string that is treated as a phrase?

What should I do to make the shingle filter behave correctly?

Thanks a lot!

UPDATE
If I use the same setup but remove "senior" from the query, here is the explanation:

"explanations": [
      {
        "index": "jobs",
        "valid": true,
        "explanation": "(alias:\"(staff staff engineer staff engineer manager) (engineer engineer manager) manager\")^1.5"
      }
    ]

It does have a hit, but I cannot understand what the difference is.
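
One way to see where the difference comes from is to run both strings through the _analyze API with the query-time analyzer. A sketch, again assuming the index is called jobs, sent to jobs/_analyze:

```json
{
  "analyzer": "tf_title_analyzer",
  "text": "senior staff engineer"
}
```

The position reported for each token is what match_phrase works with. The stop filter removes "senior" but leaves a gap at its position, and the shingle filter fills such gaps with its filler token (filler_token, "_" by default), which would explain the "_ staff" terms in the first explanation. Since alias has no analyzer in the mapping, it is indexed with the default standard analyzer, so no term starting with "_" is likely to exist in the index and the first position of the phrase cannot match. In the rewritten query, the terms inside one pair of parentheses appear to be alternatives at the same position, and the groups are consecutive positions of the phrase.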

Unfortunately I think there are multiple issues here, some of them being hard to fix:

I'd probably recommend removing shingles from the analyzers.
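
A minimal sketch of what that could look like for tf_title_analyzer, keeping the names from the original settings. Three details are assumptions of this sketch rather than something stated in the thread: the stopwords are lowercased, because the stop filter is case-sensitive by default and runs after lowercase; "Sr." is dropped, because the standard tokenizer strips the trailing period before the stop filter sees the token; and the standard token filter is omitted, since it does not change the token stream.

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "tf_title_stop": {
          "type": "stop",
          "stopwords": ["intern", "internship", "senior", "sr"]
        }
      },
      "analyzer": {
        "tf_title_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "tf_title_stop"]
        }
      }
    }
  }
}
```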


Thank you very much for your response. On a different note, in this case the documents consist mostly of short phrases.

for example:

"Senior Software Developer"
"Data Analysts"

They are not really documents.

I found that if the search query is "Data Engineer", the results may include "Data Analyst", since both contain "data".

Is there any way to index these phrases as they are?

I also looked at span queries, and span_near seems to be the closest fit. However, the issue is that it requires every span term to match, and in this case we don't necessarily need ALL span terms. For example:

{
    "query": {
        "span_near" : {
            "clauses" : [
                { "span_term" : { "field" : "Senior" } },
                { "span_term" : { "field" : "Engineering" } },
                { "span_term" : { "field" : "Manager" } }
            ],
            "slop" : 12,
            "in_order" : true
        }
    }
}

What if the document contains the phrase "Engineering Manager"? This search would not match it, since it also looks for "senior". span_or, on the other hand, does not support in_order or slop. Any suggestions?

Thanks!

This is true, but at the same time matches that contain both data and analyst should rank higher than those that only contain data.

Query parsers also have a way to make all terms required: have a look at the operator and minimum_should_match options (see "minimum_should_match parameter" in the Elasticsearch Guide [8.11]).
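
As a sketch of the operator option on a plain match query (the field and query text are taken from the examples in this thread):

```json
{
  "query": {
    "match": {
      "alias": {
        "query": "Data Engineer",
        "operator": "and"
      }
    }
  }
}
```

With "operator": "and", a document must contain both "data" and "engineer" to match, so "Data Analyst" would no longer be returned. Replacing that line with a minimum_should_match value (for example 2) gives finer control over how many terms are required when the query has more terms.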

Right. There is no easy answer to this problem. One workaround would be to put your regular query in a must clause and a phrase (or span) query in a should clause. This way the phrase query is not required for matching, but if it matches, it will boost the score.
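
A minimal sketch of that workaround, assuming the alias field from the mapping above and an illustrative query string and slop value:

```json
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "alias": "senior engineering manager"
        }
      },
      "should": {
        "match_phrase": {
          "alias": {
            "query": "senior engineering manager",
            "slop": 2
          }
        }
      }
    }
  }
}
```

Here a document containing only "Engineering Manager" can still match through the must clause, while documents that also satisfy the phrase in the should clause score higher.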
