Custom analyzer on match_phrase

bunch_of_bytes · March 14, 2018, 6:51am

Hi,

I am puzzled by how a custom analyzer behaves on one of my search fields. Here are the index settings and mapping:

{
  "settings": {
    "analysis": {
      "filter": {
        "filter_shingle": {
          "type": "shingle",
          "max_shingle_size": 3,
          "min_shingle_size": 2,
          "output_unigrams": "true"
        },
        "tf_eng_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "tf_title_stop": {
          "type": "stop",
          "stopwords": ["intern", "internship", "senior", "Sr.", "Sr"]
        },
        "tf_synonym": {
          "type": "synonym",
          "synonyms_path": "synonyms.txt"
        }
      },
      "analyzer": {
        "tf_synonym_analyzer": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "tf_eng_stop",
            "tf_synonym"
          ]
        },
        "tf_title_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "tf_title_stop",
            "standard",
            "filter_shingle"
          ]
        },
        "tf_synonym_analyzer_keyword_only": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "tf_eng_stop",
            "tf_synonym"
          ]
        }
      }
    }
  },
  "mappings": {
    "job": {
      "properties": {
        "name": {
          "type": "text"
        },
        "keywords": {
          "type": "text",
          "analyzer": "tf_synonym_analyzer"
        },
        "alias": {
          "type": "text"
        },
        "color": {
          "type": "text"
        },
        "id": {
          "type": "long"
        }
      }
    }
  }
}

Here is the query:

_search

{  
   "query":{  
      "match_phrase":{  
         "alias":{  
            "query":"senior staff engineer/ manager",
            "analyzer":"tf_title_analyzer",
            "boost":1.5
         }
      }
   },
   "_source":{  
      "includes":[  
         "name",
         "color"
      ]
   },
   "highlight":{  
      "fields":{  
         "alias":{  

         }
      }
   }
}

I noticed that if the query is "senior staff engineer", nothing comes up, but if I use "staff engineer", it returns a result. I am not sure why, since the analyzer I specified for the query already includes a stop filter that should drop "senior". Can someone help?

Thanks a lot!

jpountz · March 14, 2018, 10:12am

I suspect shingles are confusing the query parser a bit. Can you share the output of the validate API on your query with rewrite equal to true? https://www.elastic.co/guide/en/elasticsearch/reference/current/search-validate.html
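For reference, a minimal sketch of such a request, assuming the index is named jobs (the name that appears in the explanation output further down the thread). The body below would be sent to the jobs/_validate/query endpoint with rewrite=true added as a URL parameter; it reuses the match_phrase clause from the original search.

```json
{
  "query": {
    "match_phrase": {
      "alias": {
        "query": "senior staff engineer/ manager",
        "analyzer": "tf_title_analyzer"
      }
    }
  }
}
```

The response should contain an explanations array showing the rewritten Lucene query for each index.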

bunch_of_bytes · March 14, 2018, 4:38pm

Thank you very much for your quick response.

Here is the explanation from the validate API:

"explanations": [
      {
        "index": "jobs",
        "valid": true,
        "explanation": "(alias:\"(_ staff _ staff engineer _ staff engineer manager) (staff staff engineer staff engineer manager) (engineer engineer manager) manager\")^1.5"
      }
    ]

It looks like the shingle filter is behaving oddly. Why would it break the words down like that?

"Senior Staff Engineer Manager" should produce something like

staff engineer, engineer manager, ...

One more thing: what is the relationship among the words inside the brackets, e.g. (staff staff engineer staff engineer manager)? Are they combined with OR or AND, or is this just one long string that is treated as a phrase?

What should I do to make the shingle filter behave correctly?

Thanks a lot!

UPDATE
If I use the same setup but remove "senior" from the query, here is the explanation:

"explanations": [
      {
        "index": "jobs",
        "valid": true,
        "explanation": "(alias:\"(staff staff engineer staff engineer manager) (engineer engineer manager) manager\")^1.5"
      }
    ]

This one does return a hit, but I cannot understand what makes the difference.
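One way to see exactly which tokens the analyzer emits is the _analyze API. A minimal sketch, assuming the index is named jobs as in the explanation output: send the body below to the jobs/_analyze endpoint. The leading _ in the first explanation is probably the shingle filter's default filler token, which it inserts at the position where the stop filter removed "senior"; the query without "senior" has no such gap, which would account for the difference.

```json
{
  "analyzer": "tf_title_analyzer",
  "text": "senior staff engineer/ manager"
}
```

Each token in the response carries a position, so shingles that share a position with a unigram are easy to spot.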

jpountz · March 20, 2018, 8:39am

Unfortunately I think there are multiple issues here, some of them hard to fix:

- Shingles do not work well with synonyms at the moment: https://issues.apache.org/jira/browse/LUCENE-3475
- You should only use one shingle size at search time.
- Shingles don't integrate easily with match_phrase.

I'd probably recommend removing shingles from the analyzers.
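A minimal sketch of that suggestion, assuming everything else in the posted settings stays unchanged: this fragment of the analysis section defines the same tf_title_analyzer with filter_shingle dropped from its filter chain.

```json
{
  "analysis": {
    "analyzer": {
      "tf_title_analyzer": {
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "tf_title_stop",
          "standard"
        ]
      }
    }
  }
}
```

The filter_shingle definition can stay in the settings; it simply is no longer referenced by this analyzer.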

bunch_of_bytes · March 20, 2018, 6:39pm

Thank you very much for your response. On a different note, in this case the documents consist mostly of short phrases.

for example:

"Senior Software Developer"
"Data Analysts"

They are not really documents.

I found that if the search query is "Data Engineer", the results may include "Data Analyst", since both contain "data".

Is there any way to index these phrases as they are?

I also looked at span queries, and span_near may be the best fit. However, it requires all span terms to match, and in this case we don't necessarily need ALL of them. For example:

{
    "query": {
        "span_near" : {
            "clauses" : [
                { "span_term" : { "field" : "Senior" } },
                { "span_term" : { "field" : "Engineering" } },
                { "span_term" : { "field" : "Manager" } }
            ],
            "slop" : 12,
            "in_order" : true
        }
    }
}

What if the document contains the phrase "Engineering Manager"? This search would not return it, since it also requires "senior". span_or, on the other hand, does not support in_order or slop. Any suggestions?

Thanks!

jpountz · March 26, 2018, 1:13pm

It is true that documents sharing only one term with the query can match, but at the same time matches that contain both data and analyst should rank higher than those that only contain data.

Query parsers also have a way to make all terms required: have a look at the operator or minimum_should_match options (see the minimum_should_match parameter page in the Elasticsearch Guide).
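As an illustration, a sketch of a match query on the alias field that requires every query term, using "Data Engineer" from the question as the example input:

```json
{
  "query": {
    "match": {
      "alias": {
        "query": "data engineer",
        "operator": "and"
      }
    }
  }
}
```

Replacing "operator": "and" with something like "minimum_should_match": "75%" (an arbitrary example value) would instead require only a share of the terms to match.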

Right. There is no easy answer to this problem. One workaround would be to put your regular query in a must clause and a phrase (or span) query in a should clause. This way the phrase query is not required for matching, but if it matches then it will boost scores.
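A minimal sketch of that workaround, assuming the alias field and an example query string: the match query in must decides which documents are returned, and the match_phrase query in should only adds to the score when the terms appear together. The slop value is an arbitrary illustration.

```json
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "alias": "senior engineering manager"
        }
      },
      "should": {
        "match_phrase": {
          "alias": {
            "query": "senior engineering manager",
            "slop": 2
          }
        }
      }
    }
  }
}
```

With the default OR operator in the match clause, a document containing only "Engineering Manager" still matches, while one containing the full phrase should score higher.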

system · April 23, 2018, 1:13pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
