Can't match this phrase!


(cayala@courthousenews.com) #1

Hi Elastic team,

I've been trying for a few days now to fix an issue where my search for the phrase "Finished IT Project" won't match anything. Here is what I have tried so far:

1- Removed IT as a stop word
2- Removed all stop words
3- Match Phrase as:

{
    "query": {
        "match_phrase" : {
            "message" : {
                "query" : "Finished IT Project",
                "analyzer" : "my_english"
            }
        }
    }
}

4- Common Term as:

{
  "query": {
    "common": {
      "body": {
        "query": "Finished Court IT Project",
        "cutoff_frequency": 0.001,
        "minimum_should_match": {
          "low_freq": 2,
          "high_freq": 3
        }
      }
    }
  }
}

5- must match as:

{
    "query": {
       "bool": {
         "must": {
           "term": {
             "post_title": "Finished IT Project"
           }
         }
       }
    }
}

Here is the result of the analyzer:

{
  "analyzer": "my_english",
  "text": "Finished IT Project"
}

Result:

{
  "tokens": [
    {
      "token": "finish",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "project",
      "start_offset": 12,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}

Note: at this point I restored the stop words, since removing them slowed down searching.

None of these options returns any story with those words in its text. I know we have a few, and those three words are present in at least two of our titles, but none are returned. I did notice that the word "Finished" gets changed to "finish" for some odd reason.

Any help will be much appreciated, thanks and happy new year.


(cayala@courthousenews.com) #2

Bump


(Abdon Pijpelink) #3

Can you share:

  • A sample document that you would expect to find.
  • The mapping of the index - especially for the fields that you are querying: message, body and post_title.
  • The definition of the my_english analyzer from your index settings.

(cayala@courthousenews.com) #4

A sample document would be an article from our WordPress site, which is saved in a MySQL database. Currently I'm searching by title, and the exact title is:

"‘Finished’ Court IT Project to Cost State 100s of Millions for Years"

The mapping of the field is:

t post_title

Definition of the analyzer:

"analyzer": {
            "my_english": {
              "type": "english",
              "stopwords": "_english_"
            },

I really need help with this, since everything I've tried has failed. I notice I also can't search for terms with single quotes in them, like "Mary's Chair" or "D'ionge was her name" for example. Any help you can provide will be much appreciated, thanks!


(Abdon Pijpelink) #5

Alright, let's try some things.

1 - I notice that in the queries you posted in your first post, you're querying three different fields: message, body and post_title. Should all three match "Finished IT Project"?

2 - What happens if you do not override the analyzer in your match_phrase query?

{
  "query": {
    "match_phrase": {
      "message": {
        "query": "Finished IT Project"
      }
    }
  }
}

3 - What happens if you do a match instead of a match_phrase? If you get hits, can you post the response here, so we can get some sense of what your documents look like?

{
  "query": {
    "match": {
      "message": {
        "query": "Finished IT Project"
      }
    }
  }
}

4 - If the two queries above return no hits, what happens if you change message into post_title?

5 - To get the mappings of the index, hit the _mappings endpoint of the index. For example, if your index is called my_index, can you execute the following and post the response here?

GET my_index/_mappings


(cayala@courthousenews.com) #6

1- Yes, I just noticed that. I must have copied different things from my Kibana console, sorry.

2.- This is the result without the override to the analyzer:

Query:

{
    "query": {
        "match_phrase" : {
            "post_title" : {
                "query" : "Finished Court IT Project"
            }
        }
    }
}

Result:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

3- By doing a match on both _all and post_title:

Result:

{
  "took": 17,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 92087,
    "max_score": 16.506521,
    "hits": [
      {

The title in question comes second, so this is good. I think I then have an issue with the filters in my search page (which uses C# NEST), because the same query there does not return the results in the same order. Maybe I'll have to add a search-by-headline filter.

4.- message is the wrong field. I think I copied it from the examples in the Elastic docs by mistake. I'm using either _all or post_title in my actual queries.

5.- This is the mapping result:

 "post_title": {
            "type": "text",
            "fields": {
              "post_title": {
                "type": "text",
                "analyzer": "standard"
              },
              "raw": {
                "type": "keyword",
                "ignore_above": 10922
              },
              "sortable": {
                "type": "keyword",
                "ignore_above": 10922,
                "normalizer": "lowerasciinormalizer"
              }
            }
          },

Body:

      "post_content": {
        "type": "text"
      },

Things to notice though:

This is the actual full headline:

‘Finished’ Court IT Project to Cost State 100s of Millions for Years

Please note the word "Finished" is wrapped in single quotes. This is the result from

index/_analyze

{
  "tokens": [
    {
      "token": "finish",
      "start_offset": 1,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "court",
      "start_offset": 11,
      "end_offset": 16,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "project",
      "start_offset": 20,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "cost",
      "start_offset": 31,
      "end_offset": 35,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "state",
      "start_offset": 36,
      "end_offset": 41,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "100s",
      "start_offset": 42,
      "end_offset": 46,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "100",
      "start_offset": 42,
      "end_offset": 45,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "s",
      "start_offset": 45,
      "end_offset": 46,
      "type": "<ALPHANUM>",
      "position": 8
    },
    {
      "token": "million",
      "start_offset": 50,
      "end_offset": 58,
      "type": "<ALPHANUM>",
      "position": 10
    },
    {
      "token": "year",
      "start_offset": 63,
      "end_offset": 68,
      "type": "<ALPHANUM>",
      "position": 12
    }
  ]
}

I think I can't do a match_phrase because:

  1. Single quotes are being ignored
  2. Finished is changed to finish
  3. IT is being ignored

If I search for the phrase : " IT project to cost state"

This returns a match_phrase correctly as most of the words are taken into account.
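
One way to picture why the shorter phrase fails: match_phrase compares token positions, and a removed stop word still leaves a gap in them. Below is a minimal Python sketch of that check, using the token positions from the two _analyze outputs in this thread. It is a simplified model rather than how Lucene actually implements phrase queries, it assumes the field is indexed with the same analyzer as the query, and phrase_matches is a name made up for illustration. The positions for the last two queries are inferred from the stop-word gaps seen above, not taken from real output.

```python
def phrase_matches(query_tokens, doc_tokens):
    """Return True if every query term appears in the document at the
    same relative position (a phrase match with no slop).

    Both arguments are lists of (term, position) pairs.
    """
    # Build term -> set of positions for the document.
    doc_positions = {}
    for term, pos in doc_tokens:
        doc_positions.setdefault(term, set()).add(pos)

    if not query_tokens:
        return False

    first_term, first_pos = query_tokens[0]
    # Try to align the first query term with each of its occurrences.
    for start in doc_positions.get(first_term, ()):
        offset = start - first_pos
        if all(pos + offset in doc_positions.get(term, ())
               for term, pos in query_tokens):
            return True
    return False


# Positions from the _analyze output of the full headline.
title = [("finish", 0), ("court", 1), ("project", 3), ("cost", 5),
         ("state", 6), ("100s", 7), ("100", 7), ("s", 8),
         ("million", 10), ("year", 12)]

# "Finished IT Project": positions from the first _analyze output.
short_query = [("finish", 0), ("project", 2)]

# Inferred positions (assumption): stop words dropped, gaps kept.
full_query = [("finish", 0), ("court", 1), ("project", 3)]
tail_query = [("project", 1), ("cost", 3), ("state", 4)]

print(phrase_matches(short_query, title))  # "Court" is missing from the query
print(phrase_matches(full_query, title))
print(phrase_matches(tail_query, title))
```

In this model "Finished IT Project" cannot match the headline, because the query expects "project" two positions after "finish" while the headline has it three positions after (the word "Court" sits in between). Adding "Court" to the phrase, or allowing some slop, removes that mismatch.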

The solution I'm looking for is help with match_phrase, or a way to do an exact match (using a double-quote-wrapped/boolean search), which right now works for the most part, except with this type of phrase that contains single quotes and stop words.

That's where I'm getting confused, since there seems to be a solution already in place with common terms, but I can't seem to make it work.


(Abdon Pijpelink) #7

The problem lies in the analyzer that you are overriding at query time.

Your mapping shows that post_title is analyzed using the standard analyzer. The standard analyzer basically breaks up your strings into individual words, removes punctuation and lowercases them. So a word like ‘Finished’ becomes finished. So that's what's happening at index time: Elasticsearch puts the word finished into the inverted index for this specific document.

At query time, when you apply the my_english analyzer (based on the english analyzer), the word ‘Finished’ is "stemmed". This means that the word is reduced to its root form finish as you have noticed. That's why your queries do not find any hits: there is no term finish in the inverted index.
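
To make the mismatch concrete, here is a rough Python sketch with two toy analyzers. These are stand-ins written for illustration only, not the real standard or english analyzers: standard_like just splits on word characters and lowercases, and english_like adds a four-word stop list and a naive suffix stripper in place of the real stemmer. All function names are hypothetical.

```python
import re

# Tiny subset of the English stop list, enough for this headline.
STOPWORDS = {"it", "to", "of", "for"}


def standard_like(text):
    """Toy stand-in for the standard analyzer: word tokens, lowercased."""
    return [t.lower() for t in re.findall(r"\w+", text)]


def toy_stem(token):
    """Naive suffix stripping; the real analyzer uses a proper stemmer."""
    for suffix in ("ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token


def english_like(text):
    """Toy stand-in for an english-type analyzer: stop words + stemming."""
    return [toy_stem(t) for t in standard_like(text) if t not in STOPWORDS]


title = "‘Finished’ Court IT Project to Cost State 100s of Millions for Years"
query = "Finished Court IT Project"

# Index time: terms that end up in the inverted index.
indexed_terms = set(standard_like(title))

# Query time with the overriding analyzer vs. the field's own analyzer.
missing_with_override = [t for t in english_like(query) if t not in indexed_terms]
missing_without_override = [t for t in standard_like(query) if t not in indexed_terms]

print(missing_with_override)     # terms the stemmed query cannot find
print(missing_without_override)  # terms the unstemmed query cannot find
```

With the override, the query asks for finish, which was never indexed; without it, every query term is present in the index.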

I think the following query should work for you:

{
  "query": {
    "common": {
      "post_title": {
        "query": "Finished Court IT Project",
        "cutoff_frequency": 0.001,
        "minimum_should_match": {
          "low_freq": 2,
          "high_freq": 3
        }
      }
    }
  }
}

And for phrases, the following should work:

{
    "query": {
        "match_phrase" : {
            "post_title" : {
                "query" : "Finished Court IT Project"
            }
        }
    }
}

(cayala@courthousenews.com) #8

First of all, thanks so much for taking the time to look at this with me, I really appreciate it. Unfortunately, as you can see in my original post, I tried both of the queries you suggest and both return 0 hits. I tried again just now to make sure, and again they return 0 hits.

Should I reindex my stories? Now I understand a bit better what the analyzers are doing and their behavior thanks to you but I'm still confused as to why the common phrase query won't help me with this issue with the analyzer.

UPDATE:

To my surprise if I change both to _all it works, I think there might be an issue with my fields then:

{
    "query": {
        "match_phrase" : {
            "_all" : {
                "query" : "'Finished' Court IT Project"
            }
        }
    }
}

Result:

 "hits": {
    "total": 1,
    "max_score": 8.366125,
    "hits": [
      {

Query:

{
  "query": {
    "common": {
      "_all": {
        "query": "'Finished' Court IT Project",
        "cutoff_frequency": 0.001,
        "minimum_should_match": {
          "low_freq": 2,
          "high_freq": 3
        }
      }
    }
  }
}

Result

 "hits": {
    "total": 2867,
    "max_score": 16.505928,
    "hits": [
      {

(Abdon Pijpelink) #9

I am at a loss why it works for _all and not for the post_title field. You cut off the hits from the search responses, so we can't see what that field looks like in your documents. You may be right, maybe there is a problem with your ingestion process that screws up your fields? Maybe there's something funky going on with your analyzers? Is there a default analyzer defined in your index settings? When you say you changed the stop words, do you mean that you changed analyzer settings by closing and opening the index?


(cayala@courthousenews.com) #10

Correct, I stopped the index, then I changed the analyzer like this:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type":      "english", 
          "stopwords": "_english_" 
        }
      }
    }
  }
}

Then I restarted the index. I had to enable opening/closing of indices by running:

{
  "persistent": {
    "cluster.indices.close.enable": true
  }
}

UPDATE

I think I found the issue:

 "mappings": {
      "post": {
        "_all": {
          "analyzer": "simple"
        },

And

 "post_title": {
            "type": "text",
            "fields": {
              "post_title": {
                "type": "text",
                "analyzer": "standard"
              },

As you can see, one uses the simple analyzer and the other uses the standard analyzer. Is there a way I can change the analyzer for some of the fields to match the simple analyzer?

UPDATE 2:

So what I tried to do was change the analyzer with:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard": {
          "type":      "simple", 
          "stopwords": "_english_" 
        }
      }
    }
  }
}

But it seems that it still has the same behavior. I can't find anything else in the documentation that tells me how to use a specific analyzer for certain mappings/fields.
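
For reference, a field's analyzer is set with the analyzer parameter in that field's mapping, and it cannot be changed on a field that already exists, so in practice this means creating a new index and reindexing into it. The following is a minimal Python sketch that only builds and prints such a create-index body; it does not talk to a cluster. The index name posts_v2 and the helper build_index_body are hypothetical, while the post mapping type, the raw sub-field and the my_english definition are copied from this thread.

```python
import json

# Hypothetical name for the new index; documents would be reindexed into it.
NEW_INDEX = "posts_v2"


def build_index_body(field_analyzer="my_english"):
    """Return a create-index body that analyzes post_title with one analyzer."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    # Same definition as the my_english analyzer in this thread.
                    "my_english": {"type": "english", "stopwords": "_english_"}
                }
            }
        },
        "mappings": {
            "post": {
                "properties": {
                    "post_title": {
                        "type": "text",
                        # Used at index time and, by default, at query time too.
                        "analyzer": field_analyzer,
                        "fields": {
                            "raw": {"type": "keyword", "ignore_above": 10922}
                        },
                    }
                }
            }
        },
    }


body = build_index_body()
print("PUT " + NEW_INDEX)
print(json.dumps(body, indent=2))
```

Passing "simple" or "standard" as field_analyzer gives the same body with a built-in analyzer on the field. Whichever is chosen, queries should not override it, so that the same analysis runs at index time and at query time.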


(Russ Cam) #11

If you're using NEST, take a look at the client-specific section on analysis to see how to specify analyzers and test them. The multi-fields documentation may also help.

