Whitespace in search term causes ES to return all entries when ngram analyzer is used


(Shazvi Ahmed) #1

Hi,
I'm using Elasticsearch 1.5.2 provided by Amazon AWS, and I connect to it with the official PHP library. My problem is this: if I apply an ngram analyzer to the searchable fields and then search them with a "multi_match" query, and the search term contains whitespace, the query returns all the entries instead of just the relevant ones. If I remove the ngram analyzer, multi-word queries behave normally and return the relevant results, but then I lose partial matching.

The following is what I've tried:

  1. Setting the tokenizer to whitespace or to keyword
  2. Defining ngram as a tokenizer instead of as token filter
  3. Using a pattern tokenizer and specifying whitespace as the pattern so that it splits by space.
  4. Using edge-ngram instead of ngram

None of the above made any difference. One other thing I tried is the word_delimiter filter with catenate_all set to true. Its effect was to join multiple words into a single word by removing the spaces. Coupled with the ngram filter, this seemed to work in some cases, but it's obviously not a viable solution because there are too many edge cases that I can't account for (like when the two search terms aren't in the same position).

My requirement is to have partial matching and also to allow a search term with spaces in it.

Following is my code.

// Analyzer
'analysis' => array(
    "filter" => array(
        "ngram_token_filter" => array(
            "type" => "nGram",
            "min_gram" => "1",
            "max_gram" => "15"
        )
    ),
    'analyzer' => array(
        'ngram_analyzer' => array(
            'type' => 'custom',
            'tokenizer' => 'standard',
            'filter' => array(
                'lowercase',
                'ngram_token_filter'
            )
        )
    )
)

// Mapping
'properties' => array(
    'title' => array('type' => 'string', 'analyzer' => 'ngram_analyzer'),
    'description' => array('type' => 'string', 'analyzer' => 'ngram_analyzer'),
    'type' => array('type' => 'string', 'analyzer' => 'ngram_analyzer'),
    'status' => array('type' => 'byte')
)

// Query
'query' => array(
    'filtered' => array(
        'query' => array(
            'multi_match' => array(
                'query' => $searchTerm,
                'type' => 'most_fields',
                "minimum_should_match" => "75%",
                'fields' => array('title^2', 'description', 'type')
            )
        ),
        'filter' => array(
            'bool' => array(
                'must' => array(
                    array(
                        'term' => array(
                            'status' => 1
                        )
                    )
                )
            )
        )
    )
)

Any help would be greatly appreciated. Thanks.


(Shazvi Ahmed) #2

Hi,
So the problem was that I hadn't specified a search analyzer. Once I specified the whitespace analyzer as the search analyzer in the mapping, the issue went away and I could do ngram partial matching on multi-word queries. So that fixed it for me.

// Mapping
'properties' => array(
    'title' => array('type' => 'string', 'analyzer' => 'ngram_analyzer', 'search_analyzer' => 'whitespace'),
    'description' => array('type' => 'string', 'analyzer' => 'ngram_analyzer', 'search_analyzer' => 'whitespace'),
    'type' => array('type' => 'string', 'analyzer' => 'ngram_analyzer', 'search_analyzer' => 'whitespace'),
    'status' => array('type' => 'byte')
)
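
A rough sketch of why a separate search analyzer might help. This is an assumption about the cause and is not confirmed in the thread. When no search_analyzer is set, Elasticsearch analyzes the query with the field's index analyzer, so the search term is itself broken into ngrams. With min_gram set to 1, that includes single characters, which occur in almost every document. The helpers below (ngramTokens and whitespaceTokens are hypothetical names, not part of Elasticsearch or the PHP client) only approximate the two analyzers in plain PHP so the resulting query terms can be compared.

```php
<?php
// Hypothetical helper: approximates "split on whitespace, lowercase, nGram filter".
// It only imitates the ngram_analyzer above; it does not call Elasticsearch.
function ngramTokens(string $text, int $minGram = 1, int $maxGram = 15): array
{
    $tokens = [];
    $words = preg_split('/\s+/', strtolower(trim($text)), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        $length = strlen($word);
        for ($start = 0; $start < $length; $start++) {
            // Emit every gram of size minGram..maxGram that fits in the word.
            for ($size = $minGram; $size <= $maxGram && $start + $size <= $length; $size++) {
                $tokens[] = substr($word, $start, $size);
            }
        }
    }
    // Drop duplicate grams and reindex the array.
    return array_values(array_unique($tokens));
}

// Hypothetical helper: approximates the built-in whitespace analyzer,
// which splits on whitespace and does not lowercase.
function whitespaceTokens(string $text): array
{
    return preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
}

$searchTerm = 'red car';

// Query terms if the ngram analyzer is also applied at search time:
// the list contains single letters such as "r", "e" and "a".
print_r(ngramTokens($searchTerm));

// Query terms with a whitespace search analyzer: just the whole words,
// which are then matched against the ngrams stored in the index.
print_r(whitespaceTokens($searchTerm));
```

Since the whitespace analyzer does not lowercase and the indexed grams are lowercased, a search term containing uppercase letters may need to be lowercased before it is sent.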
