Phrase Query with Prefix produces 0 results

We are replacing a homegrown search engine written with pure Lucene (currently 4.9.1) with ElasticSearch. After adding the search analyzers we used in the old system, I'm now trying to replicate our queries. Right now I'm looking at replacing our "starts with" query with the same syntax in ElasticSearch. It appears that either query_string or simple_query_string should support the original Lucene syntax, but that doesn't seem to be the case.

I have a field that is analyzed using whitespace tokenizer and both lowercase and asciifolding filters. I'm trying to run a query that has a phrase query with a prefix.

Specifically, we are trying to find documents with one or more names, but I expect it to work in other cases as well. For example, I want to be able to find all documents with a name starting with “Smith John”, such as “Smith John”, “Smith Johnny”, “Smith John A”, etc. If a document has “Smith Barry” and “Wilson John”, but doesn’t have a version of “Smith John”, I don’t want to find that document.

This query will find all documents that have exactly "smith john" in the field "name" but no other variations.

{
    "query": {
        "simple_query_string": {
            "query": "\"smith john\"",
            "fields": ["name"],
            "default_operator": "AND"
        }
    }
}

If I remove the quotes and add a prefix operator, the query finds “Smith John”, “Smith John A”, and “Smith Johnny”, but it also finds any document with both “Smith Barry” and “Wilson John”, because the terms are matched across all name instances in the document rather than within a single one.

{
    "query": {
        "simple_query_string": {
            "query": "smith john*",
            "fields": ["name"],
            "default_operator": "AND"
        }
    }
}

The next query is what I'm trying to use, and it does work in our old system with pure Lucene. The quotes tell it to search across only one instance, and the asterisk (*) tells it to do a prefix query. However, in ElasticSearch that same syntax never produces any results. I’m guessing it is actually looking for “john*” instead of treating the asterisk as a prefix operator.

{
    "query": {
        "simple_query_string": {
            "query": "\"smith john*\"",
            "fields": ["name"],
            "default_operator": "AND"
        }
    }
}

I have tried variations of query_string as well with similar results.

I have successfully done this using "match_phrase_prefix" to search for "smith john", but that comes with its own limitations, such as not allowing wildcards and needing to know or guess a value for max_expansions. I found that if the value is too small I get partial results, and the documentation warns that too large a value affects performance.
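For reference, here is a minimal sketch of the match_phrase_prefix request I mean, written as a small Python helper that only builds the request body. The helper name and the value 200 are illustrative assumptions, not something from our system; "name" is the field from the queries above. As far as I know, max_expansions defaults to 50 when it is omitted.

```python
import json


def match_phrase_prefix_query(field, text, max_expansions=50):
    """Build a match_phrase_prefix request body as a plain dict.

    The last term of `text` is treated as a prefix; max_expansions caps
    how many indexed terms that prefix may expand to.
    """
    return {
        "query": {
            "match_phrase_prefix": {
                field: {
                    "query": text,
                    "max_expansions": max_expansions,
                }
            }
        }
    }


# Hypothetical usage: 200 is a guessed cap, not a recommended value.
print(json.dumps(match_phrase_prefix_query("name", "smith john", 200), indent=2))
```

The trade-off described above sits entirely in that one number: a low cap can silently drop matching names, and a high cap makes the prefix expansion more expensive.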

What do I need to change to get the results I want from this query? Thank you.

I did figure out that we do have a custom implementation of Query that is the reason this works for us now. However, I would be much happier if there was an out-of-the-box way to do it in ElasticSearch.

It actually depends on your mapping and the analyzer you are using. Can you please share them as well?

The analyzer we have is:

{
	"MyAnalyzer": {
		"filter": [
			"lowercase",
			"asciifolding"
		],
		"type": "custom",
		"tokenizer": "whitespace"
	}
}

and the field mapping is:

{
	"text_field": {
		"type": "text",
		"fields": {
			"raw": {
				"type": "keyword"
			}
		},
		"analyzer": "MyAnalyzer"
	}
}

I think you should use an edge_ngram filter in your analyzer for indexing and run a match_phrase query with the standard analyzer at search time. You can read more about it here

Regarding your second question, can you explain a bit more?

I tried your solution using edge_ngram. What I found is that it prefix-matches every name in the phrase, not just the last one. Instead of just finding "smith john" it also found "smithson john". This makes me think that match_phrase_prefix is still the better solution for us.
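To show why that happens, here is a toy Python model of the index-time analysis. It is only a sketch and not how Lucene is actually implemented: I am assuming each token is expanded into all of its leading substrings at the same position, and the min_gram and max_gram values are made up for the illustration.

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    """Return the leading substrings of a token (assumed gram sizes)."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def indexed_positions(text):
    """Simulate indexing: whitespace split, lowercase, edge n-grams per position."""
    return [set(edge_ngrams(tok.lower())) for tok in text.split()]


def phrase_matches(indexed, phrase):
    """True if the phrase terms occur at consecutive positions in the toy index."""
    terms = phrase.lower().split()
    for start in range(len(indexed) - len(terms) + 1):
        if all(term in indexed[start + i] for i, term in enumerate(terms)):
            return True
    return False


# "smith" is one of the grams stored for "smithson", so the phrase matches.
print(phrase_matches(indexed_positions("Smithson John"), "smith john"))
# "john" is not a gram of "barry", so this one does not match.
print(phrase_matches(indexed_positions("Smith Barry"), "smith john"))
```

Because every indexed token carries its own prefixes, every term of the phrase behaves like a prefix, which is not what a "starts with" on only the last term needs.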

Have you used match query or match_phrase query? You should use match_phrase query instead of match query.

My query was:

{
	"query": {
		"match_phrase": {
			"name": "smith john"
		}
	}
}

I found a post that mentioned the "span_near" query. That actually seems to work quite well. I had to first convert the terms to lower case since that is what the analyzer is doing when it stores them. Then I had to use "span_multi" with a prefix query for the final term.

I tried using span_first, but that doesn't work when multiple names are indexed into the same field as an array, because it only finds the first name in the list.

To get a true "starts with" query I had to index a special character at the beginning of the name. In our system we index "smith john" as "^ smith john" so the prefix query will work as expected.

You end up with a query like this:

{
    "query": {
        "span_near": {
            "clauses": [
                {
                    "span_term": {
                        "name": "^"
                    }
                },
                {
                    "span_term": {
                        "name": "smith"
                    }
                },
                {
                    "span_multi": {
                        "match": {
                            "prefix": {
                                "name": "john"
                            }
                        }
                    }
                }
            ],
            "in_order": true
        }
    }
}
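
In case it helps anyone, here is a rough sketch of how a query like the one above could be generated from the user's input. The function name and the marker argument are my own for this example. It only lowercases and splits on whitespace, so it does not reproduce the asciifolding filter, and I set "slop" to 0 explicitly so the terms have to be adjacent.

```python
def starts_with_span_query(field, text, marker="^"):
    """Build a span_near body: marker, exact terms, then a prefix on the last term.

    Span queries are not analyzed, so the terms are lowercased here to
    match what the analyzer stored at index time.
    """
    terms = [marker] + text.lower().split()
    if len(terms) < 2:
        raise ValueError("text must contain at least one term")

    # Every term except the last must match exactly.
    clauses = [{"span_term": {field: term}} for term in terms[:-1]]
    # The last term is wrapped in span_multi so it can be a prefix query.
    clauses.append({"span_multi": {"match": {"prefix": {field: terms[-1]}}}})

    return {
        "query": {
            "span_near": {
                "clauses": clauses,
                "slop": 0,
                "in_order": True,
            }
        }
    }


query = starts_with_span_query("name", "Smith John")
```

This relies on the names having been indexed with the "^" marker in front, as described above.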
