Exact match in a not_analyzed field


(Imran Azad) #1

How would I do an exact match on a not_analyzed field that contains a large amount of text? For example take the following paragraph:

Kefir grains are a combination of lactic acid bacteria and yeasts in a matrix of proteins, lipids, and sugars, and this symbiotic matrix, (or SCOBY) forms "grains" that resemble cauliflower. For this reason, a complex and highly variable community of lactic acid bacteria and yeasts can be found in these grains although some predominate; Lactobacillus species are always present.[3] Even successive batches of kefir may differ due to factors such as the kefir grains rising out of the milk while fermenting, or curds forming around the grains, as well as room temperature.[8]

How can I get a match on an exact phrase search for "lactic acid bacteria and yeasts can be found in these grains" using only query_string?


(Zachary Tong) #2

So, it is technically possible...but I want to discourage you from doing this. It will lead to poor performance in the long run. It's better to restructure your data now and leverage properly tokenized fields.

To make your query work, you need to query for the exact phrase (wrapped in quotes) with wildcards on either side:

GET /test/_search
{
    "query": {
        "query_string": {
           "default_field": "foo",
           "query": "*\"lactic acid bacteria and yeasts can be found in these grains\"*"
        }
    }
}

This is terrible for performance because a not_analyzed string is stored as a single token in the index. To find the phrase, Lucene has to visit every value of that field and do a linear scan across its characters to see if there is a match. This is very slow because it does not leverage the index at all.

In contrast, if this field were analyzed, it would be tokenized and the individual tokens would be stored in the index. A phrase search can then find all documents that contain the required tokens via the index, then run a second phase to check whether those documents have the terms in the correct order. This is much faster.
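To make the difference concrete, here is a minimal Python sketch of the two strategies. It is only a toy model, not how Lucene is actually implemented: `tokenize`, `build_index`, `phrase_search`, and `linear_scan` are hypothetical helper names, and the regex tokenizer is a crude stand-in for a real analyzer.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Crude stand-in for an analyzer: lowercase, split on non-alphanumerics.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    # Positional inverted index: term -> {doc_id -> [positions]}.
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def phrase_search(index, phrase):
    terms = tokenize(phrase)
    if not terms or any(t not in index for t in terms):
        return []
    # Phase 1: use the index to find documents containing every term.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    # Phase 2: keep only documents where the terms appear consecutively.
    hits = []
    for doc_id in sorted(candidates):
        for start in index[terms[0]][doc_id]:
            if all(start + i in index[t][doc_id] for i, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits


def linear_scan(docs, phrase):
    # What a wildcard over a single-token field amounts to:
    # every stored value is scanned character by character.
    return sorted(d for d, text in docs.items() if phrase in text)


docs = {
    1: "lactic acid bacteria and yeasts can be found in these grains",
    2: "yeasts and bacteria can be found in lactic acid grains",
    3: "room temperature affects fermentation",
}
index = build_index(docs)
print(phrase_search(index, "bacteria and yeasts"))
print(linear_scan(docs, "bacteria and yeasts"))
```

Both calls find the same document, but `phrase_search` only touches documents that the index already says contain all three terms, while `linear_scan` reads every value in full.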

So...I'd ask why this field is not_analyzed, and whether you are able to analyze it instead. It would be much better to do this operation with a match_phrase query, for example.
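For reference, a match_phrase request body for the same phrase could look roughly like the sketch below. It assumes the `foo` field from the earlier request has been reindexed as an analyzed field, which this thread does not confirm; the Python here only builds and prints the JSON body, which could then be sent to `GET /test/_search`.

```python
import json

# The phrase from the original question.
phrase = "lactic acid bacteria and yeasts can be found in these grains"

# Assumption: "foo" is now an analyzed (tokenized) field, so the phrase
# is matched token by token instead of with wildcards.
body = {
    "query": {
        "match_phrase": {
            "foo": phrase
        }
    }
}

print(json.dumps(body, indent=4))
```

No quotes or wildcards are needed inside the query text, because match_phrase analyzes the input and checks term order itself.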

