Query string query fuzzy performance

Hi there,

we're trying to optimize performance of our search and have run into an issue we need help with.

We use the query_string query to allow our customers to execute boolean queries on the data. We also allow fuzzy search in this process.
Recently our search performance slowed down considerably as the amount of data increased, so we tried optimizing it.

A bit of background: We have an Applicant Tracking System and our customers are able to search through candidate resumes with boolean queries like "Software Engineer AND Croatia AND...". They can also search candidates by name, for example "John Doe".

What we expose to our customers is a simple boolean query builder where they can write queries like the ones above. We then take those queries and transform them into a fuzzified query for Elastic. In the end, the above query would look like this: "Software~ Engineer~ AND Croatia~ AND...".
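As a rough illustration of that transformation (not our actual implementation), here is a minimal C# sketch. It assumes terms are separated by spaces, that AND, OR and NOT are the only operators, and that the query contains no quoted phrases or parentheses. The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class QueryFuzzifier
{
    // Operators are passed through untouched (assumption: upper-case only).
    private static readonly HashSet<string> Operators =
        new HashSet<string>(StringComparer.Ordinal) { "AND", "OR", "NOT" };

    // Appends the fuzzy operator "~" to every token that is not an operator.
    public static string Fuzzify(string query)
    {
        var tokens = query.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", tokens.Select(t => Operators.Contains(t) ? t : t + "~"));
    }
}
```

Calling QueryFuzzifier.Fuzzify("Software Engineer AND Croatia") yields "Software~ Engineer~ AND Croatia~", which is then sent as the query_string query text.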

At first we considered upgrading our infrastructure tier to a more powerful one, but that is not a long-term solution unless we optimize everything else first.

We've explored the docs and found that the fuzzy_prefix_length property has an impressive impact on search performance, so we decided to use it. This has created an issue when searching for candidates by last name.
Here's the thing: a lot of last names from the Balkan region start with diacritic characters like č, ć, š, ž. Previously, when users searched for a candidate with the last name "Šalković" by entering the query "Salkovic", they would naturally find that person. But now that we've set the prefix length to 2, searching for "Salkovic" doesn't return the result, because the first two characters must match exactly and are excluded from fuzzy matching.
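To make the failure concrete, here is a small C# sketch of the prefix requirement. It is only a model of the behaviour, not Elasticsearch's actual code, and it assumes both terms were already lowercased by the analyzer and are at least as long as the prefix. The names are hypothetical. With no prefix length, "salkovic" and "šalković" are two edits apart, which the default AUTO fuzziness allows for terms longer than five characters, so that is presumably why the search used to match.

```csharp
using System;

public static class FuzzyPrefixSketch
{
    // A term can only be a fuzzy candidate if its first prefixLength
    // characters are identical to those of the query term.
    public static bool SharesPrefix(string queryTerm, string indexedTerm, int prefixLength)
    {
        if (queryTerm.Length < prefixLength || indexedTerm.Length < prefixLength)
        {
            return false; // simplification: short terms are not handled here
        }
        return string.CompareOrdinal(queryTerm, 0, indexedTerm, 0, prefixLength) == 0;
    }
}
```

FuzzyPrefixSketch.SharesPrefix("salkovic", "šalković", 2) is false because "sa" and "ša" differ, so the indexed term is never considered, no matter how small the remaining edit distance is.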

We're currently struggling to find a solution for this that retains the performance gain we've achieved with the prefix (it has reduced the search time by a factor of 15).

Here's the thing: a lot of last names from the Balkan region start with diacritic characters like č, ć, š, ž.

Is there any reason why you are not using the ASCII folding token filter for this case?
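The idea is that folding happens at both index and search time, so the indexed term and the query term end up with the same ASCII prefix. The C# sketch below only approximates the effect by lowercasing and stripping combining marks after Unicode decomposition. The real asciifolding filter covers more characters (for example đ, which does not decompose this way), and the names here are hypothetical.

```csharp
using System.Globalization;
using System.Text;

public static class AsciiFoldSketch
{
    // Lowercases the input, decomposes accented letters into base letter
    // plus combining mark, then drops the combining marks.
    public static string Fold(string input)
    {
        var decomposed = input.ToLowerInvariant().Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);
        foreach (var c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

Both AsciiFoldSketch.Fold("Šalković") and AsciiFoldSketch.Fold("Salkovic") give "salkovic", so a prefix length of 2 no longer excludes the indexed term.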

Hi mayya, thank you for your answer. We have indeed tried that but haven't had any success. After your response I went back and realized that we've only defined the analyzers but haven't actually mapped them to any fields. I've fixed it and now it works as expected. Thank you.

For future reference, this is what we currently have:

.Settings(s => s
    .Analysis(a => a
        // Lowercases values without tokenizing them
        .Normalizers(n => n
            .Custom("lowercase_normalizer", cs => cs
                .Filters(new string[] { "lowercase" }))
        )
        .Analyzers(an => an
            // Index-time analyzer: lowercase, then fold diacritics (š -> s, ć -> c)
            .Custom("custom_ascii_folding_analyzer", cs => cs
                .Tokenizer("standard")
                .Filters(new string[] { "lowercase", "asciifolding" })
            )
            // Search-time analyzer: same as above, but also drops English stop words
            .Custom("custom_stop_analyzer", cs => cs
                .Tokenizer("standard")
                .Filters(new string[] { "lowercase", "english_stop", "asciifolding" })
            )
        )
        .TokenFilters(cf => cf
            .UserDefined("english_stop", new StopTokenFilter() { StopWords = "_english_" })
        )
    )
)

And then for the mappings:

[Text(Analyzer = "custom_ascii_folding_analyzer",
    SearchAnalyzer = "custom_stop_analyzer",
    SearchQuoteAnalyzer = "custom_ascii_folding_analyzer")]
public string FullName { get; set; }