Problem combining "copy_to", "analyzer" and wildcard "query string" query

cherry-wave · July 27, 2018, 2:12pm

I have a bit of a weird problem here:

I use ES 5.6.4.
I copy some fields via the "copy_to" mapping into a field named "search".
I added a German analyzer to the field mapping of the "search" field.
I use a query string query like this: *Müll*
I have an entry "Herbert Müller".

Now something weird happens: if I search for "*Müll*" the entry is found, but if I search for "*Mülle*" or "*Müller*" it is not.

It seems to break on the last two letters, because the same happens when I search in the middle of the name: "*üll*" works, but "*ülle*" does not.

On the other hand, I can type in as many letters of the word "Herbert" (which has no umlauts) as I like and it works.

If I search for "Müller" without wildcards, the entry is found too.

Also, if I run this search not on the "copy_to" field but on the original field (and add the German analyzer there directly), it finds "Müller" but not e.g. "ller". And sadly "rber" isn't found then either (which at least worked with the "copy_to" field).

So, am I doing something wrong here? Is this the normal behaviour and I just don't understand it? Did I find a bug? Is this fixed in ES 6.x?

Greetings

cherry-wave · July 27, 2018, 2:33pm

UPDATE: I tested it a lot more, and it seems that umlauts are not the problem here. Is it more about the length of the word? With "Natascha" everything is fine and "Natasch" is found, but with "Holger" it won't find "Holge".

dadoonet · July 27, 2018, 3:21pm

Could you provide a full recreation script as described in About the Elasticsearch category? It will help to better understand what you are doing. Please try to keep the example as simple as possible.

A full reproduction script will help readers to understand, reproduce and if needed fix your problem. It will also most likely help to get a faster answer.

cherry-wave · July 27, 2018, 5:54pm

Hey,

I have done some research by now and can provide a good example of what the problem is. I thought I had found the pattern for when a search works and when it doesn't... but I hadn't. My Java test with some examples:

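// Imports assumed for the ES 5.6 transport client and JUnit 4
import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
import static org.elasticsearch.index.query.QueryBuilders.queryStringQuery;
import static org.junit.Assert.assertEquals;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutionException;

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.support.WriteRequest;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.junit.Test;

@Test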
public void testCopyTo() throws ExecutionException, InterruptedException, IOException {
    XContentBuilder settings = jsonBuilder()
        .startObject()
            .startObject("analysis")
                .startObject("filter")
                    .startObject("german_stop")
                        .field("type", "stop")
                        .field("stopwords", "_german_")
                    .endObject()
                    .startObject("german_keywords")
                        .field("type", "keyword_marker")
                        // "keywords" expects a list of words, not a JSON string
                        .array("keywords", "der")
                    .endObject()
                    .startObject("german_stemmer")
                        .field("type", "stemmer")
                        // german, german2, light_german, minimal_german
                        .field("language", "light_german")
                    .endObject()
                .endObject()
                .startObject("analyzer")
                    .startObject("my_analyzer")
                        .field("tokenizer", "ngram")
                        .array("filter", "lowercase", "german_stop", "german_keywords", "german_normalization", "german_stemmer")
                    .endObject()
                .endObject()
            .endObject()
        .endObject();

    XContentBuilder mapping = jsonBuilder()
        .startObject()
            .startObject("properties")
                .startObject("name")
                    .field("type","text")
                    .field("copy_to","search")
                .endObject()
                .startObject("search")
                    .field("type","text")
                    .field("analyzer","german")
                    .startObject("fields")
                        .startObject("keyword")
                            .field("type","keyword")
                            .field("ignore_above",256)
                        .endObject()
                    .endObject()
                .endObject()
            .endObject()
        .endObject();

    elasticClient.admin().indices().prepareCreate("test_index")
        // Doesn't seem to work combined with a *...* search
//            .setSettings(settings)
        .addMapping("test_type", mapping).execute().get();

    Map<String, Object> source = new HashMap<>();

    // 3 characters, everything is ok
    String name = "Güe"; // =)
//        String name = "nne"; // =)
    // 4 characters: last character can't be 'e'
//        String name = "Güet"; // =)
//        String name = "nnen"; // =)
//        String name = "Güne"; // =(
//        String name = "nnne"; // =(
    // 5 characters: very different results...
//        String name = "Güneh"; // =)
//        String name = "nnnen"; // =(
//        String name = "Günte"; // =(
//        String name = "nnnne"; // =(
    // 6 characters: last two characters can't be 'e'
//        String name = "Günehr"; // =)
//        String name = "nnnenn"; // =)
//        String name = "Günter"; // =(
//        String name = "nnnnen"; // =(
    // 7 characters: same
//        String name = "Günteir"; // =)
//        String name = "nnnnenn"; // =)
//        String name = "Günther"; // =(
//        String name = "nnnnnen"; // =(
    // 8 characters: magically, the next to last character can be 'e' again, but "Günthrel" != "nnnnnnen"???
//        String name = "Günthrel"; // =)
//        String name = "nnnnnnen"; // =(
//        String name = "Günthrle"; // =(
//        String name = "nnnnnnne"; // =(

    source.put("name", name);

    elasticClient.prepareIndex("test_index", "test_type", "1").setSource(source).setRefreshPolicy(WriteRequest.RefreshPolicy.IMMEDIATE).execute().get();

    // Everything is OK
//        String searchString = name;
    // Not entering the last two characters even works with the "=(" commented names
//        String searchString = "*" + name.substring(0, name.length() - 2) + "*";
    // Both fail with "=(" commented names
//        String searchString = "*" + name.substring(0, name.length() - 1) + "*";
    String searchString = "*" + name + "*";
    SearchResponse response = elasticClient.prepareSearch("test_index").setQuery(queryStringQuery(searchString).defaultField("search")).execute().get();

    assertEquals(1, response.getHits().getTotalHits());
}

As I wrote in the code-comments, I thought I found out that it is a combination of the length of a word, the letter "e", but also surrounding your search string with "*" and how many characters you type into the search. But that seems to be not 100% correct. For example "Güneh" (5 letters) can be found but "nnnen" (also 5 letters) can't. It is really, really confusing.

I tried using an "ngram"-tokenizer but that seems to make it even worse when using wildcards.

You can try this out by yourself and see if the names commented with "=)" will work, and the names with "=(" won't. Or you can try out a different searchString and see if the result fits my comments.

dadoonet · July 27, 2018, 6:09pm

Please do it as a Kibana dev console script as shown in the example I linked.

cherry-wave · July 30, 2018, 7:18am

Here is the whole thing as a Kibana dev console script:

DELETE test_index

# The ngram-tokenizer doesn't seem to work combined with a *...* search
PUT test_index
{
  "settings": {
    "analysis": {
      "filter": {
        "german_stop": {
          "type": "stop",
          "stopwords": "_german_"
        },
        "german_keywords": {
          "type": "keyword_marker",
          "keywords": [
            "Beispiel"
          ]
        },
        "german_stemmer": {
          "type": "stemmer",
          "language": "light_german"
        }
      },
      "analyzer": {
        "ngram_german": {
          "tokenizer": "ngram",
          "filter": [
            "lowercase",
            "german_stop",
            "german_keywords",
            "german_normalization",
            "german_stemmer"
          ]
        }
      }
    }
  },
  "mappings": {
    "test_type": {
      "properties": {
        "name": {
          "type": "text",
          "copy_to": "search"
        },
        "search": {
          "type": "text",
          "analyzer": "german",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}

## 3 characters, everything is ok
# Güe =)
# nne =)
## 4 characters: last character can't be 'e'
# Güet =)
# nnen =)
# Güne =(
# nnne =(
## 5 characters: very different results...
# Güneh =)
# nnnen =(
# Günte =(
# nnnne =(
## 6 characters: last two characters can't be 'e'
# Günehr =)
# nnnenn =)
# Günter =(
# nnnnen =(
## 7 characters: same
# Günteir =)
# nnnnenn =)
# Günther =(
# nnnnnen =(
## 8 characters: magically, the next to last character can be 'e' again, but "Günthrel" != "nnnnnnen"???
# Günthrel =)
# nnnnnnen =(
# Günthrle =(
# nnnnnnne =(
DELETE test_index/test_type/1
  
PUT test_index/test_type/1
{
    "name" : "Güe"
}

# Using no wildcards works with all examples
# Not entering the last two characters even works with the "=(" commented names
# Not entering the last character or entering the full name fails
GET /test_index/_search
{
    "query": {
        "query_string" : {
            "default_field" : "search",
            "query" : "*Güe*"
        }
    }
}

dadoonet · July 31, 2018, 8:44pm

To understand what happens to your text at index time, you can check out the _analyze API.

Note that the same process happens by default at search time on your search query.
But it is not applied to wildcard terms. See analyze_wildcard on this page: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html

And read that for more details: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#_wildcards
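
As a rough illustration of the index-time step mentioned above, here is a small Java sketch of what a German analyzer chain might do to the names in this thread. It is not the Lucene implementation: ToyGermanAnalysis, normalize and stripSuffix are hypothetical names, and the suffix rules are a simplified approximation of a light German stemmer (stop words and keyword markers are left out entirely). The _analyze API remains the authority on the real tokens.

```java
import java.util.Locale;

public class ToyGermanAnalysis {

    // Lowercase and fold umlauts, roughly what "lowercase" plus
    // "german_normalization" do (simplified assumption).
    static String normalize(String text) {
        return text.toLowerCase(Locale.ROOT)
                .replace('\u00e4', 'a')   // a-umlaut
                .replace('\u00f6', 'o')   // o-umlaut
                .replace('\u00fc', 'u')   // u-umlaut
                .replace("\u00df", "ss"); // sharp s
    }

    // Simplified suffix stripping (assumption, not the real stemmer):
    // drop "em"/"en"/"er"/"es" from tokens longer than 4 characters,
    // otherwise drop a trailing "e" from tokens longer than 3 characters.
    static String stripSuffix(String token) {
        int len = token.length();
        if (len > 4 && token.charAt(len - 2) == 'e'
                && "mnrs".indexOf(token.charAt(len - 1)) >= 0) {
            return token.substring(0, len - 2);
        }
        if (len > 3 && token.charAt(len - 1) == 'e') {
            return token.substring(0, len - 1);
        }
        return token;
    }

    // The token that would end up in the index under these toy rules.
    static String analyze(String text) {
        return stripSuffix(normalize(text));
    }

    public static void main(String[] args) {
        String[] names = {"M\u00fcller", "Holger", "Natascha", "Herbert", "nnnen"};
        for (String name : names) {
            System.out.println(name + " -> " + analyze(name));
        }
    }
}
```

Under these toy rules "Müller" becomes mull and "Holger" becomes holg, while "Natascha" and "Herbert" stay whole, which would line up with the observations earlier in the thread.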

GET /test_index/_search
{
    "query": {
        "query_string" : {
            "default_field" : "search",
            "query" : "*Güe*",
            "analyze_wildcard": true
        }
    }
}

Might work.
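
To make the mismatch concrete, here is a minimal Java sketch of a wildcard pattern being compared with a single indexed term. It assumes the analyzer stored the token mull for "Müller" (worth confirming with _analyze) and that the wildcard text was lowercased and umlaut-folded but not stemmed. WildcardSketch and matches are hypothetical helpers, not Elasticsearch APIs.

```java
import java.util.regex.Pattern;

public class WildcardSketch {

    // Translate a simple wildcard pattern ('*' = any run of characters,
    // '?' = exactly one character) into a regex and test it against one term.
    static boolean matches(String wildcard, String indexedTerm) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.matches(regex.toString(), indexedTerm);
    }

    public static void main(String[] args) {
        String indexedTerm = "mull"; // assumed stemmed token for "Mueller"
        String[] queries = {"*mull*", "*ull*", "*mulle*", "*muller*"};
        for (String query : queries) {
            System.out.println(query + " matches " + indexedTerm + ": "
                    + matches(query, indexedTerm));
        }
    }
}
```

The first two patterns match the stored term and the last two do not, because the characters "er" were never indexed. With analyze_wildcard the wildcard text goes through analysis as well, which is why the query above has a chance of working.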

But it's absolutely not recommended to use wildcards, especially ones starting with *.
Better to use ngrams IMO.
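
The idea behind ngrams can be sketched in plain Java: every substring within a chosen length range is produced when the name is indexed, so a partial search term becomes an exact term lookup instead of a wildcard scan. NgramSketch and the gram range of 2 to 4 are assumptions picked for illustration; the real ngram tokenizer is configured through min_gram and max_gram.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

public class NgramSketch {

    // Produce all lowercased substrings whose length lies in [minGram, maxGram].
    static Set<String> ngrams(String text, int minGram, int maxGram) {
        String lower = text.toLowerCase(Locale.ROOT);
        Set<String> grams = new LinkedHashSet<>();
        for (int start = 0; start < lower.length(); start++) {
            for (int len = minGram;
                 len <= maxGram && start + len <= lower.length();
                 len++) {
                grams.add(lower.substring(start, start + len));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        Set<String> indexed = ngrams("Holger", 2, 4);
        System.out.println(indexed);
        // A partial term inside the gram range is found by exact lookup.
        System.out.println("olge indexed: " + indexed.contains("olge"));
        // A term longer than maxGram is not stored as a single gram.
        System.out.println("holger indexed: " + indexed.contains("holger"));
    }
}
```

Note that no stemming happens here. Running a stemmer over individual grams, as the ngram_german analyzer earlier in the thread does, would probably alter the grams themselves, which may be part of why that attempt behaved oddly.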

cherry-wave · August 2, 2018, 7:05am

Thank you very much. I am using ngrams now, although it is very tough to provide ALL possible language analyzers for my product ^^

