Problem combining "copy_to", "analyzer" and wildcard "query string" query


#1

I have a bit of a weird problem here:

  • I use es 5.6.4
  • I copy some fields via "copy_to" mapping into a field named "search"
  • I added a german analyzer to the field mapping of the search field
  • I use a wildcard query string query like this: *Müll*
  • I have an entry "Herbert Müller"

Now something weird happens: If I search for "Müll" the entry is found. But if I search for "Mülle" or "Müller" the entry is not found.

It seems to not work for the last two letters, because the same happens if I search in the middle of the name: "üll" works, but again "ülle" doesn't.

On the other hand, I can type in as many letters of the word "Herbert" (which has no umlauts in it) as I like and it works.

If I search for "Müller" without wildcards, the entry is found too.

Also, if I do this search not on the "copy_to" field but on the original field (adding the german analyzer there directly), it finds "Müller" but not e.g. "ller". And sadly "rber" isn't found either (which at least worked with the "copy_to" field).

Sooo, am I doing something wrong here? Is this the normal behaviour and I just don't understand it? Did I find a bug? Is this fixed in ES 6.x?

Greetings :wink:


#2

UPDATE: I tested it a lot again and it seems having umlauts or not is not the problem here. More like the length of a word? With "Natascha" everything is fine, "Natasch" is found, but with "Holger" it won't find "Holge".


(David Pilato) #3

Could you provide a full recreation script as described in "About the Elasticsearch category"? It will help to better understand what you are doing. Please try to keep the example as simple as possible.

A full reproduction script will help readers to understand, reproduce and, if needed, fix your problem. It will also most likely help you get a faster answer.


#5

Hey,

I have done some research by now and can provide a good example of what the problem is. I thought I had found the pattern for when a search works and when it doesn't... but I hadn't. My Java test with some examples:

// Assumed static imports (Elasticsearch 5.6.x transport client):
// import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
// import static org.elasticsearch.index.query.QueryBuilders.queryStringQuery;
@Test
public void testCopyTo() throws ExecutionException, InterruptedException, IOException {
    XContentBuilder settings = jsonBuilder()
        .startObject()
            .startObject("analysis")
                .startObject("filter")
                    .startObject("german_stop")
                        .field("type", "stop")
                        .field("stopwords", "_german_")
                    .endObject()
                    .startObject("german_keywords")
                        .field("type", "keyword_marker")
                        .array("keywords", "der")
                    .endObject()
                    .startObject("german_stemmer")
                        .field("type", "stemmer")
                        // german, german2, light_german, minimal_german
                        .field("language", "light_german")
                    .endObject()
                .endObject()
                .startObject("analyzer")
                    .startObject("my_analyzer")
                        .field("tokenizer", "ngram")
                        .array("filter", "lowercase", "german_stop", "german_keywords", "german_normalization", "german_stemmer")
                    .endObject()
                .endObject()
            .endObject()
        .endObject();

    XContentBuilder mapping = jsonBuilder()
        .startObject()
            .startObject("properties")
                .startObject("name")
                    .field("type","text")
                    .field("copy_to","search")
                .endObject()
                .startObject("search")
                    .field("type","text")
                    .field("analyzer","german")
                    .startObject("fields")
                        .startObject("keyword")
                            .field("type","keyword")
                            .field("ignore_above",256)
                        .endObject()
                    .endObject()
                .endObject()
            .endObject()
        .endObject();

    elasticClient.admin().indices().prepareCreate("test_index")
        // Doesn't seem to work combined with a *...* search
//            .setSettings(settings)
        .addMapping("test_type", mapping).execute().get();

    Map<String, Object> source = new HashMap<>();

    // 3 characters, everything is ok
    String name = "Güe"; // =)
//        String name = "nne"; // =)
    // 4 characters: last character can't be 'e'
//        String name = "Güet"; // =)
//        String name = "nnen"; // =)
//        String name = "Güne"; // =(
//        String name = "nnne"; // =(
    // 5 characters: very different results...
//        String name = "Güneh"; // =)
//        String name = "nnnen"; // =(
//        String name = "Günte"; // =(
//        String name = "nnnne"; // =(
    // 6 characters: last two characters can't be 'e'
//        String name = "Günehr"; // =)
//        String name = "nnnenn"; // =)
//        String name = "Günter"; // =(
//        String name = "nnnnen"; // =(
    // 7 characters: same
//        String name = "Günteir"; // =)
//        String name = "nnnnenn"; // =)
//        String name = "Günther"; // =(
//        String name = "nnnnnen"; // =(
    // 8 characters: magically, the next to last character can be 'e' again, but "Günthrel" != "nnnnnnen"???
//        String name = "Günthrel"; // =)
//        String name = "nnnnnnen"; // =(
//        String name = "Günthrle"; // =(
//        String name = "nnnnnnne"; // =(

    source.put("name", name);

    elasticClient.prepareIndex("test_index", "test_type", "1").setSource(source).setRefreshPolicy(WriteRequest.RefreshPolicy.IMMEDIATE).execute().get();

    // Everything is OK
//        String searchString = name;
    // Not entering the last two characters even works with the "=(" commented names
//        String searchString = "*" + name.substring(0, name.length() - 2) + "*";
    // Both fail with "=(" commented names
//        String searchString = "*" + name.substring(0, name.length() - 1) + "*";
    String searchString = "*" + name + "*";
    SearchResponse response = elasticClient.prepareSearch("test_index").setQuery(queryStringQuery(searchString).defaultField("search")).execute().get();

    assertEquals(1, response.getHits().getTotalHits());
}

As I wrote in the code comments, I thought I had found out that it is a combination of the length of the word, the letter "e", surrounding the search string with "*", and how many characters you type into the search. But that doesn't seem to be 100% correct. For example "Güneh" (5 letters) can be found but "nnnen" (also 5 letters) can't. It is really, really confusing.

I tried using an "ngram" tokenizer, but that seems to make it even worse when combined with wildcards.

You can try this out yourself and see that the names commented with "=)" work and the names with "=(" don't. Or you can try a different searchString and see if the result matches my comments.


(David Pilato) #6

Please do it as a Kibana dev console script as shown in the example I linked.


#7

Here is the whole thing as a kibana dev console script:

DELETE test_index

# The ngram-tokenizer doesn't seem to work combined with a *...* search
PUT test_index
{
  "settings": {
    "analysis": {
      "filter": {
        "german_stop": {
          "type": "stop",
          "stopwords": "_german_"
        },
        "german_keywords": {
          "type": "keyword_marker",
          "keywords": [
            "Beispiel"
          ]
        },
        "german_stemmer": {
          "type": "stemmer",
          "language": "light_german"
        }
      },
      "analyzer": {
        "ngram_german": {
          "tokenizer": "ngram",
          "filter": [
            "lowercase",
            "german_stop",
            "german_keywords",
            "german_normalization",
            "german_stemmer"
          ]
        }
      }
    }
  },
  "mappings": {
    "test_type": {
      "properties": {
        "name": {
          "type": "text",
          "copy_to": "search"
        },
        "search": {
          "type": "text",
          "analyzer": "german",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}

## 3 characters, everything is ok
# Güe =)
# nne =)
## 4 characters: last character can't be 'e'
# Güet =)
# nnen =)
# Güne =(
# nnne =(
## 5 characters: very different results...
# Güneh =)
# nnnen =(
# Günte =(
# nnnne =(
## 6 characters: last two characters can't be 'e'
# Günehr =)
# nnnenn =)
# Günter =(
# nnnnen =(
## 7 characters: same
# Günteir =)
# nnnnenn =)
# Günther =(
# nnnnnen =(
## 8 characters: magically, the next to last character can be 'e' again, but "Günthrel" != "nnnnnnen"???
# Günthrel =)
# nnnnnnen =(
# Günthrle =(
# nnnnnnne =(
DELETE test_index/test_type/1
  
PUT test_index/test_type/1
{
    "name" : "Güe"
}

# Using no wildcards works with all examples
# Not entering the last two characters even works with the "=(" commented names
# Not entering the last character or entering the full name fails
GET /test_index/_search
{
    "query": {
        "query_string" : {
            "default_field" : "search",
            "query" : "*Güe*"
        }
    }
}

(David Pilato) #8

To understand things, you can use the _analyze API to see what is happening to your text at index time.

Note that the same analysis happens by default at search time on your search query.
But it does not apply when using wildcards. See analyze_wildcard on this page: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html

And read that for more details: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#_wildcards
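For example, you can run your text through the same analyzer that the "search" field uses (a sketch; the exact tokens depend on the stemmer version):

```
GET test_index/_analyze
{
  "analyzer": "german",
  "text": "Günther"
}
```

The german analyzer folds umlauts (ü becomes u) and its stemmer typically strips endings such as -e, -en or -er, so the indexed token no longer contains the literal character sequence that an unanalyzed wildcard like *Günther* is looking for. That would also explain why exactly the trailing-'e' cases failed.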

GET /test_index/_search
{
    "query": {
        "query_string" : {
            "default_field" : "search",
            "query" : "*Güe*",
            "analyze_wildcard": true
        }
    }
}

Might work.

But it's absolutely not recommended to use wildcards, especially leading ones (queries starting with *).
Better to use ngrams IMO.
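A possible sketch of the ngram approach (the analyzer and tokenizer names, gram sizes and token_chars here are my assumptions, not taken from the thread): index the "search" field with an ngram analyzer and query it with a plain match query instead of wildcards. Note there is no stemmer in the ngram chain, since stemming 3-character grams makes little sense:

```
PUT test_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_ngram": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "ngram_german": {
          "tokenizer": "my_ngram",
          "filter": ["lowercase", "german_normalization"]
        }
      }
    }
  },
  "mappings": {
    "test_type": {
      "properties": {
        "name": { "type": "text", "copy_to": "search" },
        "search": { "type": "text", "analyzer": "ngram_german" }
      }
    }
  }
}

PUT test_index/test_type/1
{ "name": "Herbert Müller" }

GET test_index/_search
{
  "query": {
    "match": {
      "search": { "query": "ülle", "operator": "and" }
    }
  }
}
```

With "operator": "and", the query "ülle" is normalized to "ulle" and split into the grams "ull" and "lle", which both match because the same grams were indexed for "Müller".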


#9

Thank you very much. I am using ngrams now. Although it is very tough to provide ALL possible language analyzers for my product^^
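One common pattern for the multi-language case (a sketch with made-up index and sub-field names): keep one language-neutral field and add one sub-field per language via fields, then query them all at once:

```
PUT multilang_index
{
  "mappings": {
    "test_type": {
      "properties": {
        "search": {
          "type": "text",
          "analyzer": "standard",
          "fields": {
            "de": { "type": "text", "analyzer": "german" },
            "en": { "type": "text", "analyzer": "english" }
          }
        }
      }
    }
  }
}

GET multilang_index/_search
{
  "query": {
    "multi_match": {
      "query": "Müller",
      "fields": ["search", "search.de", "search.en"]
    }
  }
}
```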


(system) closed #10

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.