Query_string is not behaving as expected with the simple analyzer?

Hello, I want to be able to search Japanese input as well as English. I don't want to use a plugin, because I just want to do partial searches within Japanese input, so please do not suggest kuromoji.

Elasticsearch version: 5.6.1

The problem is this: I want to use the simple analyzer for my index, and I think I achieved that with elasticsearch-dsl.

First problem (and also a question):

When I call blue.local:9200/contracts/_settings/, I don't see simple listed as the analyzer in the index settings:

{
    "contracts": {
        "settings": {
            "index": {
                "creation_date": "1507127956748",
                "number_of_shards": "5",
                "number_of_replicas": "1",
                "uuid": "ehfhOJ2OStqS7fd4wLLn1g",
                "version": {
                    "created": "5060199"
                },
                "provided_name": "contracts"
            }
        }
    }
}

I believe this might be normal for built-in analyzers, right?
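For comparison, I'd expect a custom analyzer (if I had defined one under analysis) to appear in the settings, roughly like this (a sketch, not my actual output):

```json
{
    "contracts": {
        "settings": {
            "index": {
                "analysis": {
                    "analyzer": {
                        "my_analyzer": {
                            "type": "custom",
                            "tokenizer": "lowercase"
                        }
                    }
                }
            }
        }
    }
}
```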

Then I tested the simple analyzer by calling blue.local:9200/_analyze?analyzer=simple&text=地上権, and the result was:

{
    "tokens": [
        {
            "token": "地上権",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 0
        }
    ]
}

When I was using the standard analyzer, every Japanese character was a separate token. Now it's not, and I think this is what I want.
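As a rough illustration of the difference (plain Python, not the actual Lucene code): the simple analyzer keeps runs of letters together, while the standard analyzer splits CJK text into single-character tokens.

```python
import re

def simple_like(text):
    # Sketch of the simple analyzer: lowercase, then split on anything
    # that is not a letter. CJK ideographs count as letters, so a run
    # like "地上権" survives as a single token.
    return re.findall(r'[^\W\d_]+', text.lower())

def standard_like_cjk(text):
    # Sketch of how the standard analyzer treats CJK ideographs:
    # each one becomes its own token.
    return [ch for ch in text if '\u4e00' <= ch <= '\u9fff']

print(simple_like('地上権'))        # ['地上権']
print(standard_like_cjk('地上権'))  # ['地', '上', '権']
```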

Then, I validated my query by calling:

POST blue.local:9200/contracts/_validate/query?explain
{
    "query": {
        "query_string" : {
            "query" : "name:地上権",
            "analyzer": "simple"
        }
    }
}

And the response was:

{
    "valid": true,
    "_shards": {
        "total": 1,
        "successful": 1,
        "failed": 0
    },
    "explanations": [
        {
            "index": "contracts",
            "valid": true,
            "explanation": "name:地上権"
        }
    ]
}

I guess the explanation here shows that I am on the right path.

BUT, when I do the query:

POST blue.local:9200/contracts/_search/
{
    "query": {
        "query_string" : {
            "query" : "name:地上権",
            "analyzer": "simple"
        }
    }
}

I get zero hits:

{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 0,
        "max_score": null,
        "hits": []
    }
}

I am sure that the data exists.

When I remove the analyzer from the query:

POST blue.local:9200/contracts/_search/?explain
{
    "query": {
        "query_string" : {
            "query" : "name:地上権"
        }
    }
}

I get this:

{
    "took": 12,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 36,
        "max_score": 12.01759,
        "hits": [
            {
                "_shard": "[contracts][0]",
                "_node": "IPjnafPMRyGqMLrABkvudQ",
                "_index": "contracts",
                "_type": "contract_document",
                "_id": "6192",
                "_score": 12.01759,
                "_source": {
                    "client": "My Client",
                    "id": 6192,
                    "name": "地上権"
                },
                "_explanation": {
                    "value": 12.01759,
                    "description": "sum of:",
                    "details": [
                        {
                            "value": 3.6425304,
                            "description": "weight(name:地 in 114) [PerFieldSimilarity], result of:",
                            "details": [
                                ...
                            ]
                        },
                        ...
                    ]
                }
            }
        ]
    }
}

I see that the weight is being calculated on each character, as if they were indexed with the standard analyzer.

What should my next step be? I spent over 6 hours trying to find the issue, but I failed. :frowning:

But you did not define the simple analyzer on your fields in the mapping, right?

At index time, fields are then analyzed using the standard analyzer, and I'm pretty sure no "地上権" token has been generated for the name field.

When I make a call to blue.local:9200/contracts/contract_document/6192/_termvectors?fields=name, I see that the terms are individual Japanese characters. So you might be right!
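If that's true, it would explain the zero hits, since term-level matching is exact. A little Python sketch of the mismatch (the token lists are a simplification, not real ES output):

```python
# What the standard analyzer indexed for name = "地上権": one term per ideograph
indexed_terms = {'地', '上', '権'}

# What the simple analyzer makes of the query string "地上権": one term
query_term = '地上権'

# Exact term matching: the simple-analyzed query term matches nothing
print(query_term in indexed_terms)  # False

# Without the analyzer override, the query is standard-analyzed into three
# single-character terms, each of which matches -- hence the hits earlier
print(all(t in indexed_terms for t in ['地', '上', '権']))  # True
```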

I know this is not strictly an Elasticsearch topic, but I am using elasticsearch-dsl-py with the Django wrapper, and I set things up like this:

contracts = Index('contracts')
my_analyzer = analyzer('simple')

contracts.analyzer(my_analyzer)


@contracts.doc_type
class ContractDocument(DocType):
    client = fields.StringField(attr='client_name')

    class Meta:
        model = Contract

        fields = [
            'id',
            'name'
        ]

So I assumed that when I declare an analyzer for the index, it would apply to all fields. I guess not, right? Should each field have its own analyzer defined?

GET blue.local:9200/contracts/

Run the above command and paste the response here; I'll try to help you solve the problem. I think the name field is not using the simple analyzer.

If you want the name field to use the simple analyzer, you should set it like this.

PUT /contracts
{
  "mappings": {
    "contract_document":{
      "properties": {
        "name":{
          "type":"text",
          "analyzer": "simple"
        }
      }
    }
  }
}

Hey rocky :slight_smile:

As you said, in the settings I just see the analyzer. I think elasticsearch-dsl only supports field-based analyzers, so I added an analyzer to name:

{
    "contracts": {
        "aliases": {},
        "mappings": {
            "contract_document": {
                "properties": {
                    "client": {
                        "type": "text"
                    },
                    "id": {
                        "type": "integer"
                    },
                    "name": {
                        "type": "text",
                        "analyzer": "simple"
                    }
                }
            }
        },
        "settings": {
            "index": {
                "creation_date": "1507190453923",
                "number_of_shards": "5",
                "number_of_replicas": "1",
                "uuid": "1PgotcwuRxmYdgDRWanpbg",
                "version": {
                    "created": "5060199"
                },
                "provided_name": "contracts"
            }
        }
    }
}

Now, when I search on the name field, it works great. I will open a ticket with elasticsearch-dsl to see how to apply it to all fields through the wrapper. But thanks for showing me how to do it with the ES REST API.

I read somewhere that when you use query_string on all fields, you cannot apply the analyzer. I checked, and I see that the standard analyzer is applied to _all. Do you have a smart solution for this? I could list every field name in the query, but doesn't that make it slower?

Try this:

POST test_max_one/_close

PUT test_max_one/_settings
{
  "analysis": {
    "analyzer": {
      "default": {
        "type": "simple"
      }
    }
  }
}

POST test_max_one/_open

I am afraid this didn't work.

Yeah, I believe it can't work on existing fields.
Create a new index from scratch.

I created the index from scratch, but it still does not work with the _all field in query_string. I guess I will have to list all the fields to search.

Last question:

The explanation of this query:

{
    "query": {
        "query_string" : {
            "fields": ["name", "client"],
            "query" : "地上権設定契約書 AND blabla",
            "analyzer": "simple"
        }
    }
}

Is this:
"explanation": "+(client:地上権設定契約書 | name:地上権設定契約書) +(client:blabla | name:blabla)"

Does this mean that the expressions in parentheses are AND'ed together?

The _all behavior will change in 6.0. In fact, the _all field won't exist anymore.

I have always preferred disabling it and using the copy_to feature instead.
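For reference, a copy_to setup might look roughly like this (the all_fields field name is made up for illustration):

```json
PUT /contracts
{
  "mappings": {
    "contract_document": {
      "properties": {
        "name":       { "type": "text", "analyzer": "simple", "copy_to": "all_fields" },
        "client":     { "type": "text", "analyzer": "simple", "copy_to": "all_fields" },
        "all_fields": { "type": "text", "analyzer": "simple" }
      }
    }
  }
}
```

You can then search all_fields alone instead of relying on _all.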

It means the document must match both terms.

TBH, I prefer using a match query instead of query_string,
or at least simple_query_string.
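For example, the same search expressed as a bool query over multi_match clauses (a sketch equivalent to the query_string above; match queries pick up each field's own analyzer automatically):

```json
POST blue.local:9200/contracts/_search
{
  "query": {
    "bool": {
      "must": [
        { "multi_match": { "query": "地上権設定契約書", "fields": ["name", "client"] } },
        { "multi_match": { "query": "blabla", "fields": ["name", "client"] } }
      ]
    }
  }
}
```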

My method is to set the default analyzer at the index level. If you set this, there is no need to set an analyzer for every field.

If the field already exists, you need to create a new index and reindex all the documents.
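For example, creating the index with the default analyzer defined from the start (a sketch):

```json
PUT /contracts
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "simple"
        }
      }
    }
  }
}
```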

Thanks for the answer and the information.

But it didn't work: query_string didn't find the match even after setting the default analyzer and reindexing.
