Query_string is not behaving as expected with analyzer (simple)?

ebsaral · October 5, 2017, 7:11am

Hello, I want to be able to search Japanese input as well as English. I don't want to use the kuromoji plugin, because I only need partial search within Japanese input. So please do not suggest kuromoji.

ElasticSearch version: 5.6.1

The problem is that I want to use the simple analyzer for my index, and I think I achieved that with elasticsearch-dsl.

First problem (and also a question):

When I call blue.local:9200/contracts/_settings/, I cannot see simple as being the analyzer in the index settings:

{
    "contracts": {
        "settings": {
            "index": {
                "creation_date": "1507127956748",
                "number_of_shards": "5",
                "number_of_replicas": "1",
                "uuid": "ehfhOJ2OStqS7fd4wLLn1g",
                "version": {
                    "created": "5060199"
                },
                "provided_name": "contracts"
            }
        }
    }
}

I believe this might be normal for generic analyzers. Right?

Then I analyzed the simple analyzer by calling blue.local:9200/_analyze?analyzer=simple&text=地上権 and the result was:

{
    "tokens": [
        {
            "token": "地上権",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 0
        }
    ]
}

When I was using the standard analyzer, every Japanese character was a separate token. Now it is not, and I think this is what I want.

Then, I validated my query by calling:

POST blue.local:9200/contracts/_validate/query?explain
{
    "query": {
        "query_string" : {
            "query" : "name:地上権",
            "analyzer": "simple"
        }
    }
}

And the response was:

{
    "valid": true,
    "_shards": {
        "total": 1,
        "successful": 1,
        "failed": 0
    },
    "explanations": [
        {
            "index": "contracts",
            "valid": true,
            "explanation": "name:地上権"
        }
    ]
}

I guess the explanation here shows that I am on the right path.

BUT, when I do the query:

POST blue.local:9200/contracts/_search/
{
    "query": {
        "query_string" : {
            "query" : "name:地上権",
            "analyzer": "simple"
        }
    }
}

I get zero hits:

{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 0,
        "max_score": null,
        "hits": []
    }
}

I am sure that the data exists.

When I remove the analyzer in the query:

POST blue.local:9200/contracts/_search/?explain
{
    "query": {
        "query_string" : {
            "query" : "name:地上権"
        }
    }
}

I get this:

{
    "took": 12,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 36,
        "max_score": 12.01759,
        "hits": [
            {
                "_shard": "[contracts][0]",
                "_node": "IPjnafPMRyGqMLrABkvudQ",
                "_index": "contracts",
                "_type": "contract_document",
                "_id": "6192",
                "_score": 12.01759,
                "_source": {
                    "client": "My Client",
                    "id": 6192,
                    "name": "地上権"
                },
                "_explanation": {
                    "value": 12.01759,
                    "description": "sum of:",
                    "details": [
                        {
                            "value": 3.6425304,
                            "description": "weight(name:地 in 114) [PerFieldSimilarity], result of:",
                            "details": [
                                ...
                            ]
                        },
                        ...
                    ]
                }
            }
        ]
    }
}

I see that the weight is calculated per character, as if the documents were indexed with the standard analyzer.

What should my next step be? I spent over 6 hours trying to find the issue, without success.

dadoonet · October 5, 2017, 7:30am

But you did not define the simple analyzer on your fields in the mapping, right?

In that case, fields are analyzed with the standard analyzer at index time, and I'm pretty sure no token "地上権" was generated for the name field.
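
To illustrate the mismatch, here is a rough pure-Python approximation of how the two analyzers treat 地上権. This is not Elasticsearch's actual implementation: the function names are hypothetical, and the standard-like tokenizer only models the one behavior that matters here, namely that each CJK ideograph becomes its own token.

```python
import unicodedata


def simple_like_tokens(text):
    """Approximate the simple analyzer: split on non-letters, then lowercase."""
    tokens, current = [], []
    for ch in text:
        if ch.isalpha():
            current.append(ch.lower())
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def standard_like_tokens(text):
    """Approximate the standard analyzer for this case only:
    each CJK ideograph is emitted alone, other letter runs stay together."""
    tokens = []
    for run in simple_like_tokens(text):
        current = []
        for ch in run:
            if unicodedata.name(ch, "").startswith("CJK UNIFIED IDEOGRAPH"):
                if current:
                    tokens.append("".join(current))
                    current = []
                tokens.append(ch)
            else:
                current.append(ch)
        if current:
            tokens.append("".join(current))
    return tokens


# Index time: the name field went through the default (standard) analyzer.
indexed_terms = set(standard_like_tokens("地上権"))

# Query time: query_string was told to use the simple analyzer.
query_terms = simple_like_tokens("地上権")

# A term only matches if exactly that term exists in the index.
matching_terms = [t for t in query_terms if t in indexed_terms]
```

With these assumptions the index holds three single-character terms while the query looks for one three-character term, so nothing matches.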

ebsaral · October 5, 2017, 7:36am

When I make a call to blue.local:9200/contracts/contract_document/6192/_termvectors?fields=name
I see that the terms are individual Japanese characters. So you might be right!

I know this is not strictly an Elasticsearch topic, but I am using elasticsearch-dsl-py with a Django wrapper, and I configured it like this:

contracts = Index('contracts')
my_analyzer = analyzer('simple')

contracts.analyzer(my_analyzer)


@contracts.doc_type
class ContractDocument(DocType):
    client = fields.StringField(attr='client_name')

    class Meta:
        model = Contract

        fields = [
            'id',
            'name'
        ]

So I assumed that when I declare an analyzer for the index, it would apply to all fields. I guess not, right? Does each field need its own analyzer defined?

rockybean · October 5, 2017, 7:43am

GET blue.local:9200/contracts/

Run the above command and paste the response here. I'll try to help you solve the problem. I think the name field is not using the simple analyzer.

If you want the name field to use the simple analyzer, you should set it in the mapping like this.

PUT /contracts
{
  "mappings": {
    "contract_document":{
      "properties": {
        "name":{
          "type":"text",
          "analyzer": "simple"
        }
      }
    }
  }
}
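
If several fields need the same analyzer, a body like the one above could be generated instead of written by hand. This is a minimal Python sketch that only builds the JSON. `build_index_body` is a hypothetical helper, and treating client as a text field is taken from the document shown earlier in the thread. The body would have to be sent when the index is created, since the analyzer of an existing field cannot be changed.

```python
import json


def build_index_body(doc_type, text_fields, analyzer="simple"):
    """Build a create-index body that sets the same analyzer on each text field."""
    properties = {
        field: {"type": "text", "analyzer": analyzer} for field in text_fields
    }
    return {"mappings": {doc_type: {"properties": properties}}}


# Body for: PUT /contracts
body = build_index_body("contract_document", ["name", "client"])
print(json.dumps(body, indent=2, ensure_ascii=False))
```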

ebsaral · October 5, 2017, 8:11am

Hey rocky

Like you said, in the settings I just see the analyzer. I think elasticsearch-dsl only supports field-level analyzers, so I added an analyzer to name:

{
    "contracts": {
        "aliases": {},
        "mappings": {
            "contract_document": {
                "properties": {
                    "client": {
                        "type": "text"
                    },
                    "id": {
                        "type": "integer"
                    },
                    "name": {
                        "type": "text",
                        "analyzer": "simple"
                    }
                }
            }
        },
        "settings": {
            "index": {
                "creation_date": "1507190453923",
                "number_of_shards": "5",
                "number_of_replicas": "1",
                "uuid": "1PgotcwuRxmYdgDRWanpbg",
                "version": {
                    "created": "5060199"
                },
                "provided_name": "contracts"
            }
        }
    }
}

Now, when I search on the name field, it works great. I will open a ticket with elasticsearch-dsl to ask how to apply it to all fields through the wrapper. But thanks for showing me how to do it with the ES REST API.

I read somewhere that when you use query_string on all fields, you cannot apply the analyzer. I checked, and I see that the standard analyzer is applied to _all. Do you have a smart solution for this? I could list every field name in the query, but wouldn't that make it slower?

rockybean · October 5, 2017, 8:39am

Try this:

POST test_max_one/_close

PUT test_max_one/_settings
{
  "analysis": {
    "analyzer": {
      "default": {
        "type": "simple"
      }
    }
  }
}

POST test_max_one/_open

ebsaral · October 5, 2017, 9:01am

I am afraid this didn't work.

dadoonet · October 5, 2017, 9:12am

Yeah. It can't work on existing fields I believe.
Create a new index from scratch.

ebsaral · October 5, 2017, 9:15am

I created the index from scratch, but it still does not work with the _all field in query_string. I guess I will have to list all the fields to search.

Last question:

The explanation of this query:

{
    "query": {
        "query_string" : {
            "fields": ["name", "client"],
            "query" : "地上権設定契約書 AND blabla",
            "analyzer": "simple"
        }
    }
}

Is this:
"explanation": "+(client:地上権設定契約書 | name:地上権設定契約書) +(client:blabla | name:blabla)"

Does this mean that the parenthesized expressions are AND'ed together?

dadoonet · October 5, 2017, 9:29am

_all behavior will change in 6.0. And actually the _all field won't exist anymore.

I have always preferred disabling it and using the copy_to feature instead.

As for the explanation: it means that a document must match both terms.
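
The explanation string can be read as follows: each clause prefixed with + is required, and inside the parentheses a match in any one of the listed fields is enough. Here is a small Python sketch of that logic, with hypothetical names and made-up sample documents. It only models matching, not scoring.

```python
def matches(doc_terms, required_clauses):
    """doc_terms maps a field name to the set of terms indexed for it.
    required_clauses is a list of clauses, each a list of (field, term)
    alternatives. Every clause must match; within a clause, any alternative
    is sufficient."""
    return all(
        any(term in doc_terms.get(field, set()) for field, term in clause)
        for clause in required_clauses
    )


# +(client:地上権設定契約書 | name:地上権設定契約書) +(client:blabla | name:blabla)
clauses = [
    [("client", "地上権設定契約書"), ("name", "地上権設定契約書")],
    [("client", "blabla"), ("name", "blabla")],
]

# Hypothetical documents, already reduced to their indexed terms.
doc_with_both = {"name": {"地上権設定契約書"}, "client": {"blabla"}}
doc_with_one = {"name": {"地上権設定契約書"}, "client": {"my", "client"}}

both_match = matches(doc_with_both, clauses)
one_match = matches(doc_with_one, clauses)
```

The two terms may be found in different fields, but a document that contains only one of them is rejected.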

To be honest, I prefer using a match query instead of query_string.
Or at least simple_query_string.
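
A sketch of what the copy_to approach combined with a match query could look like, written as Python that only builds the request bodies. The helper names and the catch-all field name all_text are hypothetical, and the simple analyzer and the and operator are assumptions based on the rest of this thread.

```python
import json


def build_copy_to_index_body(doc_type, text_fields, target="all_text", analyzer="simple"):
    """Create-index body: disable _all and copy every text field into `target`."""
    properties = {
        field: {"type": "text", "analyzer": analyzer, "copy_to": target}
        for field in text_fields
    }
    # The catch-all field is analyzed with the same analyzer as its sources.
    properties[target] = {"type": "text", "analyzer": analyzer}
    return {
        "mappings": {
            doc_type: {
                "_all": {"enabled": False},
                "properties": properties,
            }
        }
    }


def build_match_query(field, text, operator="and"):
    """Search body: a match query that requires every analyzed term."""
    return {"query": {"match": {field: {"query": text, "operator": operator}}}}


index_body = build_copy_to_index_body("contract_document", ["name", "client"])
search_body = build_match_query("all_text", "地上権設定契約書 blabla")
print(json.dumps(index_body, indent=2, ensure_ascii=False))
print(json.dumps(search_body, indent=2, ensure_ascii=False))
```

Because the match query uses the analyzer of the field it targets, no analyzer has to be passed at query time.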

rockybean · October 5, 2017, 9:46am

My method is to set the default analyzer at the index level. If you set this, there is no need to set an analyzer for every field.

If the field already exists, you need to create a new index and reindex all the documents.
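
As a sketch of those two steps, the following Python builds the request bodies for creating a new index with a default analyzer and for copying documents over with the _reindex API. The helper names and the target index name contracts_v2 are hypothetical. Note that the default analyzer only applies to fields whose mapping does not set an analyzer explicitly.

```python
import json


def build_default_analyzer_body(analyzer_type="simple"):
    """Create-index body that sets the index-level default analyzer."""
    return {
        "settings": {
            "analysis": {"analyzer": {"default": {"type": analyzer_type}}}
        }
    }


def build_reindex_body(source_index, dest_index):
    """Body for POST /_reindex: copy documents from one index to another."""
    return {"source": {"index": source_index}, "dest": {"index": dest_index}}


# Step 1, body for: PUT /contracts_v2
create_body = build_default_analyzer_body()
# Step 2, body for: POST /_reindex
reindex_body = build_reindex_body("contracts", "contracts_v2")
print(json.dumps(create_body, indent=2))
print(json.dumps(reindex_body, indent=2))
```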

ebsaral · October 5, 2017, 11:03am

Thanks for the answer and the information.

ebsaral · October 5, 2017, 11:04am

But it didn't work. I mean query_string didn't find the match after setting the default analyzer and reindexing.

system · November 2, 2017, 11:04am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
