Stemming not performed


(R_C) #1

Hi,
I am using Elasticsearch v5.4.3, and it seems that stemming is not being applied.
</> curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": ["standard", "lowercase", "my_stemmer"]
        }
      },
      "filter": {
        "my_stemmer": {
          "type": "stemmer",
          "name": "english"
        }
      }
    }
  }
}
'
</>
{"acknowledged":true,"shards_acknowledged":true,"index":"my_index"}

Now insert data

</> curl -XPUT "localhost:9200/my_index/user/1" -d '{ "text": "Qbox Elasticsearch Hosting is not at all difficult" }'
{"_index":"my_index","_type":"user","_id":"1","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"created":true}
curl -XPUT "localhost:9200/my_index/user/2" -d '{ "name": "Unconventional Elasticsearch Hosting is difficult" }'
{"_index":"my_index","_type":"user","_id":"2","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"created":true}
</>

====================================
Perform search query
</>
curl -XGET 'localhost:9200/analysis1/user/_search' -d '{
  "query": {
    "multi_match": {
      "type": "most_fields",
      "query": "not difficult Qbox host",
      "fields": [ "text", "text.english" ]
    }
  }
}'
</>

Output received:
{"took":4,"timed_out":false,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0},"hits":{"total":1,"max_score":0.84748024,"hits":[{"_index":"analysis1","_type":"user","_id":"1","_score":0.84748024,"_source":{ "text": "Qbox Elasticsearch Hosting is not at all difficult" }}]}}

Expected output:
Both documents should be returned as results because of stemming.


(David Pilato) #2

Please format your code, logs or configuration files using the </> icon as explained in this guide, and not the citation button. It will make your post more readable.

Or use markdown style like:

```
CODE
```

There's a live preview panel for exactly this reason.

Lots of people read these forums, and many of them will simply skip over a post that is difficult to read, because it's just too large an investment of their time to try and follow a wall of badly formatted text.
If your goal is to get an answer to your questions, it's in your interest to make it as easy to read and understand as possible.
Please update your post.


(David Pilato) #3

Have a look at the _analyze API to understand how your text is processed.
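
For example, a minimal sketch in Kibana console syntax, assuming the `my_index` index and the `my_analyzer` analyzer from the first post (the sample text is only an illustration):

```
# Assumes my_index and my_analyzer exist as created in post #1.
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Elasticsearch Hosting is difficult"
}

# Same text, but analyzed with whatever analyzer is mapped to the "text" field.
GET my_index/_analyze
{
  "field": "text",
  "text": "Elasticsearch Hosting is difficult"
}
```

The first request shows the tokens produced by the custom analyzer; the second shows the tokens produced for the `text` field as it is currently mapped. If the two token lists differ, the custom analyzer is probably not attached to the field you are searching.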


(Thomas Dasch) #4

Roshni,

I'm going to give this a try; hopefully it is helpful!

Looking at your first PUT command, it looks to me like you are trying to set up an analyzer with the standard tokenizer and an english stemmer. In your next PUT commands, you're inserting data into a field called text and a field called name. In your GET multi_match query you are using the most_fields type.

To figure things out, I set up my own index, my_index, and created a mapping for name. The main field is of type text, which uses the standard analyzer (the ES default for the text type), and it has a sub-field, name.english, of type text with the english analyzer (which performs stemming).

> PUT my_index
> {
>   "mappings": {
>     "book": {
>       "properties": {
>         "name": {
>           "type": "text",
>           "fields": {
>             "english": {
>               "type": "text",
>               "analyzer": "english"
>             }
>           }
>         }
>       }
>     }
>   }
> }

I then added two documents, both using the name field (you had added one to text and one to name).

> PUT my_index/book/1
> {
>   "name": "Qbox Elasticsearch Hosting is not at all difficult"
> }
> 
> PUT my_index/book/2
> {
>   "name": "Unconventional Elasticsearch Hosting is difficult" 
> }

I then ran my multi_match query using the most_fields type. The fields queried are name and name.english (you were only searching the text and text.english fields).

> GET my_index/_search
> {
>   "query": {
>     "multi_match": {
>       "query": "not difficult Qbox host",
>       "type": "most_fields", 
>       "fields": ["name", "name.english"]
>     }
>   }
> }

Which returns:

> "hits": {
>     "total": 2,
>     "max_score": 1.7260926,
>     "hits": [
>       {
>         "_index": "my_index",
>         "_type": "book",
>         "_id": "1",
>         "_score": 1.7260926,
>         "_source": {
>           "name": "Qbox Elasticsearch Hosting is not at all difficult"
>         }
>       },
>       {
>         "_index": "my_index",
>         "_type": "book",
>         "_id": "2",
>         "_score": 0.8630463,
>         "_source": {
>           "name": "Unconventional Elasticsearch Hosting is difficult"
>         }
>       }
>     ]
> }

most_fields queries both the main field, name, which by default uses the standard analyzer, and the sub-field, name.english, which uses the english analyzer. My query matches both documents on the stemmed word host because of name.english, and "Qbox Elasticsearch Hosting is not at all difficult" has a higher _score because it also matches "not" and "Qbox".
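
If you would rather keep your own my_analyzer than switch to the built-in english analyzer, a rough, untested sketch would be to map it onto a sub-field when creating the index. This assumes the settings, the user type and the text field from your first post; the "mappings" section and the sub-field name english are my additions:

```
# Sketch only: "settings" is copied from post #1, "mappings" is the assumed addition.
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": ["standard", "lowercase", "my_stemmer"]
        }
      },
      "filter": {
        "my_stemmer": {
          "type": "stemmer",
          "name": "english"
        }
      }
    }
  },
  "mappings": {
    "user": {
      "properties": {
        "text": {
          "type": "text",
          "fields": {
            "english": {
              "type": "text",
              "analyzer": "my_analyzer"
            }
          }
        }
      }
    }
  }
}
```

Since my_index already exists, you would have to delete and recreate it (or use a new index name). Both documents would then need to go into the text field, and the search would need to target my_index rather than analysis1, for text.english to match on the stemmed tokens.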

Hope this helps!

