Language analysers Behaviour in ES

George_Aba · October 24, 2016, 2:56pm

Hi all,

I recently started trying out ES and I have to say it looks amazing. To better understand how it works, and since I want to use it with the Greek language, I set up a new mapping for an index where I want to store the title and content fields.

PUT /test
{
"settings": {
"analysis": {
"filter": {
"greek_stop": {
"type": "stop",
"stopwords": "greek"
},
"greek_lowercase": {
"type": "lowercase",
"language": "greek"
},
"greek_stemmer": {
"type": "stemmer",
"language": "greek"
}
},
"analyzer": {
"greek": {
"tokenizer": "standard",
"filter": [
"greek_lowercase",
"greek_stop",
"greek_stemmer"
]
}
}
}
},
"mappings": {
"article": {
"properties": {
"title": {
"type": "string",
"fields": {
"greek": {
"type": "string",
"analyzer": "greek"
}
}
},
"content": {
"type": "string",
"fields": {
"greek": {
"type": "string",
"analyzer": "greek"
}
}
}
}
}
}
}

After carefully reading the ES documentation, I decided to index title and content using two analysers. As the ES documentation explains:

However, if we have two documents, one of which contains jumped and the other jumping, the user would probably expect the first document to rank higher, as it contains exactly what was typed in.

Then I bulk inserted a couple of dummy entries in order to test my mapping:

{"index": {"_id":1}}
{"title": "Τρεις λαγοί μα τι λαγοί","content": "Παραμύθι με λαγούς με πετραχείλια","category": "Kids"}
{"index": {"_id":2}}
{"title": "Ο λαγός και η χελώνα","content": "Παραμύθι για την ιστορία του λαγού με την χελώνα","category": "Kids"}
{"index": {"_id":3}}
{"title": "Οι χελώνες εξαφανίστηκαν","content": "Οι λόγοι που οι χελώνες δεν βρίσκονται τόσο συχνά είναι πολλοί.","category": "Documentary"}
{"index": {"_id":4}}
{"title": "Ποια χελώνα ζει περισσότερο","content": "Ένας λόγος που σαν ζώο η χελώνα ζει περισσότερο βρίσκεται στο DNA των χελώνων","category": "Article"}

Then I executed the query below:

{
"query":{
"dis_max": {
"queries": [
{
"multi_match":
{
"query": "λαγοί",
"type":"most_fields",
"fields": [ "title.greek","content.greek","content","title"]
}}
]
}
}
}

This gave me the following results:
"hits": {
"total": 2,
"max_score": 0.20227146,
"hits": [
{
"_index": "hub",
"_type": "article",
"_id": "2",
"_score": 0.20227146,
"_source": {
"title": "Ο λαγός και η χελώνα",
"content": "Παραμύθι για την ιστορία του λαγού με την χελώνα",
"category": "Kids"
}
},
{
"_index": "hub",
"_type": "article",
"_id": "1",
"_score": 0.11385604,
"_source": {
"title": "Τρεις λαγοί μα τι λαγοί",
"content": "Παραμύθι με λαγούς με πετραχείλια",
"category": "Kids"
}
}
]
}

I know all of the above might seem Greek to you, but you can see that the second hit is far more relevant than the first one. (I also tried "best_fields" and "cross_fields" as the "type" in my query; the ordering of the results stayed the same, with only minor differences in the scores.) I still cannot explain why the result that contains the word exactly as entered in the query is scored as less relevant than the other one.

I then created a new index using only the Greek analyser, and the results were pretty much the same.

I have been stuck on this for quite some time now and I would appreciate any idea/explanation that could help me get to the bottom of this.

FYI, the stemmed root of the query term "λαγοί" is "λαγ".

Thank you for your time.

mainec · October 25, 2016, 12:07pm

Did you try using the explain API:

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-explain.html

to figure out why your documents got the scores they received?
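
For illustration only, here is a minimal sketch of one way to get that breakdown. It is the query from the first post with the "explain" flag added, sent to the search endpoint of the index (for example GET /test/_search, assuming the index is named test as in the mapping request; the results shown in the first post come from an index called hub, so adjust the name as needed). With the flag set, each hit should carry an "_explanation" section showing how its score was put together per field. The dedicated explain endpoint linked above returns the same kind of breakdown for a single document, but its URL format differs between Elasticsearch versions, so the search flag is used here instead.

```json
{
  "explain": true,
  "query": {
    "dis_max": {
      "queries": [
        {
          "multi_match": {
            "query": "λαγοί",
            "type": "most_fields",
            "fields": ["title.greek", "content.greek", "content", "title"]
          }
        }
      ]
    }
  }
}
```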

Also, to check what your analyzer is actually doing to the terms you index, you might want to check out the analyze API:

https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html
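
As a rough sketch of what such a request could look like, the body below can be sent to the _analyze endpoint of the index (for example GET /test/_analyze, assuming the index is named test as in the mapping request in the first post). It asks for the custom greek analyzer from the index settings to be applied to the query term, and the response lists the tokens that are actually produced. Running it again with a word taken from one of the documents, such as λαγός, shows whether both forms end up as the same indexed term.

```json
{
  "analyzer": "greek",
  "text": "λαγοί"
}
```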

Hope this helps,
Isabel

George_Aba · November 9, 2016, 9:15am

Hi Isabel,

Thanks for pointing those out. I must admit I had forgotten how helpful they can be.

I am still trying to figure it out, but your suggestion really helped.

Cheers,
George
