Language analysers' behaviour in ES


(George Aba) #1

Hi all,

I recently started trying out ES and I have to say it looks amazing. To better understand how it works, and since I want to use it with the Greek language, I set up a new index with a mapping for the two fields I want to store, title and content:

PUT /test
{
  "settings": {
    "analysis": {
      "filter": {
        "greek_stop": {
          "type": "stop",
          "stopwords": "_greek_"
        },
        "greek_lowercase": {
          "type": "lowercase",
          "language": "greek"
        },
        "greek_stemmer": {
          "type": "stemmer",
          "language": "greek"
        }
      },
      "analyzer": {
        "greek": {
          "tokenizer": "standard",
          "filter": [
            "greek_lowercase",
            "greek_stop",
            "greek_stemmer"
          ]
        }
      }
    }
  },
  "mappings": {
    "article": {
      "properties": {
        "title": {
          "type": "string",
          "fields": {
            "greek": {
              "type": "string",
              "analyzer": "greek"
            }
          }
        },
        "content": {
          "type": "string",
          "fields": {
            "greek": {
              "type": "string",
              "analyzer": "greek"
            }
          }
        }
      }
    }
  }
}

After carefully reading the ES documentation, I decided to index title and content with two analysers each: the default one on the main field and the Greek one on a sub-field. As the documentation puts it:

However, if we have two documents, one of which contains jumped and the other jumping, the user would probably expect the first document to rank higher, as it contains exactly what was typed in.

https://www.elastic.co/guide/en/elasticsearch/guide/2.x/most-fields.html

Then I bulk inserted a couple of dummy entries in order to test my mapping:

{"index": {"_id":1}}
{"title": "Τρεις λαγοί μα τι λαγοί","content": "Παραμύθι με λαγούς με πετραχείλια","category": "Kids"}
{"index": {"_id":2}}
{"title": "Ο λαγός και η χελώνα","content": "Παραμύθι για την ιστορία του λαγού με την χελώνα","category": "Kids"}
{"index": {"_id":3}}
{"title": "Οι χελώνες εξαφανίστηκαν","content": "Οι λόγοι που οι χελώνες δεν βρίσκονται τόσο συχνά είναι πολλοί.","category": "Documentary"}
{"index": {"_id":4}}
{"title": "Ποια χελώνα ζει περισσότερο","content": "Ένας λόγος που σαν ζώο η χελώνα ζει περισσότερο βρίσκεται στο DNA των χελώνων","category": "Article"}
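
For anyone who wants to reproduce this, here is a minimal sketch of how a bulk body like the one above could be assembled in Python. The helper name build_bulk_body is hypothetical, and only the first two documents are repeated for brevity; the sketch only builds the payload and does not send anything.

```python
import json


def build_bulk_body(docs):
    """Return a newline-delimited JSON string for the _bulk endpoint.

    docs: iterable of (doc_id, source_dict) pairs.
    """
    lines = []
    for doc_id, source in docs:
        # Action line first, then the document source on the next line.
        lines.append(json.dumps({"index": {"_id": doc_id}}))
        # ensure_ascii=False keeps the Greek text readable in the payload.
        lines.append(json.dumps(source, ensure_ascii=False))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


docs = [
    (1, {"title": "Τρεις λαγοί μα τι λαγοί",
         "content": "Παραμύθι με λαγούς με πετραχείλια",
         "category": "Kids"}),
    (2, {"title": "Ο λαγός και η χελώνα",
         "content": "Παραμύθι για την ιστορία του λαγού με την χελώνα",
         "category": "Kids"}),
]

body = build_bulk_body(docs)
print(body)
```

The resulting string would be sent UTF-8 encoded to the bulk endpoint of the index, for example /test/article/_bulk if the index and type names from the mapping above are used.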

Then I executed the query below:

{
  "query": {
    "dis_max": {
      "queries": [
        {
          "multi_match": {
            "query": "λαγοί",
            "type": "most_fields",
            "fields": ["title.greek", "content.greek", "content", "title"]
          }
        }
      ]
    }
  }
}

Which gave me these results:

"hits": {
  "total": 2,
  "max_score": 0.20227146,
  "hits": [
    {
      "_index": "hub",
      "_type": "article",
      "_id": "2",
      "_score": 0.20227146,
      "_source": {
        "title": "Ο λαγός και η χελώνα",
        "content": "Παραμύθι για την ιστορία του λαγού με την χελώνα",
        "category": "Kids"
      }
    },
    {
      "_index": "hub",
      "_type": "article",
      "_id": "1",
      "_score": 0.11385604,
      "_source": {
        "title": "Τρεις λαγοί μα τι λαγοί",
        "content": "Παραμύθι με λαγούς με πετραχείλια",
        "category": "Kids"
      }
    }
  ]
}

I know all of the above might seem Greek to you :) but you can see that the second hit is far more relevant than the first one: its title contains the query word exactly as typed, twice. I also tried "best_fields" and "cross_fields" as the "type" of my query, and the ordering of the results stayed the same, with only minor differences in the scores. I still cannot explain why the document that contains the word exactly as entered in the query is ranked as less relevant than the other one.

I then created a new index using only the Greek analyser, and the results were pretty much the same.

I have been stuck on this for quite some time now and I would appreciate any idea/explanation that could help me get to the bottom of this.

FYI, the stemmed root of the query term "λαγοί" is "λαγ".

Thank you for your time.


(Isabel Drost-Fromm) #2

Did you try using the explain API:

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-explain.html

to figure out why your documents got the scores they received?
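
As a rough illustration of that suggestion, the sketch below builds explain requests for documents 1 and 2 using the query from the original post. The helper name explain_request is hypothetical, the index name test and type article are taken from the mapping in the first post, and the URL layout /{index}/{type}/{id}/_explain is an assumption that fits releases with typed mappings (newer versions lay the URL out differently), so check it against the version you run.

```python
import json


def explain_request(index, doc_type, doc_id, query):
    """Return (path, body) for an explain call on a single document."""
    path = "/{}/{}/{}/_explain".format(index, doc_type, doc_id)
    return path, {"query": query}


# The multi_match query from the original post.
query = {
    "multi_match": {
        "query": "λαγοί",
        "type": "most_fields",
        "fields": ["title.greek", "content.greek", "content", "title"],
    }
}

# Build one request per document whose score should be compared.
for doc_id in (1, 2):
    path, body = explain_request("test", "article", doc_id, query)
    print("GET", path)
    print(json.dumps(body, ensure_ascii=False, indent=2))
```

Sending each body to its path returns a breakdown of the score per field, which can then be compared side by side for the two documents.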

Also, to check what your analyzer is actually doing to the terms you index, you might want to check out the analyze API:

https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html
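
A minimal sketch of how that check could look, assuming the index is called test and the custom analyzer is called greek as in the first post. The helper names analyze_request and token_strings are hypothetical, and passing analyzer and text in a JSON body is an assumption about the version in use (some older releases expect them as URL parameters instead). The word list contains the inflected forms that appear in the sample documents.

```python
import json


def analyze_request(index, analyzer, text):
    """Return (path, body) for an _analyze call against one index."""
    path = "/{}/_analyze".format(index)
    return path, {"analyzer": analyzer, "text": text}


def token_strings(response):
    """Extract the token texts from a parsed _analyze response."""
    return [t["token"] for t in response.get("tokens", [])]


# Inflected forms of the same noun taken from the sample documents.
words = ["λαγοί", "λαγός", "λαγού", "λαγούς"]

# Compare the custom Greek analyzer with the standard one.
for analyzer in ("greek", "standard"):
    for word in words:
        path, body = analyze_request("test", analyzer, word)
        print("GET", path, json.dumps(body, ensure_ascii=False))
```

Running token_strings over each response shows which terms actually end up in the index, and whether all four forms are reduced to the same stem by the Greek analyzer.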

Hope this helps,
Isabel


(George Aba) #3

Hi Isabel,

Thanks for pointing those out. I must admit I had forgotten how helpful they can be.

I am still trying to figure it out, but your suggestion really helped.

Cheers,
George

