Basic word_list problem

I have this

{
    "analysis": {
      "filter": {
        "swedish_stop": {
          "type":       "stop",
          "stopwords":  "_swedish_" 
        },
        "swedish_stemmer": {
          "type":       "stemmer",
          "language":   "swedish"
        },
        "swedish_dictionary": {
          "type" : "dictionary_decompounder",
          "word_list": ["kontinent", "fast", "land", "genom", "tänkt", "klar", "uppmana"],
        }
      },
      "analyzer": {
        "swedish": {
          "tokenizer":  "standard",
          "filter": [
            "swedish_dictionary",
            "lowercase",
            "swedish_stop",
            "swedish_stemmer"
          ]
        }
      }
    }
}

I have a set of texts indexed with this. Among them is, for example, the Swedish word "interkontinental". The processing works to some extent: I do find "interkontinental" if I search for exactly that. But if I search for "inter", "kontinental" or "kontinent" I do not find it.

I am running Elasticsearch embedded. Is that an issue? Since I provide the list of words in the settings, i.e. not referencing a separate file, I would not expect it to be.

Am I expecting the wrong thing? The current behaviour is not acceptable for my use case. The word "fastland" ("mainland" in English) occurs in the texts but it is not found when searching for "land".

This seems like a small and simple case to me. Any help is much appreciated. I have also tested different orderings of the filters in the analyzer, with no difference in the result.
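
To illustrate what I mean by "search": the queries are plain match queries along these lines (the index and field names here are only illustrative, not necessarily what my real setup uses):

GET my_index/_search
{
   "query": {
      "match": {
         "text": "land"
      }
   }
}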

I tried your example and it works fine. Are you sure that your index is created with the correct settings? The analysis must be set under "settings". Here is the recreation I tried:

PUT my_index
{
   "settings":{
      "analysis":{
         "filter":{
            "swedish_stop":{
               "type":"stop",
               "stopwords":"_swedish_"
            },
            "swedish_stemmer":{
               "type":"stemmer",
               "language":"swedish"
            },
            "swedish_dictionary":{
               "type":"dictionary_decompounder",
               "word_list":[
                  "kontinent",
                  "fast",
                  "land",
                  "genom",
                  "tänkt",
                  "klar",
                  "uppmana"
               ]
            }
         },
         "analyzer":{
            "swedish_l":{
               "tokenizer":"standard",
               "filter":[
                  "swedish_dictionary",
                  "lowercase"
               ]
            }
         }
      }
   },
   "mappings":{
      "doc":{
         "properties":{
            "text":{
               "type":"text",
               "analyzer":"swedish_l"
            }
         }
      }
   }
}

POST my_index/_analyze
{
	"text": "fastland",
	"analyzer": "swedish_l"
}

=> returns:
{
    "tokens": [
        {
            "token": "fastland",
            "start_offset": 0,
            "end_offset": 8,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "fast",
            "start_offset": 0,
            "end_offset": 8,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "land",
            "start_offset": 0,
            "end_offset": 8,
            "type": "<ALPHANUM>",
            "position": 0
        }
    ]
}
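
And to sanity-check the search side, something like this (untested here, but it follows directly from the tokens above, using the mapping from the recreation) should match the document when searching for "land":

PUT my_index/doc/1?refresh
{
   "text": "fastland"
}

GET my_index/_search
{
   "query": {
      "match": {
         "text": "land"
      }
   }
}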

Thanks Jim
The analysis was under settings, but I had no mappings. As I understand it, the mappings specify that when a "doc" is indexed the "swedish_l" analyzer shall be used. When I did what you said, it works.

What I then need is an extensive list of words. My first idea was to use "word_list_path" and point it to a file (a sketch of that variant is included below, after the mappings). When looking around on the net I found that the hunspell dictionary seems to be the most used. It is also incorporated in the Elasticsearch source code, which is reassuring. Since each line of that dictionary describes word forms in a format different from what the "word_list" property held, I concluded that the hunspell filter must be used with that list. What I now have is

{
  "analysis": {
    "filter": {
      "sv_SE" : {
        "type" : "hunspell",
        "locale" : "sv_SE",
        "dedup" : true
      }
    },
    "analyzer": {
      "swedish_language": {
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "sv_SE"
        ]
      }
    }
  }
}

and the mappings

{
  "article": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "swedish_language"
      }
    }
  }
}

(I have "doc" -> "article" and "swedish_l" -> "swedish_language", I skip stopwords until this works. )

The page I am following is
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-hunspell-tokenfilter.html

Before it was possible to execute I got an exception saying that the "sv_SE" dictionary could not be loaded. I fixed the path and converted the dictionary file I had from ISO-8859-1 to UTF-8.

Two things bother me here. The setting

Settings.builder()
	 ...
	 .put("indices.analysis.hunspell.dictionary.location", dict_path)

did not work, and the file layout specified as "conf/hunspell" did not work either. By looking at the ES source code in core/src/main/java/org/elasticsearch/env/Environment.java on line 103 I changed to "config/hunspell", and then it works, provided that the sv_SE folder is at $path_home/config/hunspell/sv_SE.
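
So the layout that ends up working for me looks like this (the exact dictionary file names depend on the downloaded dictionary; the usual hunspell pair is an .aff and a .dic file):

$path_home/
  config/
    hunspell/
      sv_SE/
        sv_SE.aff
        sv_SE.dic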

My questions:
In general, would you say I am on the right track?
The "fastland" example does not work. Any idea why?

Hello again

I have gained some insights since my last post and am writing this in case someone is about to respond. In retrospect my post above is somewhat confusing, so it makes sense for me to explain myself.

I thought that words like "fastland" and "interkontinental" would spontaneously be broken down into proper subwords by some well-crafted internal machinery. When I wrote that "fastland does not work", what I had in mind was that it did not get analysed into the tokens "fast" and "land"; I only got "fastland" back. In the case of "fastland" it is not certain that this is what you want, but with "interkontinental" there should be some hit when searching for "kontinental".
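
To make that concrete: with the dictionary_decompounder from my first post, I would expect an _analyze call like this one (index and analyzer names as in the recreation above) to return both "interkontinental" and "kontinent":

POST my_index/_analyze
{
   "text": "interkontinental",
   "analyzer": "swedish_l"
}

Since a match query for "kontinental" goes through the same analysis and would also produce "kontinent", that search should then hit.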

Now I have a better idea of what to expect.

I am no longer waiting for a reply to this. When I understood things better I asked "the same thing" in another way. I'll mark this as solved.
