Hunspell filter problem

JohnB · November 9, 2015, 8:21pm

Hello,

I can't seem to get the hunspell filter to work.

My config:

curl -XPUT "http://localhost:9200/test_index" -d'
{
   "settings": {
      "analysis": {
         "analyzer": {
            "my_analyzer": {
               "type": "custom",
               "tokenizer":"ngram",
               "filter": [
                  "lowercase",
                  "standard",
                  "hspell"
               ]
            }
         },
         "filter" : {
            "hspell" : {
                "type" : "hunspell",
                "locale" : "rs_RS"
            }
        }
      }
   },
   "mappings": {
      "test_mapping": {
         "properties": {
            "name": {
               "index_analyzer": "basic",
               "type": "string",
               "store":true
            }
         }
      }
   }
}'

Requests:

1. curl -XGET "http://localhost:9200/_analyze?index_analyzer=my_analyzer&text=raća"
2. curl -XGET "http://localhost:9200/_analyze?index_analyzer=my_analyzer&text=raža"
3. curl -XGET "http://localhost:9200/_analyze?index_analyzer=my_analyzer&text=đaja"
4. curl -XGET "http://localhost:9200/_analyze?index_analyzer=my_analyzer&text=čača"
5. curl -XGET "http://localhost:9200/_analyze?index_analyzer=my_analyzer&text=liše"

Responses:

1.{
   "tokens": [
      {
         "token": "ra",
         "start_offset": 0,
         "end_offset": 2,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "263",
         "start_offset": 4,
         "end_offset": 7,
         "type": "<NUM>",
         "position": 2
      },
      {
         "token": "a",
         "start_offset": 8,
         "end_offset": 9,
         "type": "<ALPHANUM>",
         "position": 3
      }
   ]
}
2.{
   "tokens": [
      {
         "token": "ra",
         "start_offset": 0,
         "end_offset": 2,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "a",
         "start_offset": 3,
         "end_offset": 4,
         "type": "<ALPHANUM>",
         "position": 2
      }
   ]
}
3.{
   "tokens": [
      {
         "token": "273",
         "start_offset": 2,
         "end_offset": 5,
         "type": "<NUM>",
         "position": 1
      },
      {
         "token": "aja",
         "start_offset": 6,
         "end_offset": 9,
         "type": "<ALPHANUM>",
         "position": 2
      }
   ]
}
4.{
   "tokens": [
      {
         "token": "269",
         "start_offset": 2,
         "end_offset": 5,
         "type": "<NUM>",
         "position": 1
      },
      {
         "token": "a",
         "start_offset": 6,
         "end_offset": 7,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "269",
         "start_offset": 9,
         "end_offset": 12,
         "type": "<NUM>",
         "position": 3
      },
      {
         "token": "a",
         "start_offset": 13,
         "end_offset": 14,
         "type": "<ALPHANUM>",
         "position": 4
      }
   ]
}
5.{
   "tokens": [
      {
         "token": "li",
         "start_offset": 0,
         "end_offset": 2,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "e",
         "start_offset": 3,
         "end_offset": 4,
         "type": "<ALPHANUM>",
         "position": 2
      }
   ]
}

So in 1, 3 and 4 it replaces the special character with a number. In 2 and 5 it doesn't do anything. Shouldn't it replace these chars with a number as well?

I'm using the latest aff and dic files.

Thanks.
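
A note on the numbers in the responses: 263, 273 and 269 are the decimal Unicode code points of ć, đ and č, and the token offsets line up with the text arriving as numeric character references such as `ra&#263;a`. That is consistent with a client that encodes the request as windows-1252 rather than UTF-8, since š and ž exist in that code page while ć, đ and č do not. The sketch below only reproduces that pattern in Python 3. The windows-1252 guess and both helper names are assumptions for illustration, not something confirmed about Sense or Elasticsearch.

```python
# Assumption: the client encoded the text as windows-1252 (cp1252), not UTF-8.
def simulate_cp1252_client(text):
    # Characters missing from cp1252 are replaced with decimal numeric
    # character references (e.g. "&#263;"), as HTML form encoding does.
    return text.encode("cp1252", errors="xmlcharrefreplace")


def decode_as_utf8(raw):
    # Lone cp1252 bytes such as 0x9E are not valid UTF-8, so they decode
    # to the replacement character U+FFFD.
    return raw.decode("utf-8", errors="replace")


for word in ["raća", "raža", "đaja", "čača", "liše"]:
    raw = simulate_cp1252_client(word)
    print(word, [ord(c) for c in word if ord(c) > 127], raw, repr(decode_as_utf8(raw)))
```

In this simulation ć, đ and č come out as `&#263;`, `&#273;` and `&#269;`, which a tokenizer would split into a number plus the surrounding letters, while š and ž become single bytes that are invalid UTF-8 and are effectively lost.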

JohnB · November 10, 2015, 5:42pm

So, OK.

I've scrapped Hunspell and went with ICU.

Test index:

PUT /my_sexy_test
{
   "settings": {
      "analysis": {
         "analyzer": {
            "my_sexy_analyzer": {
               "char_filter": [
                  "icu_normalizer"
               ],
               "tokenizer": "icu_tokenizer"
            }
         }
      }
   }
}

And when I query the index with, for instance:

GET /my_sexy_test/_analyze?text=KARAKONĐULA&analyzer=my_sexy_analyzer

Elastic returns:

{
   "tokens": [
      {
         "token": "karakon",
         "start_offset": 0,
         "end_offset": 7,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "272",
         "start_offset": 9,
         "end_offset": 12,
         "type": "<NUM>",
         "position": 2
      },
      {
         "token": "ula",
         "start_offset": 13,
         "end_offset": 16,
         "type": "<ALPHANUM>",
         "position": 3
      }
   ]
}

Now, Karakonđula is a single word. Also, I've tried testing it with čvorak and žika (all single words). It doesn't understand the encoding. I'm using Elasticsearch 1.7, Sense and ICU 2.7.0.

At first I thought Sense was messing up the encoding, but when I call it from Python or curl with a charset it does the same thing. (I also tried the latin1 and latin2 encodings, and decoding the JSON as UTF-8 in Python):

# -*- coding: utf-8 -*-
import urllib2
import json


def main():
    url = 'http://localhost:9200/my_sexy_test/_analyze?text=KARAKONĐULA&analyzer=my_sexy_analyzer'
    req = urllib2.Request(url)
    out = urllib2.urlopen(req)
    data = out.read()
    print data

    data = json.loads(data)

    print data


if __name__ == '__main__':
    main()

It still returns crap. What the hell, dude. Is there a workaround for this, or what?

Cheers

JohnB · November 10, 2015, 6:41pm

Yay finally I fixed it!!!!!

Man, this elastic is sexy!

Cheers dudes!!

warkolm · November 11, 2015, 6:45am

How did you fix this?
It might help others in the future.

JohnB · November 12, 2015, 5:23pm

It was hell.

I had to change the encoding in Chrome to UTF-8 :Đ

Sense, it seems, uses whatever encoding Chrome is set to, so it was sending crap to Elasticsearch, which then interpreted that crap and sent crap back.

Now everything is hunky dory and my elastic works like a sweetheart.

Cheers!
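
For anyone hitting the same thing outside Sense, one way to rule out client-side encoding issues is to percent-encode the query text as UTF-8 explicitly. Below is a minimal sketch, assuming Python 3 (the script earlier in the thread is Python 2) and the query-string form of `_analyze` used in this thread. The helper name `build_analyze_url` is made up for illustration, and the sketch only builds the URL without sending it.

```python
from urllib.parse import quote, urlencode


def build_analyze_url(base, index, analyzer, text):
    """Return an _analyze URL whose query string is percent-encoded UTF-8."""
    # urlencode percent-encodes non-ASCII characters using the given encoding.
    query = urlencode({"analyzer": analyzer, "text": text}, encoding="utf-8")
    return "{}/{}/_analyze?{}".format(base.rstrip("/"), quote(index), query)


url = build_analyze_url(
    "http://localhost:9200", "my_sexy_test", "my_sexy_analyzer", "KARAKONĐULA"
)
print(url)

# To send it (needs a running node, so it is left commented out here):
# import urllib.request
# print(urllib.request.urlopen(url).read().decode("utf-8"))
```

Here Đ goes out as `%C4%90`, its two UTF-8 bytes, so the request no longer depends on the charset the client happens to be configured with.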
