Hunspell filter problem


(John) #1

Hello,

I can't seem to get the hunspell filter to work.

My config:

curl -XPUT "http://localhost:9200/test_index" -d'
{
   "settings": {
      "analysis": {
         "analyzer": {
            "my_analyzer": {
               "type": "custom",
               "tokenizer":"ngram",
               "filter": [
                  "lowercase",
                  "standard",
                  "hspell"
               ]
            }
         },
         "filter" : {
            "hspell" : {
                "type" : "hunspell",
                "locale" : "rs_RS"
            }
        }
      }
   },
   "mappings": {
      "test_mapping": {
         "properties": {
            "name": {
               "index_analyzer": "basic",
               "type": "string",
               "store":true
            }
         }
      }
   }
}'

Requests:

1. curl -XGET "http://localhost:9200/_analyze?index_analyzer=my_analyzer&text=raća"
2. curl -XGET "http://localhost:9200/_analyze?index_analyzer=my_analyzer&text=raža"
3. curl -XGET "http://localhost:9200/_analyze?index_analyzer=my_analyzer&text=đaja"
4. curl -XGET "http://localhost:9200/_analyze?index_analyzer=my_analyzer&text=čača"
5. curl -XGET "http://localhost:9200/_analyze?index_analyzer=my_analyzer&text=liše"

Responses:

1.{
   "tokens": [
      {
         "token": "ra",
         "start_offset": 0,
         "end_offset": 2,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "263",
         "start_offset": 4,
         "end_offset": 7,
         "type": "<NUM>",
         "position": 2
      },
      {
         "token": "a",
         "start_offset": 8,
         "end_offset": 9,
         "type": "<ALPHANUM>",
         "position": 3
      }
   ]
}
2.{
   "tokens": [
      {
         "token": "ra",
         "start_offset": 0,
         "end_offset": 2,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "a",
         "start_offset": 3,
         "end_offset": 4,
         "type": "<ALPHANUM>",
         "position": 2
      }
   ]
}
3.{
   "tokens": [
      {
         "token": "273",
         "start_offset": 2,
         "end_offset": 5,
         "type": "<NUM>",
         "position": 1
      },
      {
         "token": "aja",
         "start_offset": 6,
         "end_offset": 9,
         "type": "<ALPHANUM>",
         "position": 2
      }
   ]
}
4.{
   "tokens": [
      {
         "token": "269",
         "start_offset": 2,
         "end_offset": 5,
         "type": "<NUM>",
         "position": 1
      },
      {
         "token": "a",
         "start_offset": 6,
         "end_offset": 7,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "269",
         "start_offset": 9,
         "end_offset": 12,
         "type": "<NUM>",
         "position": 3
      },
      {
         "token": "a",
         "start_offset": 13,
         "end_offset": 14,
         "type": "<ALPHANUM>",
         "position": 4
      }
   ]
}
5.{
   "tokens": [
      {
         "token": "li",
         "start_offset": 0,
         "end_offset": 2,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "e",
         "start_offset": 3,
         "end_offset": 4,
         "type": "<ALPHANUM>",
         "position": 2
      }
   ]
}

So in 1, 3 and 4 it replaces the special character with a number. In 2 and 5 it just drops the character. Shouldn't it replace those characters with a number as well?
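The numbers in those tokens are the decimal Unicode code points of the missing letters (ć is 263, đ is 273, č is 269). The sketch below reproduces the pattern under one assumption that the thread does not confirm: that the client encoded the text as windows-1252 instead of UTF-8. Letters that windows-1252 lacks get sent as numeric character references such as `&#263;`, while ž and š exist in windows-1252 and go out as single bytes that are not valid UTF-8. The helper names `as_sent` and `as_received` are made up for illustration.

```python
# Assumption: the page encoding was windows-1252 (cp1252); the thread never says.
WORDS = ["raća", "raža", "đaja", "čača", "liše"]


def as_sent(word, page_encoding="cp1252"):
    """Encode like a form submit in a non-Unicode page: characters the
    encoding cannot represent become numeric character references."""
    return word.encode(page_encoding, errors="xmlcharrefreplace")


def as_received(raw):
    """Decode the bytes as UTF-8; invalid bytes become U+FFFD."""
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    for word in WORDS:
        raw = as_sent(word)
        # Show the code points of the non-ASCII letters next to the bytes sent.
        points = [ord(ch) for ch in word if ord(ch) > 127]
        print(word, points, raw, as_received(raw))
```

For "raća" the bytes are `ra&#263;a`, which a standard tokenizer would split into "ra", "263" and "a" at the same offsets as response 1. For "raža" the ž becomes one invalid byte, so only "ra" and "a" survive, as in response 2.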

I'm using the latest .aff and .dic files.

Thanks.


(John) #2

So ok.

I've scrapped hunspell and gone with ICU instead.

Test index:

PUT /my_sexy_test
{
   "settings": {
      "analysis": {
         "analyzer": {
            "my_sexy_analyzer": {
               "char_filter": [
                  "icu_normalizer"
               ],
               "tokenizer": "icu_tokenizer"
            }
         }
      }
   }
}

And when I query the index with, for instance:

GET /my_sexy_test/_analyze?text=KARAKONĐULA&analyzer=my_sexy_analyzer

Elastic returns:

{
   "tokens": [
      {
         "token": "karakon",
         "start_offset": 0,
         "end_offset": 7,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "272",
         "start_offset": 9,
         "end_offset": 12,
         "type": "<NUM>",
         "position": 2
      },
      {
         "token": "ula",
         "start_offset": 13,
         "end_offset": 16,
         "type": "<ALPHANUM>",
         "position": 3
      }
   ]
}

Now, Karakonđula is a single word. I've also tested it with čvorak and žika (both single words). It doesn't understand the encoding. I'm using Elasticsearch 1.7, Sense, and version 2.7.0 of the ICU plugin.

At first I thought Sense was messing up the encoding, but when I call it from Python, or from curl with a charset set, it does the same thing. (I also tried the latin1 and latin2 encodings, and decoding the JSON as UTF-8 in Python):

# -*- coding: utf-8 -*-
import urllib2
import json


def main():
    url = 'http://localhost:9200/my_sexy_test/_analyze?text=KARAKONĐULA&analyzer=my_sexy_analyzer'
    req = urllib2.Request(url)
    out = urllib2.urlopen(req)
    data = out.read()
    print data

    data = json.loads(data)

    print data


if __name__ == '__main__':
    main()

It still returns crap. What the hell, dude. Is there a workaround for this, or what?
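One way to take the client's encoding out of the picture is to percent-encode the query text as UTF-8 before it goes into the URL. This is a minimal sketch in Python 3 (not the Python 2 used in the script above); the helper names `build_analyze_url` and `analyze` are hypothetical, and the host, index and analyzer are the ones from this post. Whether this alone would have fixed the setup described here is not something the thread confirms.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def build_analyze_url(text,
                      index="my_sexy_test",
                      analyzer="my_sexy_analyzer",
                      host="http://localhost:9200"):
    """Build an _analyze URL with the text percent-encoded as UTF-8."""
    return "{}/{}/_analyze?text={}&analyzer={}".format(
        host, index, quote(text, safe=""), quote(analyzer, safe=""))


def analyze(text):
    """Call _analyze and decode the JSON response as UTF-8.
    Needs a running node, so it is not called here."""
    with urlopen(build_analyze_url(text)) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Only print the URL; the Đ should appear as %C4%90.
    print(build_analyze_url("KARAKONĐULA"))
```

With the Đ sent as `%C4%90`, the server receives the two UTF-8 bytes for U+0110 no matter what encoding the terminal or browser is set to.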

Cheers


(John) #3

Yay finally I fixed it!!!!!

Man, this elastic is sexy!

Cheers dudes!!


(Mark Walkom) #4

How did you fix this?
It might help others in future :)


(John) #5

It was hell.

I had to change the encoding in Chrome to UTF-8 :Đ

It seems Sense uses whatever encoding the browser is set to, so it was sending crap to Elasticsearch, which then interpreted that crap and sent crap back.
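For anyone hitting the same symptom, a rough way to spot it in an _analyze response is to look for `<NUM>` tokens whose value happens to be the code point of a non-ASCII letter. This is only a heuristic sketch, and the function name `suspicious_tokens` is made up; a genuine number in the text can trigger a false positive.

```python
def suspicious_tokens(response):
    """Return (token, letter) pairs for <NUM> tokens that look like the
    decimal code point of a non-ASCII letter. Heuristic only."""
    hits = []
    for tok in response.get("tokens", []):
        value = tok.get("token", "")
        if tok.get("type") == "<NUM>" and value.isdigit():
            code_point = int(value)
            # Skip ASCII and anything outside the Unicode range.
            if 128 <= code_point <= 0x10FFFF and chr(code_point).isalpha():
                hits.append((value, chr(code_point)))
    return hits


# Tokens copied from the KARAKONĐULA response earlier in the thread.
sample = {
    "tokens": [
        {"token": "karakon", "type": "<ALPHANUM>"},
        {"token": "272", "type": "<NUM>"},
        {"token": "ula", "type": "<ALPHANUM>"},
    ]
}

if __name__ == "__main__":
    print(suspicious_tokens(sample))
```

On that sample, the token "272" maps back to Đ, the letter that went missing from the word.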

Now everything is hunky dory and my elastic works like a sweetheart.

Cheers!

