Ukrainian analyzer

Hello! My hosting has the Ukrainian plugin installed from this page: Ukrainian analysis plugin | Elasticsearch Plugins and Integrations [8.7] | Elastic.
This command confirms it:
bin/elasticsearch-plugin list
analysis-phonetic
analysis-ukrainian
However, I can't get it working.

brand3@vs2556:/usr/share/elasticsearch$ bin/elasticsearch-plugin list
analysis-phonetic
analysis-ukrainian
brand3@vs2556:/usr/share/elasticsearch$ curl -X GET "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
    "filter" : ["lowercase", "trim", {"type" : "stop", "stopwords" : "_ukrainian_"}, {"type" : "stemmer", "language" : "ukrainian"}],
    "tokenizer" : "standard",
    "text" : "Потужність двигуна"
}
'
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "Invalid stemmer class specified: Ukrainian"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "Invalid stemmer class specified: Ukrainian",
    "caused_by" : {
      "type" : "class_not_found_exception",
      "reason" : "org.tartarus.snowball.ext.UkrainianStemmer"
    }
  },
  "status" : 400
}

Could somebody give me a clue about what I'm doing wrong, please?

Welcome!

Please don't post images of text as they are hard to read, may not display correctly for everyone, and are not searchable.

Instead, paste the text and format it with the </> icon or pairs of triple backticks (```), and check the preview window to make sure it's properly formatted before posting it. This makes it more likely that your question will receive a useful answer.

It would be great if you could update your post to solve this.

What's the output of this?

GET /_cat/plugins?v=true&s=component&h=name,component,version,description

Thank you for answering. I've updated my original request.
Your suggested GET request prints the following:
[1] 1696767
[2] 1696768

You need to run that in Kibana Dev Console.

I don't have it. I only have SSH access to my server, and the hosting support said that they installed the plugin for the Ukrainian language and confirmed it as shown in my initial post.

Run:

curl localhost:9200/_cat/plugins?v
name                component          version
vs2556.mirohost.net analysis-phonetic  7.17.9
vs2556.mirohost.net analysis-ukrainian 7.17.9
[1]-  Exit 1                  GET /_cat/plugins?v=true
[2]+  Done                    s=component
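
The `[1]` and `[2]` job lines above come from the shell, not from Elasticsearch: `GET ...` is Kibana Dev Console syntax, and an unquoted `&` in a URL tells the shell to run the command in the background. A minimal sketch of the same request as a curl call with the URL quoted, assuming Elasticsearch listens on localhost:9200 as in the other commands in this thread:

```shell
# Assumption: Elasticsearch is reachable at localhost:9200 (override with ES_HOST).
ES_HOST="${ES_HOST:-localhost:9200}"

# Quote the whole URL so the shell does not treat "&" as "run in background".
PLUGINS_URL="${ES_HOST}/_cat/plugins?v=true&s=component&h=name,component,version,description"

# --max-time keeps the call from hanging if the node is down.
curl -s --max-time 5 "$PLUGINS_URL" \
  || echo "Elasticsearch is not reachable at ${ES_HOST}" >&2
```

With the quotes in place, the `s` (sort) and `h` (columns) parameters reach Elasticsearch instead of being split off by the shell.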

So I read the documentation and apparently this plugin provides an analyzer named ukrainian but not a stemmer with that name.

So I guess you can try something like:

curl -X GET "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
    "analyzer" : "ukrainian",
    "text" : "Потужність двигуна"
}
'

Only the analyzer is exposed by the plugin.

Thank you. So it means that I need Java installed on my server. Could you tell me which version of Java it should be, please?

I did not mean that...
You just need the JVM that is bundled with the Elasticsearch distribution.

Why do you want to install Java?

If I understood correctly, I need Java for this plugin: elasticsearch/AnalysisUkrainianPlugin.java at 7.17 · elastic/elasticsearch · GitHub
If not, what do you mean by JVM? I use a Unix hosting machine.

You need Java if you want to modify/code/change this plugin.
If you just want to use it, just do what I said:

curl -X GET "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
    "analyzer" : "ukrainian",
    "text" : "Потужність двигуна"
}
'
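
To use the analyzer for indexed documents rather than only through `_analyze`, it can be set on a text field in the index mapping. A minimal sketch, assuming a hypothetical index named `products_uk` with a hypothetical field `description`; neither name comes from this thread:

```shell
# Assumption: Elasticsearch is reachable at localhost:9200 (override with ES_HOST).
ES_HOST="${ES_HOST:-localhost:9200}"
INDEX_NAME="products_uk"   # hypothetical index name, for illustration only

# Mapping that applies the plugin's "ukrainian" analyzer to one text field.
# "description" is a hypothetical field name.
MAPPING_BODY='{
  "mappings": {
    "properties": {
      "description": { "type": "text", "analyzer": "ukrainian" }
    }
  }
}'

# Create the index with that mapping.
curl -s --max-time 5 -X PUT "${ES_HOST}/${INDEX_NAME}?pretty" \
  -H 'Content-Type: application/json' \
  -d "$MAPPING_BODY" \
  || echo "Elasticsearch is not reachable at ${ES_HOST}" >&2
```

Text indexed into that field, and queries against it, would then go through the same analysis chain that the `_analyze` call shows.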

So that means there is no way to apply stemming to the Ukrainian language. OK, I got it.

But what is the output you are getting?
Is it what you need?

I'm getting:

brand3@vs2556:/usr/share/elasticsearch$ curl -X GET "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
    "analyzer" : "ukrainian",
    "text" : "Потужність двигуна"
}
'
{
  "tokens" : [
    {
      "token" : "потужність",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "двигун",
      "start_offset" : 11,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}

but I expected to get the stem of each word, which I can get for the Russian language with its stemmer:

brand3@vs2556:/usr/share/elasticsearch$ curl -X GET "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
    "char_filter" : ["html_strip"],
    "filter" : ["lowercase", "trim", {"type" : "stop", "stopwords" : "_russian_"}, {"type" : "stemmer", "language" : "russian"}],
    "tokenizer" : "standard",
    "text" : "<table border=\"0\" cellpadding=\"3\" cellspacing=\"0\" class=\"table_technical_02\" style=\"width: 100%\">\t<tbody>\t\t<tr>\t\t\t<td class=\"tt_col_1\">Мощность двигателя в превеликий:<\/td>\t\t\t<td class=\"tt_col_2\">1,5 \u043a\u0412\u0442;<\/td>\t\t<\/tr>\t<\/tbody><\/table>"
}
'
{
  "tokens" : [
    {
      "token" : "мощност",
      "start_offset" : 135,
      "end_offset" : 143,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "двигател",
      "start_offset" : 144,
      "end_offset" : 153,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "превелик",
      "start_offset" : 156,
      "end_offset" : 166,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "1,5",
      "start_offset" : 196,
      "end_offset" : 199,
      "type" : "<NUM>",
      "position" : 4
    },
    {
      "token" : "квт",
      "start_offset" : 200,
      "end_offset" : 203,
      "type" : "<ALPHANUM>",
      "position" : 5
    }
  ]
}

For the word "Потужність" it is supposed to be "Потужн".

I see. I guess that means the built-in stemmer does not work as expected.

This code (UkrainianMorfologikAnalyzer) comes from Lucene. I honestly don't know how to fix that problem.
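
Lucene's UkrainianMorfologikAnalyzer appears to be dictionary-based (Morfologik) rather than a suffix-stripping stemmer, so its tokens would be base word forms, not truncated stems like the Russian Snowball output above. For search, what matters is whether different inflected forms of a word end up as the same token. A quick way to check, as a sketch: the sample forms of "двигун" are an illustration that is not from this thread, and no expected output is shown because this has not been run against this server.

```shell
# Assumption: Elasticsearch is reachable at localhost:9200 (override with ES_HOST).
ES_HOST="${ES_HOST:-localhost:9200}"

# Three inflected forms of the same noun (illustrative sample text).
# If the analyzer normalizes them, all three tokens should be identical.
ANALYZE_BODY='{
  "analyzer": "ukrainian",
  "text": "двигун двигуна двигуном"
}'

curl -s --max-time 5 -X GET "${ES_HOST}/_analyze?pretty" \
  -H 'Content-Type: application/json' \
  -d "$ANALYZE_BODY" \
  || echo "Elasticsearch is not reachable at ${ES_HOST}" >&2
```

If the three tokens match, a query for one form should find documents containing the others, even though the token is a whole word and not a stem such as "потужн".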

Maybe you could open a GitHub issue in Elasticsearch and describe the text you are sending, what you are getting back, and what you expect to get?

Or maybe it would be better to open it in Lucene, as Elasticsearch is "only" exposing it here...

Yep, I will do some further investigation, and if I find something interesting I'll report it here. Thank you!
