Ukrainian analyzer

Vladimir_Talabko · May 10, 2023, 9:56am

Hello! My hosting provider has installed the Ukrainian plugin described on this page: Ukrainian analysis plugin | Elasticsearch Plugins and Integrations [8.7] | Elastic.
The plugin list confirms it:
bin/elasticsearch-plugin list
analysis-phonetic
analysis-ukrainian
However, I can't get it working.

brand3@vs2556:/usr/share/elasticsearch$ bin/elasticsearch-plugin list
analysis-phonetic
analysis-ukrainian
brand3@vs2556:/usr/share/elasticsearch$ curl -X GET "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
    "filter" : ["lowercase", "trim", {"type" : "stop", "stopwords" : "_ukrainian_"}, {"type" : "stemmer", "language" : "ukrainian"}],
    "tokenizer" : "standard",
    "text" : "Потужність двигуна"
}
'
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "Invalid stemmer class specified: Ukrainian"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "Invalid stemmer class specified: Ukrainian",
    "caused_by" : {
      "type" : "class_not_found_exception",
      "reason" : "org.tartarus.snowball.ext.UkrainianStemmer"
    }
  },
  "status" : 400
}

Could somebody give me a clue about what I'm doing wrong, please?

dadoonet · May 10, 2023, 10:16am

Welcome!

Please don't post images of text as they are hard to read, may not display correctly for everyone, and are not searchable.

Instead, paste the text and format it with </> icon or pairs of triple backticks (```), and check the preview window to make sure it's properly formatted before posting it. This makes it more likely that your question will receive a useful answer.

It would be great if you could update your post to solve this.

What's the output of this?

GET /_cat/plugins?v=true&s=component&h=name,component,version,description

Vladimir_Talabko · May 10, 2023, 12:01pm

Thank you for answering. I've updated my original request.
The GET request you suggested prints the following:
[1] 1696767
[2] 1696768

dadoonet · May 10, 2023, 12:21pm

You need to run that in Kibana Dev Console.

Vladimir_Talabko · May 10, 2023, 12:27pm

I don't have it. I only have SSH access to my server. The hosting support said they installed the plugin for the Ukrainian language and showed the plugin list from my initial post as proof.

dadoonet · May 10, 2023, 12:50pm

Run:

curl localhost:9200/_cat/plugins?v
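
The fuller request from earlier in the thread can also be sent over SSH with curl, as long as the URL is quoted. Unquoted, each `&` ends the command and sends it to the background, which is where output such as `[1] 1696767` comes from. A minimal sketch, assuming Elasticsearch listens on localhost:9200 as in the other commands here; the variable name `url` is only for illustration:

```shell
# Assumption: Elasticsearch is reachable at localhost:9200.
# Single quotes keep '?' and '&' away from the shell, so the whole
# query string reaches curl as one argument.
url='localhost:9200/_cat/plugins?v=true&s=component&h=name,component,version,description'

# -s hides the progress meter; --max-time avoids hanging if nothing answers.
curl -s --max-time 5 "$url" || echo "request failed: is Elasticsearch running on localhost:9200?" >&2
```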

Vladimir_Talabko · May 10, 2023, 1:07pm

name                component          version
vs2556.mirohost.net analysis-phonetic  7.17.9
vs2556.mirohost.net analysis-ukrainian 7.17.9
[1]-  Exit 1                  GET /_cat/plugins?v=true
[2]+  Done                    s=component

dadoonet · May 10, 2023, 1:49pm

So I read the documentation and apparently this plugin provides an analyzer named ukrainian but not a stemmer with that name.

So I guess you can try something like:

curl -X GET "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
    "analyzer" : "ukrainian",
    "text" : "Потужність двигуна"
}
'

Only the analyzer is exposed by the plugin:

github.com/elastic/elasticsearch/blob/7.17/plugins/analysis-ukrainian/src/main/java/org/elasticsearch/plugin/analysis/ukrainian/AnalysisUkrainianPlugin.java#L23-L26

    @Override
    public Map<String, AnalysisProvider<AnalyzerProvider<? extends Analyzer>>> getAnalyzers() {
        return singletonMap("ukrainian", UkrainianAnalyzerProvider::new);
    }
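
Because the plugin registers an analyzer, it can be referenced by name in a field mapping as well as in `_analyze`. A minimal sketch, assuming Elasticsearch on localhost:9200; the index name `products`, the field `title`, and the names `ES_URL`, `mapping_body` and `create_index` are hypothetical and only for illustration:

```shell
# Assumption: Elasticsearch on localhost:9200 unless ES_URL says otherwise.
ES_URL="${ES_URL:-localhost:9200}"

# Mapping body: run the (hypothetical) "title" field through the
# plugin's "ukrainian" analyzer at index and search time.
mapping_body='{
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "ukrainian" }
    }
  }
}'

# Creates the (hypothetical) "products" index with that mapping.
create_index() {
  curl -s --max-time 5 -X PUT "$ES_URL/products?pretty" \
    -H 'Content-Type: application/json' -d "$mapping_body"
}
```

Running `create_index` once should create the index; text indexed into `title` would then be analyzed the same way as in the `_analyze` call above.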

Vladimir_Talabko · May 10, 2023, 2:22pm

Thank you. So does that mean I need Java installed on my server? If so, could you tell me which version is required?

dadoonet · May 10, 2023, 2:41pm

I did not mean that...
You only need the JVM that comes bundled with the Elasticsearch distribution.

Why do you want to install Java?

Vladimir_Talabko · May 10, 2023, 2:44pm

If I understood correctly, I need Java for this plugin: elasticsearch/AnalysisUkrainianPlugin.java at 7.17 · elastic/elasticsearch · GitHub
If not, what do you mean by JVM? I use a Unix hosting machine.

dadoonet · May 10, 2023, 2:59pm

You need Java if you want to modify/code/change this plugin.
If you just want to use it, just do what I said:

curl -X GET "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
    "analyzer" : "ukrainian",
    "text" : "Потужність двигуна"
}
'

Vladimir_Talabko · May 10, 2023, 3:03pm

So that means there is no way to apply stemming to Ukrainian text. OK, I got it.

dadoonet · May 10, 2023, 3:04pm

But what is the output you are getting?
Is it what you need?

Vladimir_Talabko · May 10, 2023, 3:12pm

I'm getting:

brand3@vs2556:/usr/share/elasticsearch$ curl -X GET "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
    "analyzer" : "ukrainian",
    "text" : "Потужність двигуна"
}
'
{
  "tokens" : [
    {
      "token" : "потужність",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "двигун",
      "start_offset" : 11,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}

but I expected to get the stem of each word, which is what I get for Russian with its stemmer:

brand3@vs2556:/usr/share/elasticsearch$ curl -X GET "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
    "char_filter" : ["html_strip"],
    "filter" : ["lowercase", "trim", {"type" : "stop", "stopwords" : "_russian_"}, {"type" : "stemmer", "language" : "russian"}],
    "tokenizer" : "standard",
    "text" : "<table border=\"0\" cellpadding=\"3\" cellspacing=\"0\" class=\"table_technical_02\" style=\"width: 100%\">\t<tbody>\t\t<tr>\t\t\t<td class=\"tt_col_1\">Мощность двигателя в превеликий:<\/td>\t\t\t<td class=\"tt_col_2\">1,5 \u043a\u0412\u0442;<\/td>\t\t<\/tr>\t<\/tbody><\/table>"
}
'
{
  "tokens" : [
    {
      "token" : "мощност",
      "start_offset" : 135,
      "end_offset" : 143,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "двигател",
      "start_offset" : 144,
      "end_offset" : 153,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "превелик",
      "start_offset" : 156,
      "end_offset" : 166,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "1,5",
      "start_offset" : 196,
      "end_offset" : 199,
      "type" : "<NUM>",
      "position" : 4
    },
    {
      "token" : "квт",
      "start_offset" : 200,
      "end_offset" : 203,
      "type" : "<ALPHANUM>",
      "position" : 5
    }
  ]
}

For the word "Потужність", the result is supposed to be "Потужн".

dadoonet · May 10, 2023, 3:19pm

I see. I guess that means the built-in stemmer does not work as expected.

This code (UkrainianMorfologikAnalyzer) comes from Lucene. I honestly don't know how to fix that problem.

Maybe you could open a GitHub issue in Elasticsearch and describe the text you are sending, what you are getting back, and what you expected to get?

Or maybe the issue would be better opened in Lucene, as Elasticsearch is "only" exposing the analyzer here...
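
One note on the output above: as far as I understand it, the Lucene analyzer behind this plugin looks words up in a Morfologik dictionary and returns their base forms (lemmas) rather than stripping suffixes. That would explain why "двигуна" becomes "двигун" while "потужність", already a base form, comes back unchanged. If only a few specific words need a shorter stem, one possible workaround is the built-in `stemmer_override` token filter. A minimal, untested sketch, assuming Elasticsearch on localhost:9200; the rule is a hand-written example, and `ES_URL`, `analyze_body` and `analyze_with_override` are names made up for illustration:

```shell
# Assumption: Elasticsearch on localhost:9200 unless ES_URL says otherwise.
ES_URL="${ES_URL:-localhost:9200}"

# Custom chain: lowercase first, then apply one hand-written override rule.
# The rule is an illustrative assumption, not something the plugin ships.
analyze_body='{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    { "type": "stemmer_override", "rules": ["потужність => потужн"] }
  ],
  "text": "Потужність двигуна"
}'

# Sends the body to the _analyze API and prints the resulting tokens.
analyze_with_override() {
  curl -s --max-time 5 -X GET "$ES_URL/_analyze?pretty" \
    -H 'Content-Type: application/json' -d "$analyze_body"
}
```

Words without a rule pass through this chain unchanged, so the approach only suits a small, known vocabulary; it is not a general Ukrainian stemmer.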

Vladimir_Talabko · May 10, 2023, 3:46pm

Yep, I will investigate further, and if I find something interesting I'll report it here. Thank you!

