How to analyze an HTML text with compound words

kralizek · November 2, 2015, 12:12pm

I'm writing a search service based on Elasticsearch for a bunch of sites with content written in agglutinative languages such as Swedish, German and Finnish.

I know that Elasticsearch offers language analyzers by default, but after some testing I found their support for these languages sloppy at best.

What I have so far is:

{
  "settings":{
    "analysis":{
      "filter":{
        "swedish_stop":{
          "type": "stop",
          "stopwords": "_swedish_"
        },
        "swedish_stemmer":{
          "type":"stemmer",
          "language":"swedish"
        },
        "swedish_words":{
          "type":"dictionary_decompounder",
          "word_list":["very", "long", "list", "of", "words", "almost", "13", "MB"]
        }
      },
      "analyzer":{
        "custom_swedish":{
          "tokenizer": "standard",
          "filter":[
            "lowercase",
            "swedish_stop",
            "swedish_stemmer",
            "swedish_words"
          ],
          "char_filter":[
            "html_strip"
          ]
        }
      }
    }
  }
}

Does anyone have a clue how to analyze HTML text with compound words properly?
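
For comparison, here is a minimal sketch of one variation that might be worth testing. It is not a confirmed fix. It assumes two things that are not in the post above: that the decompounder works better when it runs before the stemmer (so tokens are matched against the dictionary before their endings are cut), and that the 13 MB list can live in a file referenced through the `word_list_path` parameter instead of being inlined. The helper name `build_settings` and the path `analysis/swedish_words.txt` are hypothetical.

```python
import json


def build_settings(word_list_path):
    """Build index settings with the decompounder ahead of the stemmer.

    word_list_path is a hypothetical file path; Elasticsearch resolves it
    as an absolute path or relative to its config directory.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "swedish_stop": {"type": "stop", "stopwords": "_swedish_"},
                    "swedish_stemmer": {"type": "stemmer", "language": "swedish"},
                    "swedish_words": {
                        "type": "dictionary_decompounder",
                        # Assumption: load the dictionary from a file
                        # instead of an inline word_list.
                        "word_list_path": word_list_path,
                    },
                },
                "analyzer": {
                    "custom_swedish": {
                        "tokenizer": "standard",
                        "char_filter": ["html_strip"],
                        # Assumption: split compounds first, then stem.
                        "filter": [
                            "lowercase",
                            "swedish_stop",
                            "swedish_words",
                            "swedish_stemmer",
                        ],
                    }
                },
            }
        }
    }


settings = build_settings("analysis/swedish_words.txt")
print(json.dumps(settings, indent=2))
```

The printed JSON can be used as the body when creating the index. Running a few sample compounds through the `_analyze` API with `custom_swedish` should show whether the subwords come out as expected.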
