How to analyze an HTML text with compound words

(Renato Golia) #1

I'm writing a search service based on Elasticsearch for a bunch of sites with content written in agglutinative languages like Swedish, German and Finnish.

I know that Elasticsearch offers language analyzers by default, but after some testing I found their support for these languages sloppy at best.

What I have so far is:

          "type": "stop",
          "stopwords": "_swedish_"
          "word_list":["very", "long", "list", "of", "words", "almost", "13", "MB"]
          "tokenizer": "standard",
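
For reference, here is a minimal sketch of how fragments like these could fit together into a single custom analyzer. It is written as a Python dict so the assumptions can be commented, and it dumps to the JSON body an index-creation request would take. The names `swedish_compound`, `swedish_stop` and `swedish_decompounder` are made up for illustration, and the filter order is an assumption rather than something taken from the settings above. It relies on the built-in `html_strip` character filter, the `stop` token filter and the `dictionary_decompounder` token filter.

```python
import json

# Hypothetical names: "swedish_compound", "swedish_stop" and
# "swedish_decompounder" are illustrative and not from the original settings.
settings = {
    "analysis": {
        "filter": {
            # Built-in stop filter using the predefined Swedish stopword list.
            "swedish_stop": {"type": "stop", "stopwords": "_swedish_"},
            # Splits compound words into sub-words found in the dictionary.
            "swedish_decompounder": {
                "type": "dictionary_decompounder",
                # Placeholder entries only. For a very large list, the
                # "word_list_path" parameter (a file in the config directory)
                # is probably a better fit than an inline "word_list".
                "word_list": ["very", "long", "list", "of", "words"],
            },
        },
        "analyzer": {
            "swedish_compound": {
                "type": "custom",
                # Strip HTML markup before tokenizing.
                "char_filter": ["html_strip"],
                "tokenizer": "standard",
                # Assumed order: lowercase, then decompound, then drop stopwords.
                "filter": ["lowercase", "swedish_decompounder", "swedish_stop"],
            }
        },
    }
}

# Print the body that would be sent when creating the index.
print(json.dumps({"settings": settings}, indent=2))
```

An analyzer defined this way can be tried out with the `_analyze` API against a sample HTML snippet before reindexing anything.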

Does anyone have a clue how to set up analysis so that HTML content with compound words is handled properly?
