Custom normalization/filtering?


(Barsk) #1

I have spent the last day or so trying to get my head around how normalization is done in ElasticSearch and how to customize it for my needs.

I am responsible for a web application that indexes library catalogues, i.e. the card catalogues that all libraries had before the digital era. These catalogues can be really old, spanning for example 1600-1974, and are often sorted according to specific rules. For instance, all accents should be removed, except on the letters that are part of our alphabet in Sweden: åäö, ÅÄÖ. For all the rest the accents are removed, e.g. é=e. Also, some catalogues have special rules such as v=w, i=j, etc. All my indexes are ISO-8859-1.
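
To make the rules above concrete, here is a minimal Python sketch. It is an illustration only, not the webapp's actual code: the function name `normalize`, the `equivalences` switch, and the direction of the equivalences (w becomes v, j becomes i) are all assumptions.

```python
import unicodedata

# Swedish letters whose diacritics must be kept.
KEEP = set("åäöÅÄÖ")

# Catalogue-specific equivalences. The direction (w -> v, j -> i) is an
# assumption; the rules only say that the letters are treated as equal.
EQUIV = str.maketrans({"w": "v", "W": "V", "j": "i", "J": "I"})


def normalize(text: str, equivalences: bool = True) -> str:
    """Strip accents except on åäö/ÅÄÖ, then optionally apply v=w, i=j."""
    # Compose first so that 'a' + combining ring is recognised as 'å'.
    text = unicodedata.normalize("NFC", text)
    out = []
    for ch in text:
        if ch in KEEP:
            out.append(ch)
            continue
        # Decompose the character and drop the combining marks (é -> e).
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    result = "".join(out)
    return result.translate(EQUIV) if equivalences else result
```

With these assumptions, `normalize("Linné")` gives `"Linne"`, `normalize("Åkerström")` is left unchanged, and `normalize("Wilhelm")` gives `"Vilhelm"`. Letters that have no Unicode decomposition, such as ø, pass through untouched and would need explicit entries of their own.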

In my webapp I have made my own normalization handling based on these rules and I store the index in an SQL database.
All fine.

But now we are going to OCR all the cards that we have already scanned and build a free-text search over all the text on the cards, not just the main entry that the card is sorted under (author or title). So I am looking at ElasticSearch to help me with this, and the features so far are awesome. I aim to replace the search engine in my webapp with ElasticSearch.
However the analyzer/normalization part raises some questions.

  1. How do I create a custom analyzer with a filter that removes accents according to these rules? Is there an API to build upon? What I need is close to the ISOLatin1AccentFilter in Lucene, but with some customisation.
  2. How do I filter according to specific rules, e.g. v=w, i=j, etc.?
  3. Nordic stemming (Swedish, Norwegian, Finnish) does not seem to be available. It is part of the Snowball classes that I saw are about to be introduced in 0.15, but only German, English and Dutch were supported there. How do I go about adding Swedish stemming in ES?

ICU is also an option, though the docs on their homepage are far from light reading. It does seem to have configurable normalization features, but the icu-plugin only handles the default formats and no custom ones. I think, though, that ICU is more than I really need.


(system) #2