Custom normalisation and filtering?

On Fri, Feb 4, 2011 at 7:53 AM, Kristian Jörg krjg@devo.se wrote:

I have spent the last day or so trying to get my head around how
normalization is done in Elasticsearch and how to customize it for my needs.

I am responsible for a web application that indexes library catalogues, i.e.
the card catalogues that all libraries had prior to the digital era. These
catalogues may be really old, spanning from 1600 to 1974, and are often
sorted according to specific rules. For instance, all accents should be
removed except on the letters that are part of the Swedish alphabet:
åäö, ÅÄÖ. For all the rest the accents are removed, e.g. é=e. Also, some
catalogues have special rules such as v=w, i=j, etc. All my indexes are
ISO-8859-1.
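As an illustration only, the folding rules described above (keep åäö/ÅÄÖ, strip accents from everything else, plus catalogue-specific equivalences such as v=w and i=j) could be sketched with the JDK's `java.text.Normalizer`. The class and method names here are hypothetical, and the direction of the v=w / i=j mappings is an assumption:

```java
import java.text.Normalizer;

public class CatalogueFolder {
    // Hypothetical helper sketching the catalogue rules: keep the Swedish
    // letters åäö/ÅÄÖ intact, fold accents everywhere else (é -> e), and
    // apply the catalogue equivalences v=w and i=j (direction assumed).
    public static String fold(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                // Swedish alphabet letters are kept as-is.
                case 'å': case 'ä': case 'ö':
                case 'Å': case 'Ä': case 'Ö':
                    out.append(c);
                    break;
                // Catalogue-specific equivalences (assumed direction).
                case 'v': out.append('w'); break;
                case 'V': out.append('W'); break;
                case 'i': out.append('j'); break;
                case 'I': out.append('J'); break;
                default:
                    // Decompose to NFD and drop combining marks, so é -> e.
                    // (Per-char handling; good enough for a sketch, not for
                    // supplementary-plane characters.)
                    out.append(Normalizer
                        .normalize(String.valueOf(c), Normalizer.Form.NFD)
                        .replaceAll("\\p{M}", ""));
            }
        }
        return out.toString();
    }
}
```

In practice this logic would live in a custom Lucene TokenFilter or char filter rather than a standalone helper.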

Hello: there are several ways you can do this in ICU: collation,
normalization, and transliteration.

But if your goal is to achieve correct sort order for a sort field,
and not for search, I would recommend using collation.
http://lucene.apache.org/java/3_0_3/api/contrib-collation/index.html

In particular, I would use the ICU variants here: you get support for
many more locales, smaller sort keys, and faster indexing.
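To see what locale-aware collation buys you here, a minimal sketch with the JDK's `java.text.Collator` (the non-ICU Lucene collation filter is built on this class; the ICU variant uses `com.ibm.icu.text.Collator` instead). The `sortSwedish` helper name is mine, not part of any API:

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class SwedishSort {
    // Sorts titles according to Swedish collation rules, where å, ä, ö
    // come after z rather than being folded into a and o.
    public static String[] sortSwedish(String[] titles) {
        Collator sv = Collator.getInstance(new Locale("sv", "SE"));
        String[] copy = titles.clone();
        Arrays.sort(copy, sv);
        return copy;
    }
}
```

For example, {"ärlig", "apa", "zon"} sorts to {"apa", "zon", "ärlig"}, because ä follows z in the Swedish alphabet; a naive binary sort on the raw strings would not give that order.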

The collation filters will normalize your text into a 'collation key'
at index time. At query time you then just sort on the field in binary
order, and results come back in language-sensitive order, much as this
is often done in databases.
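The index-time collation-key idea can be demonstrated with the JDK directly: once the key bytes are stored, a plain byte-wise comparison reproduces the language-sensitive order, with no Collator needed at sort time. The class and helper names below are illustrative only:

```java
import java.text.Collator;
import java.util.Locale;

public class CollationKeyDemo {
    // Turn a string into its collation key bytes for the given collator.
    // This is essentially what a collation key filter stores in the index.
    public static byte[] key(Collator c, String s) {
        return c.getCollationKey(s).toByteArray();
    }

    // Unsigned byte-wise comparison, i.e. plain binary order on the keys.
    public static int binaryCompare(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) return diff;
        }
        return a.length - b.length;
    }
}
```

With a Swedish collator, binaryCompare(key(sv, "zon"), key(sv, "åka")) is negative: comparing the key bytes in binary order puts "zon" before "åka", exactly as Swedish rules demand, even though 'å' would sort before 'z' in a raw codepoint comparison.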