Custom normalisation and filtering?


(Barsk) #1
I have spent the last day or so trying to get my head around how
normalization is done in ElasticSearch and how to customize it for
my needs.





I am responsible for a web application that indexes library
catalogues, i.e. the card catalogues that all libraries had before the
digital era. These catalogues can be really old, spanning
1600-1974, and are often sorted according to specific
rules. For instance, all accents should be removed, except on the
letters that are part of the Swedish alphabet: åäö, ÅÄÖ. For all
the rest the accents are removed, e.g. é=e. Also, some catalogues
have special rules such as v=w, i=j, etc. All my indexes are
ISO-8859-1.





In my webapp I have made my own normalization handling based on
these rules and I store the index in an SQL database.



All fine.





But now we are going to OCR all the cards we have
already scanned and create a free-text search on all the
text on the cards, not just the main entry that the card is sorted
under (author or title). So I am looking at Elasticsearch to help
me with this, and the features so far are awesome. I aim to replace
the search engine in my webapp with Elasticsearch.



However the analyzer/normalization part raises some questions.





1) How do I create a custom analyzer that has a filter that removes
the accents according to these rules? Is there an API to build upon?
What I need to do is close to the ISOLatin1AccentFilter in Lucene,
but with some customisation.



2) Filter according to specific rules, e.g. v=w, i=j, etc.

3) Nordic stemming (Swedish, Norwegian, Finnish) seems not to be
available. It is part of the Snowball classes that I saw are about
to be introduced in 0.15, but only German, English and Dutch were
supported there. How do I go about adding Swedish stemming to ES?





ICU is also an option, though the docs on their homepage are far from
light. It seems they have configurable normalization features. However,
the icu-plugin only handles the default formats and no custom ones.
I think, though, that ICU handling is more than I really need.

(Lukáš Vlček) #2

Hi,

as far as I can understand, it should be possible to implement your own ES plugin
with whatever filters and analyzers you need if they are missing now.
The language analyzers now present in ES are based on Lucene 3.0.3.
Note that the analysis module has undergone significant development since then
in the Lucene project; you can now find "sv", "fi" and "no" analyzers in
the Lucene 3.1-dev version (in trunk), and it is not hard to backport those new
analyzers (and stemmers) from 3.1 into ES. I did this for the Czech
stemmer some time ago. For inspiration, see my original pull request:
https://github.com/lukas-vlcek/elasticsearch/commit/0c0e43db76d0fdcfc7c711e1b0a210b1aba71b09
and Shay's cleanup:
https://github.com/elasticsearch/elasticsearch/commit/034a66263a345c29d1efe1a7fb4c25e2e0f2fb4d
The same could be done with other analyzers as well (though it is not very nice
practice, it is probably better than switching to the 3.1-dev version of Lucene
now).

If you need some customized analyzer or filter, then it might be a better idea
to implement an ES plugin for it. In this case you can look at some of
the existing plugins (ICU could be a good candidate?) and start from there.
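
For the character-level part of such a filter, the rules from the first post (strip accents except on å, ä, ö; treat v=w and i=j) can be sketched in plain Java using only `java.text.Normalizer`. This is a hedged sketch, not a Lucene token filter or an ES plugin: the class name `CatalogueFolding`, its method names, and the choice to map w to v and j to i (rather than the other way round) are assumptions made for illustration. Inside a real filter, the same `fold` logic would be applied to each token's text.

```java
import java.text.Normalizer;

/** Sketch of the character folding rules from the first post (hypothetical class). */
final class CatalogueFolding {

    // Swedish letters whose diacritics must survive: å ä ö Å Ä Ö.
    private static final String KEEP = "\u00E5\u00E4\u00F6\u00C5\u00C4\u00D6";

    /** Strips combining marks from every character except the protected Swedish letters. */
    static String stripAccents(String input) {
        // Compose first so that "a" + U+030A is recognised as the single letter å.
        String composed = Normalizer.normalize(input, Normalizer.Form.NFC);
        StringBuilder out = new StringBuilder(composed.length());
        for (int i = 0; i < composed.length(); i++) {
            char c = composed.charAt(i);
            if (KEEP.indexOf(c) >= 0) {
                out.append(c);
                continue;
            }
            // Decompose (é becomes e + U+0301) and drop the combining marks.
            String decomposed = Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFD);
            for (int j = 0; j < decomposed.length(); j++) {
                char d = decomposed.charAt(j);
                if (Character.getType(d) != Character.NON_SPACING_MARK) {
                    out.append(d);
                }
            }
        }
        return out.toString();
    }

    /** Per-catalogue equivalences; the direction (w filed as v, j filed as i) is assumed. */
    static String applyEquivalences(String input) {
        return input.replace('w', 'v').replace('W', 'V')
                    .replace('j', 'i').replace('J', 'I');
    }

    /** Accent stripping followed by the catalogue equivalences. */
    static String fold(String input) {
        return applyEquivalences(stripAccents(input));
    }

    public static void main(String[] args) {
        System.out.println(fold("Caf\u00E9 Wiktor Sj\u00F6berg"));
    }
}
```

Letters without a canonical decomposition, such as ø or æ, pass through `stripAccents` unchanged, so they would need explicit entries if a catalogue's rules cover them.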

Regards,
Lukas



#3

On Fri, Feb 4, 2011 at 7:53 AM, Kristian Jörg krjg@devo.se wrote:

I have spent the last day or so trying to get my head around how
normalization is done in ElasticSearch and how to customize it for my needs.

I am responsible for a web application that indexes library catalogues, i.e.
the card catalogues that all libraries had before the digital era. These
catalogues can be really old, spanning 1600-1974, and are often
sorted according to specific rules. For instance, all accents should be
removed, except on the letters that are part of the Swedish alphabet:
åäö, ÅÄÖ. For all the rest the accents are removed, e.g. é=e. Also, some
catalogues have special rules such as v=w, i=j, etc. All my indexes are
ISO-8859-1.

Hello: there are a number of ways you can do this in ICU: collation,
normalization, and transliteration.

But if your goal is to achieve correct sort order for a sort field,
and not for search, I would recommend using collation.
http://lucene.apache.org/java/3_0_3/api/contrib-collation/index.html

In particular, I would use the ICU variants here: you get support for
many more locales, smaller sort keys, and faster indexing performance.

The collation filters here will normalize your text into a 'collation
key' at index time for sorting, so that at runtime, you just sort on
the field in binary order and results come back in language-sensitive
order, just like how this is often done in databases.
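
The index-time/sort-time split described above can be illustrated with the JDK's own `java.text` collation classes, used here instead of ICU so that the sketch runs without extra jars (it needs Java 9+ for `Arrays.compareUnsigned`). The rule string (a to z followed by å, ä, ö) is hand-written for the example and is only an assumption about the desired order, not the real Swedish locale data; `CollationKeySort` and its method names are hypothetical.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

/** Sketch: compute collation keys once, then sort by comparing raw bytes. */
final class CollationKeySort {

    /** Builds a collator from illustrative hand-written rules: a..z, then å, ä, ö. */
    static RuleBasedCollator swedishLikeCollator() {
        StringBuilder rules = new StringBuilder();
        for (char c = 'a'; c <= 'z'; c++) {
            // "<" starts a new primary weight; "," adds the upper-case form as a tertiary variant.
            rules.append("< ").append(c).append(", ").append(Character.toUpperCase(c)).append(' ');
        }
        rules.append("< \u00E5, \u00C5 < \u00E4, \u00C4 < \u00F6, \u00D6");
        try {
            RuleBasedCollator collator = new RuleBasedCollator(rules.toString());
            // Treat precomposed and decomposed forms of å, ä, ö alike.
            collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
            return collator;
        } catch (ParseException e) {
            throw new IllegalStateException("invalid collation rules", e);
        }
    }

    /** "Index time": turn a term into its binary sort key. */
    static byte[] sortKey(Collator collator, String term) {
        return collator.getCollationKey(term).toByteArray();
    }

    /** "Sort time": order terms by their stored keys using plain unsigned byte comparison. */
    static List<String> sortByStoredKeys(Collator collator, List<String> terms) {
        List<Map.Entry<String, byte[]>> indexed = new ArrayList<>();
        for (String term : terms) {
            indexed.add(new AbstractMap.SimpleEntry<>(term, sortKey(collator, term)));
        }
        // No collator is consulted here, only the precomputed key bytes.
        indexed.sort((a, b) -> Arrays.compareUnsigned(a.getValue(), b.getValue()));
        List<String> sorted = new ArrayList<>();
        for (Map.Entry<String, byte[]> entry : indexed) {
            sorted.add(entry.getKey());
        }
        return sorted;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("\u00F6l", "zebra", "apa", "\u00E4pple", "\u00E5l");
        System.out.println(sortByStoredKeys(swedishLikeCollator(), terms));
    }
}
```

The point of the design is that the collator is only needed when a document is indexed; afterwards the stored keys can be compared byte by byte, which is what a sort on the key field amounts to.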


(Barsk) #4

Robert Muir wrote on 2011-02-04 21:39:

Hello: there are a number of ways you can do this in ICU: collation,
normalization, and transliteration.
[...]

Yes, I guess the ICU way is the correct one if we get a broader scope
for the library catalogues. For now we are only focusing on the Nordic
countries, and Sweden in particular, so ISO-8859-1 handling is enough. If
we internationalize fully, UTF-8 and the ICU support seem like a
perfect fit.
However, right now the product is limited to ISO-8859-1 in most other
respects, so having just one part (free-text search) being fully UTF-8
compliant is of limited use.

My question in particular was not which normalisation/collation package
to use, but HOW to get them into ES. The current support seems
limited and poorly documented. It looks like I have to grab the full
source and hack around? There should be a better way to "plug in"
whatever Lucene or Solr component you may need. From what I have grasped
so far from the source, most or all of the analyzers and filters are pure
Lucene stuff with some wrapper code around them. Couldn't this be done
dynamically at runtime, perhaps with the help of reflection?


#5

On Mon, Feb 7, 2011 at 2:31 AM, Kristian Jörg krjg@devo.se wrote:

However right now the product is limited to ISO-8859-1 in most other
respects so having just one part (free text search) being full UTF-8
compliant is of limited use.

I don't understand what you are saying here. All Lucene indexes are
UTF-8, and that includes yours too.


(Barsk) #6

Robert Muir wrote on 2011-02-07 10:00:

I don't understand what you are saying here. All lucene indexes are
UTF-8, that includes yours too.

Ah, right.
My point is, as I understood the ICU docs, it is specialized in handling
UTF-8 locales with regard to analysis and collation.
My needs are, for the foreseeable future, limited to the Nordic languages,
so I will only be using a very small portion of what ICU is capable
of. And if using ICU is complicated, sticking to the "normal" Lucene
stuff may well work for my needs.

So I am still looking for the best route to follow. What I think
I need to do is filter for the special library sorting rules (like i=j,
v=w, etc.) with a custom filter that I create as a plugin, and then run it
through an ordinary analyzer like Snowball with Swedish stemming. I
think I have figured out how to do it from the latest contributions to
the source (the snowball filter is part of 0.15). Collation is another
question I am investigating. What controls that? I need Swedish collation...


#7

On Mon, Feb 7, 2011 at 8:10 AM, Kristian Jörg krjg@devo.se wrote:

My point is, as I understood the ICU docs, it is specialized in handling
UTF-8 locales with regard to analyzing and collation.
[...]
Collation is another question I am
investigating. What controls that? I need swedish collation...

Collation and locales don't have anything to do with UTF-8. As for
using ICU here, it's just as easy as using the JDK support: just drop
in the extra jar file.
You can see some examples here:
http://lucene.apache.org/java/3_0_3/api/contrib-collation/org/apache/lucene/collation/package-summary.html

I wouldn't recommend trying to normalize text yourself to make your
own collation keys; I would use the built-in support.
In Lucene this is as easy as
new ICUCollationKeyAnalyzer(Collator.getInstance(new ULocale("sv")));
and then you are indexing sort keys for Swedish collation.

If your library truly does have special sorting rules, you can take an
existing collator and customize it. Here's an example:
http://wiki.apache.org/solr/UnicodeCollation#Sorting_text_with_custom_rules

But first, I would explore the built-in rules to make sure they don't
already satisfy your requirements (you can do this with ICU's locale
explorer, e.g.):
http://demo.icu-project.org/icu-bin/locexp?=sv_SE&d=en&x=col
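
As a rough illustration of "take an existing collator and customize it", the sketch below appends tailoring rules to a base collator's rule string so that w files exactly like v and j exactly like i. It uses the JDK's `java.text.RuleBasedCollator` rather than ICU so that it runs without extra jars; ICU's `com.ibm.icu.text.RuleBasedCollator` offers a similar rule-string constructor and `getRules()`. The base rules here are a hand-written stand-in for a locale collator, and the class name `CatalogueCollation`, the method names, and the specific equivalences are assumptions for the example.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

/** Sketch: tailor an existing collator with catalogue-specific equivalences. */
final class CatalogueCollation {

    /** Stand-in for a locale collator: a..z, with upper case as a tertiary variant. */
    static RuleBasedCollator baseCollator() {
        StringBuilder rules = new StringBuilder();
        for (char c = 'a'; c <= 'z'; c++) {
            rules.append("< ").append(c).append(", ").append(Character.toUpperCase(c)).append(' ');
        }
        return parse(rules.toString());
    }

    /** Appends tailoring to the base rules: "& x = y" gives y the same weight as x. */
    static RuleBasedCollator withCatalogueRules(RuleBasedCollator base) {
        String tailoring = " & v = w & V = W & i = j & I = J";
        return parse(base.getRules() + tailoring);
    }

    // Wraps the checked ParseException so callers do not have to handle it.
    private static RuleBasedCollator parse(String rules) {
        try {
            return new RuleBasedCollator(rules);
        } catch (ParseException e) {
            throw new IllegalArgumentException("invalid collation rules", e);
        }
    }

    public static void main(String[] args) {
        RuleBasedCollator base = baseCollator();
        RuleBasedCollator custom = withCatalogueRules(base);
        System.out.println("base   wasa vs vasa: " + base.compare("wasa", "vasa"));
        System.out.println("custom wasa vs vasa: " + custom.compare("wasa", "vasa"));
    }
}
```

With a collator tailored this way, the equivalences live in the sort keys themselves, so no separate text normalization step is needed for sorting.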


(Shay Banon) #8

There is the ICU plugin, which provides the ICU-level token filters that you are after: http://www.elasticsearch.org/guide/reference/index-modules/analysis/icu-plugin.html


(Barsk) #9

Robert Muir wrote on 2011-02-07 17:29:

I wouldn't recommend trying to normalize text yourself to make your
own collation keys... i would use the built in support
[...]

Thanks Robert,

a very nice answer to my questions. I also missed some of the info
in your first reply, which covers a fair bit. I was a bit preoccupied then,
I guess.
I will look into all of this in depth during the day and try to
build the index with the normalization and collation I need and see how
it goes.

One thing that is still not crystal clear to me is how to actually USE
the customized filters I need in ES. In Lucene it is kind of built in.
Here it looks like I need to build a wrapper and deploy it as a JAR. But
things will become clearer once I get started, I suppose.

Thanks again for all your splendid support

/Kristian


#10

On Tue, Feb 8, 2011 at 3:12 AM, Kristian Jörg krjg@devo.se wrote:

One thing that is still not crystal clear to me is how to actually USE the
customized filters I need in ES. In Lucene it is kind of built-in. Here it
looks like I need to build a wrapper and deploy it as a JAR. But things will
clear once I get started I suppose.

Did you see Shay Banon's response? It appears elasticsearch already
has a nice integration with these filters:

http://www.elasticsearch.org/guide/reference/index-modules/analysis/icu-plugin.html

