Configurable ASCIIFolding and CharReplace filters done

Hi all.

I've been digging around in ES lately to get it to do what I want.
Part of that is to normalize text so that diacritics and other
accents are removed.


And there are a number of ways to do this in ES. We have the
ICU_folding filter for instance. But it folds ALL diacritics without
regard to language. Likewise with the ASCIIFoldingFilter.


Btw, the UTR#30 spec that ICU_folding is based on has NOT been
approved as a standard. It may still be useful though...

I want to retain the Swedish characters åäöÅÄÖ, but fold all other
variants.


In Swedish, åäö are not variants of the letters aao; they are primary
letters that carry as much meaning as abc. For instance,

kalla and källa are two distinctly different words.
The same goes for the letters æ and ø in Norwegian and Danish.

So my solution was to modify the standard Lucene ASCIIFolding filter
to ignore a configurable set of characters. I also optimized the
filter to scan for lower-case characters first. The Lucene
implementation mixes lower case and capitals, but there are typically
more lower-case characters in a text, and most often a lowercase
filter runs first anyway.

For instance, with ignore_chars=åäö the filter will normalize:

idé -> ide
Lukáš -> Lukas
Göteborg -> Göteborg
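The core of the idea can be sketched in plain Java. This is only a toy illustration, not the actual Lucene ASCIIFoldingFilter (whose folding table covers thousands of characters); the tiny FOLD map and the class name here are made up for the example.

```java
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch of an ASCII-folding step with a configurable ignore set.
// The real filter folds thousands of characters; this toy table
// holds just a few mappings for illustration.
public class ConfigurableFold {
    private static final Map<Character, Character> FOLD = Map.of(
        'é', 'e', 'á', 'a', 'š', 's',
        'å', 'a', 'ä', 'a', 'ö', 'o',
        'Å', 'A', 'Ä', 'A', 'Ö', 'O');

    public static String fold(String input, String ignoreChars) {
        Set<Character> ignore = ignoreChars.chars()
            .mapToObj(c -> (char) c).collect(Collectors.toSet());
        StringBuilder out = new StringBuilder(input.length());
        for (char c : input.toCharArray()) {
            // Characters in the ignore set pass through unfolded.
            if (!ignore.contains(c) && FOLD.containsKey(c)) {
                out.append(FOLD.get(c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

With ignore set "åäö" this reproduces the three examples above: é and š fold away, while ö survives in Göteborg.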

I also created another filter that replaces characters. I have
several cases with library catalogues where no distinction should be
made between, for instance, 'v' and 'w'. That is, if you search for
värmland you should get hits for both Wärmland and Värmland.


The filter will normalize the following (with the setting w=v):

värmland -> värmland
wärmland -> värmland
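The replacement logic itself is simple. Below is a plain-Java sketch rather than a real Lucene TokenFilter; the class name and the flat pair encoding (mirroring the char_pairs setting) are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the character-replacement idea: pairs like [w,v, j,i]
// map the first character of each pair onto the second.
public class CharReplacer {
    private final Map<Character, Character> pairs = new HashMap<>();

    public CharReplacer(char... flatPairs) {
        for (int i = 0; i + 1 < flatPairs.length; i += 2) {
            pairs.put(flatPairs[i], flatPairs[i + 1]);
        }
    }

    public String replace(String token) {
        StringBuilder out = new StringBuilder(token.length());
        for (char c : token.toCharArray()) {
            // Characters without a mapping pass through unchanged.
            out.append(pairs.getOrDefault(c, c));
        }
        return out.toString();
    }
}
```

For example, new CharReplacer('w', 'v', 'j', 'i') maps both wärmland and värmland to the same indexed form.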

The way to use both of these filters is:

index :
    analysis :
        analyzer :
            default :
                type : custom
                tokenizer : standard
                filter : [lowercase, asciiFolding, replaceChars]
        filter :
            asciiFolding :
                type : se.devo.esfilter.ConfigurableASCIIFoldingTokenFilterFactory
                ignore_chars : åäö
            replaceChars :
                type : se.devo.esfilter.ReplaceCharTokenFilterFactory
                char_pairs : [w,v, j,i]

Is this something of general interest that should be contributed?
Or is it better that I keep it as a private plugin?
-- 
Kristian Jörg

On Wed, Feb 16, 2011 at 9:27 AM, Kristian Jörg krjg@devo.se wrote:

Hi all.

I've been digging around in ES lately to try it to do what I want. Part of
that is to normalize so that diacritics and other accents etc are removed.
And there are a number of ways to do this in ES. We have the ICU_folding
filter for instance. But it folds ALL diacritics without regard of language.
Likewise with the ASCIIFoldingFilter.
Btw, the UTR#30 spec that ICU_folding is based on has NOT been approved as a
standard by ICU. It may be useful still though...

I want to retain the swedish characters åäöÅÄÖ, but fold all other variants.
in swedish åäö is not variants of the letters aao, they are primary letters
that has as much meaning as abc. For instance
kalla and källa are two distictly different words. The same goes for the
letter æ and ø in norwegian and danish.

Right, the purpose of this is language-independent folding. It's not
any Unicode standard, just based on a nice body of work (a withdrawn
standard) for what is really a heuristic.

If you want it to not fold certain things, you should use the expert
ctor with a FilteredNormalizer2.

example:

/* the Normalizer2 instances here are immutable and can be static/thread-safe */
Normalizer2 base = Normalizer2.getInstance(
    ICUFoldingFilter.class.getResourceAsStream("utr30.nrm"),
    "utr30", Normalizer2.Mode.COMPOSE);
UnicodeSet filter = new UnicodeSet("[^åäöÅÄÖ]");
filter.freeze();
Normalizer2 filtered = new FilteredNormalizer2(base, filter);
TokenStream stream = new ICUNormalizer2Filter(tokenizer, filtered); // wrap the preceding tokenizer
...

Robert Muir skrev 2011-02-16 16:40:

On Wed, Feb 16, 2011 at 9:27 AM, Kristian Jörgkrjg@devo.se wrote:

Hi all.

I've been digging around in ES lately to try it to do what I want. Part of
that is to normalize so that diacritics and other accents etc are removed.
And there are a number of ways to do this in ES. We have the ICU_folding
filter for instance. But it folds ALL diacritics without regard of language.
Likewise with the ASCIIFoldingFilter.
Btw, the UTR#30 spec that ICU_folding is based on has NOT been approved as a
standard by ICU. It may be useful still though...

I want to retain the swedish characters åäöÅÄÖ, but fold all other variants.
in swedish åäö is not variants of the letters aao, they are primary letters
that has as much meaning as abc. For instance
kalla and källa are two distictly different words. The same goes for the
letter æ and ø in norwegian and danish.

Right, the purpose of this is language-independent folding. its not
any unicode standard, just based off a nice set of work (withdrawn
standard) for what is just a heuristic.

If you want it to not fold certain things, you should use the expert
ctor with a FilteredNormalizer2.

example:

/* the normalizer2s here are immutable and can be static/thread-safe */
Normalizer2 base = Normalizer2.getInstance(
ICUFoldingFilter.class.getResourceAsStream("utr30.nrm"),
"utr30", Normalizer2.Mode.COMPOSE);
UnicodeSet filter = new UnicodeSet("[^åäöÅÄÖ]");
filter.freeze();
Normalizer2 filtered = new FilteredNormalizer2(base, filter);
TokenStream stream = new ICUNormalizer2Filter(tokenizer, filtered);
...
Nice. I had no idea one could do that. But I still need to write an
"ES wrapper" for this, right? There is no built-in support for this
out of the box, i.e. a specialized version of the ICU normalizer class?

I found the source files for the UTR30 binary (utr30.nrm). One approach
would be to build a special version of utr30.nrm for my (and others')
special needs. But the process of creating the binary is not documented,
and the gennorm2 tool does not work on the data files. I am not sure this
approach is better than the one you mentioned, though.

About the "ReplaceCharFilter". Is there a standard approach to this as
well that could be used?

/Kristian

On Wed, Feb 16, 2011 at 12:12 PM, Kristian Jörg krjg@devo.se wrote:

I found the source files for the UTR30 binary (utr30.nrm). One approach
would be to build a special version of utr30.nrm for my (and others) special
needs. But the process of creating the binary is not documented. The
gennorm2 tool does not work for the datafiles. I am not sure this approach
is better than the one you mentioned though.

It might be. If you check out the Lucene source, you can see in
contrib/icu/build.xml that there is a gennorm2 task that regenerates
this file. So you can install icu4c and run this task to make your own
.nrm file, by editing the existing source text files etc.

Here's the build.xml for reference:
http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/lucene/contrib/icu/build.xml

But this is really advanced use, and you might want to try an easier
route if you want more flexibility or don't need the absolute fastest
indexing performance; see below:

About the "ReplaceCharFilter". Is there a standard approach to this as well
that could be used?

Additionally there is ICUTransformFilter (I am not sure if it's
integrated into Elasticsearch, but maybe you could make your own
factory).
http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/lucene/contrib/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformFilter.java

In this case, you can use either a built-in rule set like
"Traditional-Simplified" or define your own as a string (read in from
a file or whatever).
There is a rule tutorial linked in the javadocs header. In general
the rules are much more flexible, but the processing may be slower if
you have a ton of rules.

for example your rules could have something like:
w > v;
sch > sh;
ss > z;

and you can do things like define variables:
$vowel = [aieou];

and context (this rule converts a vowel to z only when it is followed by x):

$vowel } x > z;
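The flavor of simple literal rules like "w > v;" and "sch > sh;" can be imitated in a few lines of plain Java. This is a toy applier only, nothing like ICU's real rule engine (no variables, no context, no filters); rules are tried in insertion order at each position, so longer left-hand sides should be listed first.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy applier for literal context-free transform rules.
// At each position, the first rule whose left-hand side matches is
// applied; otherwise the character is copied through unchanged.
public class MiniRules {
    public static String apply(String text, Map<String, String> rules) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        outer:
        while (i < text.length()) {
            for (Map.Entry<String, String> rule : rules.entrySet()) {
                if (text.startsWith(rule.getKey(), i)) {
                    out.append(rule.getValue());
                    i += rule.getKey().length();
                    continue outer;
                }
            }
            out.append(text.charAt(i++));
        }
        return out.toString();
    }
}
```

Using a LinkedHashMap keeps the rules in the order they were declared, which stands in for the rule ordering a real rule file would impose.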

Robert Muir skrev 2011-02-16 18:53:

On Wed, Feb 16, 2011 at 12:12 PM, Kristian Jörgkrjg@devo.se wrote:

I found the source files for the UTR30 binary (utr30.nrm). One approach
would be to build a special version of utr30.nrm for my (and others) special
needs. But the process of creating the binary is not documented. The
gennorm2 tool does not work for the datafiles. I am not sure this approach
is better than the one you mentioned though.
it might be better, if you checkout the lucene source, you can see in
the contrib/icu/build.xml there is a gennorm2 task that regenerates
this file.
so you can install icu4c, and run this task, to make your own .nrm
file... by editing the existing source text files etc.

here's the build.xml for reference:
http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/lucene/contrib/icu/build.xml

but, this is a really advanced use and you might want to try an easier
route if you want more flexibility, or don't need the absolute fastest
indexing performance, see below:
Yes, I will stick with my current solution, or the one you previously
suggested with a filter, for now. I really need to get my main project going.
But having a language-dependent ASCII folding method is crucial for my
coming projects. As long as the stuff I need to index is Swedish, I
cannot see how language-independent folding is of any use. It will
cause a lot of false hits if used as-is. It would be interesting to
understand the reasoning behind why the normalization and folding
features in ICU do not consider language differences. It is as vital
as locale-dependent collation IMHO.

Creating a swe-utr30.nrm should fix this for me, but for everybody else?
I must be missing something...

About the "ReplaceCharFilter". Is there a standard approach to this as well
that could be used?
Additionally there is ICUTransformFilter (i am not sure if its
integrated into Elasticsearch, but maybe you could make your own
factory).
http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/lucene/contrib/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformFilter.java

In this case, you can use either a built in rule-set like
"Traditional-Simplified" or define your own as a string (read in from
a file or whatever).
There is a rule tutorial linked in the javadocs header.. in general
the rules are much more flexible but maybe the processing is slower if
you
have a ton of rules.

for example your rules could have something like:
w > v;
sch > sh;
ss > z;

and you can use things like define variables
$vowel = [aieou];

and context (only convert x to a z, if followed by a vowel)

$vowel } x > z;
Ah, great! That looks really powerful indeed. I found the full docs
in the ICU User Guide.
It would for sure make a great contribution to the ES ICU plugin to have
that factory class. I might look into it.

Thanks!

/Kristian

On Wed, Feb 16, 2011 at 3:37 PM, Kristian Jörg krjg@devo.se wrote:

But having a language dependent ascii folding method is cruical for my
coming projects. As long as the stuff I need to index is swedish I cannot
see how a language independent folding is of any use. It will cause a lot
of false hits if used as-is. It would be interesting to understand the
reasoning behind why the normalization and folding features in ICU are not
considering language differences. It is as vital as locale dependent
collation IMHO.

This is intentional; these are all approximate term conflation
techniques, which have shown good results even for your example,
Swedish [1].

You should always consider your users and what you are searching. For
example, it seems it might be the case that the "rule" you speak of,
such as the substitution of v and w, has been considered obsolete by
Swedish language bodies since 2006? It appears these are now considered
distinct characters? [2]

Perhaps this makes sense for your special case, but not for everyone;
this is why defaults are defaults and you can adjust them to your
needs :-)

  1. http://www.eecs.qmul.ac.uk/~christof/html/publications/inrt142.pdf
  2. http://www.svenskaakademien.se/web/Svenska_Akademiens_ordlista.aspx

Robert Muir skrev 2011-02-16 22:26:

On Wed, Feb 16, 2011 at 3:37 PM, Kristian Jörgkrjg@devo.se wrote:

But having a language dependent ascii folding method is cruical for my
coming projects. As long as the stuff I need to index is swedish I cannot
see how a language independent folding is of any use. It will cause a lot
of false hits if used as-is. It would be interesting to understand the
reasoning behind why the normalization and folding features in ICU are not
considering language differences. It is as vital as locale dependent
collation IMHO.

This is intentional, these are all approximate term conflation
techniques, which have shown good results even for your example
swedish[1].

Yes, for an international audience served by an international index this
is the way to go. But for a Swedish audience served by a Swedish index it
is all wrong. No Swedish user will mix up åäö in Swedish search terms,
just as little as you would mix up v and w in English contexts. aao and
åäö are just as distinctly different characters as abc is for you. So
transforming all åäö to aao is going to produce a lot of false hits. You
won't MISS anything, but you will also collect a lot of garbage hits.
What I would like to see is a formal, standardized way to fold all
diacritics and common characters that are not primary in a language of
choice, i.e. a UTR#30 folding mechanism that is language dependent, just
as collation is language dependent in a standardized way.
I think this would be hugely useful for any language-specific index.
Of course, with the methods I have developed, or the more standard ways
(filtering) you mentioned, there is a workaround for this. But it is awkward.

You should always consider your users and what you are searching. For
example, it seems it might be the case that the "rule" you speak of
such as substitution of v and w is considered obselete by Swedish
language bodies since 2006? It appears these are now considered
distinct characters? [2]
Yes. In modern Swedish these letters are distinct. But in the old
library catalogues I am indexing, dating from the 1600s onward, there
are all kinds of funny little rules. There was no standard in those
times; even catalogues from the same era have different sorting rules.
So in my "framework" for handling all of these catalogues I need to be
able to easily configure which rules to index by.

Perhaps this makes sense for your special case, but not for everyone,
this is why defaults are defaults and you can adjust them to your
needs :-)

  1. http://www.eecs.qmul.ac.uk/~christof/html/publications/inrt142.pdf
  2. http://www.svenskaakademien.se/web/Svenska_Akademiens_ordlista.aspx
Yes. And ES is a nice companion to work with. However, the features you
have mentioned, for instance the ICUTransformFilter, are not integrated,
so there is a bit of implementation work to cope with before they are
available.

Hi all!

I finally went ahead and implemented the proposed code from Robert Muir
below. I added some code to the already existing icu_folding filter so
that it can now optionally handle a unicodeSetFilter param. The code is
committed and documented with two pull requests on GitHub. It is up to
Shay to accept the code if he finds it appropriate.

What is it good for? Say you have a national index that you want to be
searchable by a local audience, i.e. people who understand a local
alphabet. The Swedish alphabet is composed of the same characters as
the English one, plus åäö. These are primary letters just like abc, not
variants, and as such one wants to be able to search for them
distinctly. One solution would be to use the asciiFoldingFilter or
icu_folding, which removes ALL the diacritics, leaving "åäö" as "aao".
Searches on an index with that normalization would find what you look
for, but they would also include a lot of false hits.
This example exempts åäö from folding (YAML):

index :
    analysis :
        analyzer :
            textAnalyzer :
                type : custom
                tokenizer : standard
                filter : [standard, myFolding, lowercase]
        filter :
            myFolding :
                type : icu_folding
                unicodeSetFilter : "[^åäöÅÄÖ]"

Kristian Jörg skrev 2011-02-16 18:12:

Robert Muir skrev 2011-02-16 16:40:

On Wed, Feb 16, 2011 at 9:27 AM, Kristian Jörgkrjg@devo.se wrote:

Hi all.

I've been digging around in ES lately to try it to do what I want.
Part of
that is to normalize so that diacritics and other accents etc are
removed.
And there are a number of ways to do this in ES. We have the
ICU_folding
filter for instance. But it folds ALL diacritics without regard of
language.
Likewise with the ASCIIFoldingFilter.
Btw, the UTR#30 spec that ICU_folding is based on has NOT been
approved as a
standard by ICU. It may be useful still though...

I want to retain the swedish characters åäöÅÄÖ, but fold all other
variants.
in swedish åäö is not variants of the letters aao, they are primary
letters
that has as much meaning as abc. For instance
kalla and källa are two distictly different words. The same goes for
the
letter æ and ø in norwegian and danish.

Right, the purpose of this is language-independent folding. its not
any unicode standard, just based off a nice set of work (withdrawn
standard) for what is just a heuristic.

If you want it to not fold certain things, you should use the expert
ctor with a FilteredNormalizer2.

example:

/* the normalizer2s here are immutable and can be static/thread-safe */
Normalizer2 base = Normalizer2.getInstance(
ICUFoldingFilter.class.getResourceAsStream("utr30.nrm"),
"utr30", Normalizer2.Mode.COMPOSE);
UnicodeSet filter = new UnicodeSet("[^åäöÅÄÖ]");
filter.freeze();
Normalizer2 filtered = new FilteredNormalizer2(base, filter);
TokenStream stream = new ICUNormalizer2Filter(tokenizer, filtered);
...