multi_field and sort

Jesus_Lopes · November 30, 2011, 4:17pm

Hi,

I'm starting to use Elasticsearch. During my tests, I want to sort the
results by name. Some names have accents (like Éder) and may contain
other special characters.

Below is the gist with my test:

https://gist.github.com/jtadeulopes/1409591

elastic.rb

# encoding: utf-8
require "tire"

users = [
  { :id => '1', :type => 'user', :name => 'Jesus Lopes',  :email => 'jl@zigotto.com' },
  { :id => '2', :type => 'user', :name => 'Alfredo',      :email => 'alfredo@email.com' },
  { :id => '3', :type => 'user', :name => 'Éder Costa',   :email => 'ec@zigotto.com' }
]

Tire.index("users") do

(file truncated; see the gist for the rest)

elastic_multi_field.rb

# encoding: utf-8
require "tire"

users = [
  { :id => '1', :type => 'user', :name => 'Jesus Lopes',  :email => 'jl@zigotto.com' },
  { :id => '2', :type => 'user', :name => 'Alfredo',      :email => 'alfredo@email.com' },
  { :id => '3', :type => 'user', :name => 'Éder Costa',   :email => 'ec@zigotto.com' }
]

Tire.index("users") do

(file truncated; see the gist for the rest)

elasticsearch.yml

index:
  analysis:
    analyzer:
      default:
        type: brazilian

Am I using multi_field correctly? What is the best method for building
this sort?

We have the same problem using Solr.

Thank you!

Jésus Lopes

Clinton_Gormley · November 30, 2011, 5:06pm

Hi Robert

Pinging you personally because you are the author of the ICU code in
Lucene, and hoping you can shed some light on how to use it in
Elasticsearch.

ICU normalization, case folding and collation filters are available
through the ICU plugin:

But Unicode being the black art that it is, I'm unclear as to what
exactly is required to solve the issue mentioned below.

I'm starting to use elasticsearch. During my tests, I want to sort the
results by name. Some names have accents (like Éder) and may have
other characters.

Below is the gist with my test:

elasticsearch sort · GitHub

Some questions:

Should the tokenizer always be 'keyword' when using the ICU plugin,
as per the examples given on the page the above link points to?
Some of the issues in JIRA talk about using the whitespace tokenizer
instead.

When would you use normalization with nfkc_cf and when would you use
the case-folding filter?

Given that sorting can only be done on fields with single terms,
presumably it'd only be useful to use the collation filter by
itself, with the 'keyword' tokenizer. Or would it be feasible
to combine with the case folding filter?

How would you combine these filters with language stemmers?

How would the collations work if each document can have its own
language?

...plus other stuff I haven't thought of

If you were able to produce an example, it'd be greatly appreciated.

many thanks

Clint

rmuir · December 12, 2011, 12:16pm

I totally missed this message the first time around.

I think in general the confusion is about "collation" versus
"everything else"? The rest of the stuff in the icu package is
"normal" analysis components, but collation is weird to think about
because you are building binary sort keys instead.
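
To make the "sort key" idea concrete, here is a minimal Ruby sketch using the names from the gist. It is not ICU collation: `sort_key` is a hypothetical helper that strips combining marks and case-folds with Ruby's standard library only, as a rough stand-in for the binary key a real collator would build.

```ruby
# encoding: utf-8

# "\u00C9" is a precomposed "É", written as an escape so the bytes are unambiguous.
NAMES = ["Jesus Lopes", "Alfredo", "\u00C9der Costa"]

# Hypothetical stand-in for a collation key (NOT what ICU produces):
# decompose, drop combining marks (accents), then case-fold.
def sort_key(name)
  name.unicode_normalize(:nfd).gsub(/\p{Mn}/, "").downcase(:fold)
end

# Plain String comparison is byte-wise, so "É" (0xC3 0x89 in UTF-8)
# sorts after every ASCII letter, including "J".
naive_order = NAMES.sort

# Sorting by the derived key places "Éder" between "Alfredo" and "Jesus".
keyed_order = NAMES.sort_by { |name| sort_key(name) }

puts naive_order.inspect
puts keyed_order.inspect
```

The point is only that the value being compared at sort time is a derived key, not the original text. A real collator also encodes locale-specific ordering rules, which this sketch ignores.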

On Wed, Nov 30, 2011 at 12:06 PM, Clinton Gormley clint@traveljury.com wrote:

should the tokenizer always be 'keyword' when using the ICU plugin,
as per the examples given on the page the above link points to?

It depends what you are doing. If you are using Collation, definitely,
because you are building a sort key.
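
As a sketch of how this could look for the original question, the Ruby hash below describes index settings and a multi_field mapping, printed as JSON. Several things here are assumptions rather than something confirmed in this thread: that the ICU plugin exposes a token filter of type `icu_collation` with a `language` option (check the plugin README for your version), and the names `name_sort`, `name_collation` and the `sort` sub-field, which are hypothetical.

```ruby
require "json"

# Hypothetical names: "name_sort" (analyzer), "name_collation" (filter),
# "sort" (sub-field). Option names are assumptions; verify them against
# the ICU plugin documentation for your Elasticsearch version.
INDEX_BODY = {
  settings: {
    index: {
      analysis: {
        analyzer: {
          name_sort: {
            type: "custom",
            tokenizer: "keyword",        # whole value as one token, so it stays sortable
            filter: ["name_collation"]   # turn that token into a collation key
          }
        },
        filter: {
          name_collation: { type: "icu_collation", language: "pt" }
        }
      }
    }
  },
  mappings: {
    user: {
      properties: {
        name: {
          type: "multi_field",
          fields: {
            name: { type: "string" },                        # analyzed, for searching
            sort: { type: "string", analyzer: "name_sort" }  # collation key, for sorting
          }
        }
      }
    }
  }
}

# Search body that sorts on the collation sub-field.
SORT_BODY = { sort: [{ "name.sort" => "asc" }] }

puts JSON.pretty_generate(INDEX_BODY)
puts JSON.generate(SORT_BODY)
```

Keeping the searchable field and the sort field separate means the search analyzer can tokenize and stem freely while the sort field holds exactly one key per document.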

When would you use normalization with nfkc_cf and when would you use
the case-folding filter?

When you say case-folding filter, maybe you are referring to ICUFoldingFilter?

This is not really a case-folding filter. It does case-fold, but it
also does other stuff, and nfkc_cf case-folds too.
It is just nfkc_cf PLUS additional stuff.

There is a long list of what this additional stuff actually is here:
http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUFoldingFilter.java,
but in general it's stuff like "removing accents".

In general, you would use nfkc_cf as a replacement for LowerCaseFilter
in your application. In most applications that will be displaying text
to the user, nfkc_cf would be considered pretty aggressive: for
example, it reduces full-width numbers to ASCII numbers, and things
like that. But for search, this kind of stuff isn't aggressive; it's
actually pretty conservative.

If you felt like it wasn't doing enough (e.g. you want to remove
accents too), then you could change to the ICUFoldingFilter for even
more aggressive normalization. Think of the ICUFoldingFilter as a
replacement for both LowerCaseFilter and ASCIIFoldingFilter.
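
The difference between the two levels can be approximated in plain Ruby. This is only a sketch: `approx_nfkc_cf` and `approx_icu_folding` are hypothetical helpers built from the standard library, and the real nfkc_cf mapping and ICUFoldingFilter cover many more cases than these two functions do.

```ruby
# encoding: utf-8

# Rough approximation of nfkc_cf: compatibility-compose (NFKC),
# then apply Unicode case folding. The real mapping differs in edge cases.
def approx_nfkc_cf(text)
  text.unicode_normalize(:nfkc).downcase(:fold)
end

# Rough approximation of the extra "remove accents" step of the folding
# filter: decompose and drop combining marks. The real filter does more.
def approx_icu_folding(text)
  approx_nfkc_cf(text).unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

full_width = "\uFF11\uFF12\uFF13"  # full-width digits one, two, three
name       = "\u00C9der"           # "Éder" with a precomposed É

puts approx_nfkc_cf(full_width)    # full-width digits become ASCII digits
puts approx_nfkc_cf(name)          # lower-cased, accent kept
puts approx_icu_folding(name)      # lower-cased, accent removed
```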

Given that sorting can only be done on fields with single terms,
presumably it'd only be useful to use the collation filter by
itself, with the 'keyword' tokenizer. Or would it be feasible
to combine with the case folding filter?

You could case-fold before collation, but there is no need to do this.
It's also dangerous (depending upon what your collation rules are),
because some locales have special case rules (but nfkc_cf is
"generic").

Instead, set properties like strength on the collator to determine if
it should build case-insensitive sort keys (and let the collator
handle it totally).
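
As a rough illustration of what "strength" controls, here is a toy model in Ruby. It is not ICU's algorithm and ignores locale rules entirely; `toy_collation_key` is a hypothetical helper that only mimics the idea that lower strengths ignore more differences (accents, then case) when building the key.

```ruby
# encoding: utf-8

# Toy model of collation "strength" (NOT ICU's algorithm):
#   :primary   compares base letters only
#   :secondary also distinguishes accents
#   :tertiary  also distinguishes case
def toy_collation_key(text, strength: :tertiary)
  decomposed = text.unicode_normalize(:nfd)
  base       = decomposed.gsub(/\p{Mn}/, "").downcase(:fold)  # letters only
  accented   = decomposed.downcase(:fold)                     # letters plus accents
  levels     = [base, accented, decomposed]
  depth      = { primary: 1, secondary: 2, tertiary: 3 }.fetch(strength)
  levels.first(depth)
end

names = ["eder", "Eder", "\u00C9der"]  # the last one is "Éder"

[:primary, :secondary, :tertiary].each do |strength|
  distinct = names.map { |n| toy_collation_key(n, strength: strength) }.uniq.size
  puts "#{strength}: #{distinct} distinct key(s)"
end
```

With a primary-strength key, all three spellings compare as equal, which is the case-insensitive and accent-insensitive behaviour one would otherwise try to get by folding before collation.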

How would you combine these filters with language stemmers?

Which filters? I don't think stemming makes sense for sort keys, so I
don't see it combined with collation.

For the tokenizer, normalization filter, etc., just swap them into
your existing chain if you want to try them out.
Instead of StandardTokenizer + LowerCaseFilter + EnglishStemmer, you
could use ICUTokenizer + ICUNormalizer2Filter + EnglishStemmer.

How would the collations work if each document can have its own
language?

That sounds more like an application problem... The question is
really, how do you want the sort to work? Whose rules do you want it
to use?

If you absolutely MUST sort across different languages and just want
the best sort that makes the fewest people angry, then use the root
locale (this is the empty string).

...plus other stuff I haven't thought of

If you were able to produce an example, it'd be greatly appreciated.

For collation or for using all the icu filters in general?

For collation we have these examples:
http://lucene.apache.org/java/3_5_0/api/all/org/apache/lucene/collation/package-summary.html
http://wiki.apache.org/solr/UnicodeCollation
http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/contrib/analysis-extras/src/test/org/apache/solr/analysis/TestICUCollationKeyFilterFactory.java

The ICU documentation is really good
(ICU User Guide | ICU Documentation), and they have a cool demo
that lets you investigate sort rules for different locales and
different options visually online. Go to
http://demo.icu-project.org/icu-bin/locexp, pick a locale and hit the
demo button under Collation rules.

For the tokenizer, it's just a drop-in replacement for
StandardTokenizer. The Normalizer2Filter with nfkc_cf can be seen as a
replacement for LowerCaseFilter. The FoldingFilter can be seen as a
replacement for ASCIIFoldingFilter, etc.

jprante · December 12, 2011, 3:30pm

Hi,

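If you have been missing the ICU tokenizer and the ICU normalizer in the
elasticsearch-analysis-icu plugin, they have just been added (thanks,
Shay!):

Merge pull request #1 from jprante/master · elastic/elasticsearch-analysis-icu@44f9a2c · GitHub

For the record, an example usage would be: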

{
  "index" : {
    "analysis" : {
      "analyzer" : {
        "default" : {
          "type" : "custom",
          "tokenizer" : "icu_tokenizer",
          "filter" : [ "snowball", "icu_folding" ]
        }
      },
      "filter" : {
        "snowball" : {
          "type" : "snowball",
          "language" : "German2"
        }
      }
    }
  }
}

This is for search in a German library catalog, with dozens of languages
including CJK, with the "umlaut folding" feature.

Jörg

project2501 · December 12, 2011, 5:07pm

Does sorting work for multi valued fields? The Solr people say no, so
I assumed it was a Lucene thing altogether.
Just tossing that tidbit into this.

jprante · December 13, 2011, 12:00am

You will get an error if you try to sort on multi valued fields, like
this:

Caused by: java.io.IOException: Can't sort on string types with more
than one value per doc, or more than one token per field at
org.elasticsearch.index.field.data.strings.StringOrdValFieldDataComparator.setNextReader(StringOrdValFieldDataComparator.java:119)
[...]
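
The "more than one token per field" part of that message is why a sort field is normally paired with the keyword tokenizer. A simplified Ruby sketch of the difference follows; `whitespace_tokens`, `keyword_tokens` and `sortable?` are hypothetical stand-ins, not the Lucene implementations.

```ruby
# encoding: utf-8

# Simplified stand-ins for two tokenizers (not the Lucene implementations).
def whitespace_tokens(value)
  value.split   # one token per whitespace-separated word
end

def keyword_tokens(value)
  [value]       # the whole value as a single token
end

# Per the error above, a string field can be sorted on only when each
# document yields exactly one token for it.
def sortable?(tokens)
  tokens.size == 1
end

name = "\u00C9der Costa"  # "Éder Costa"

puts sortable?(whitespace_tokens(name))  # two tokens, so sorting would fail
puts sortable?(keyword_tokens(name))     # one token, so sorting is possible
```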

Jörg

kimchy · December 13, 2011, 1:30pm

I've just released version 1.1.0 with those changes in ICU:
GitHub - elastic/elasticsearch-analysis-icu: ICU Analysis plugin for Elasticsearch.
It (like all the migrated plugins) can be installed on 0.18.x.

Clinton_Gormley · December 15, 2011, 10:00am

Hiya Robert

On Mon, 2011-12-12 at 07:16 -0500, Robert Muir wrote:

I totally missed this message the first time around.

Many thanks for the informative post and links.

As soon as I get a moment, I'm going to play around with ICU and post a
tutorial with examples demonstrating how to use it with ES.

clint
