How to sort multi-token strings, plus a distinct feature and UTF-8 support?

Pascal_Pensa · December 12, 2011, 7:35am

Hi,

I'm new to ES and I'm trying to sort on strings. I always get an error
if the string contains more than one word.

Another question is about dynamic deduplication/distinct/unique based
on a field. I've searched the ES wiki and this group without success.
Does ES provide a "unique" feature, or something equivalent, that
removes duplicate answers for a given field? Something like:

..?q=some+key+words&unique=reference

which would remove any duplicates from the result set based on the
"reference" tag.

And the last one: my test data contains accents. I tried various
analyzer configurations, installed the ICU plugin and set it as a
filter, and set the language to French, but it seems accents are not
removed from the tokenized items.

I'm currently using sphinxsearch, where accents need to be manually
table-mapped in the configuration file. Is there an equivalent in ES?

Thanks !
Pascal

Karussell1 · December 12, 2011, 9:02am

Hi

I'm new to ES and I'm trying to sort on strings. I always get an error
if the string contains more than one word.

You'll need to index them with the keyword analyzer.
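
As a rough sketch of what that could look like (not a definitive recipe): the snippet below builds a mapping in which the field stays analyzed for search and also gets a sub-field indexed with the keyword analyzer, so the whole string is one token and can be sorted on. The type name "product", the field "title" and the sub-field name "sortable" are made up for illustration, and the multi_field syntax is assumed to match ES versions of this era.

```python
import json


def sortable_string_mapping(type_name, field):
    """Build a mapping where `field` is searchable and also sortable as one token."""
    return {
        type_name: {
            "properties": {
                field: {
                    "type": "multi_field",
                    "fields": {
                        # Analyzed as usual, for full-text search.
                        field: {"type": "string"},
                        # Whole value kept as a single token, for sorting.
                        "sortable": {"type": "string", "analyzer": "keyword"},
                    },
                }
            }
        }
    }


def sorted_search_body(query_text, field):
    """Build a search request body that sorts on the single-token sub-field."""
    return {
        "query": {"query_string": {"query": query_text}},
        "sort": [{field + ".sortable": "asc"}],
    }


mapping = sortable_string_mapping("product", "title")
body = sorted_search_body("some key words", "title")
print(json.dumps(mapping, indent=2))
print(json.dumps(body, indent=2))
```

The mapping would be sent when creating the index (or via the put-mapping API) before any documents are indexed; existing documents would need to be reindexed.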

Another question is about dynamic deduplication/distinct/unique based
on a field

Issue 256, regarding a group-by feature, is not yet implemented. You'll
need to do it on the client side.
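
A minimal sketch of client-side deduplication, assuming the hits arrive already sorted and each hit carries its document under "_source" (as in a search response). The field name "reference" comes from the question; the sample data is invented.

```python
def dedupe_hits(hits, field):
    """Keep only the first hit for each distinct value of `field`.

    Order is preserved, so with sorted results the best-ranked duplicate wins.
    Hits that lack the field are kept as-is.
    """
    seen = set()
    unique = []
    for hit in hits:
        key = hit.get("_source", {}).get(field)
        if key is None:
            unique.append(hit)
            continue
        if key in seen:
            continue
        seen.add(key)
        unique.append(hit)
    return unique


hits = [
    {"_id": "1", "_source": {"reference": "A", "portal": "video"}},
    {"_id": "2", "_source": {"reference": "B", "portal": "shop"}},
    {"_id": "3", "_source": {"reference": "A", "portal": "shop"}},
]
unique_hits = dedupe_hits(hits, "reference")
print([h["_id"] for h in unique_hits])  # prints ['1', '2']
```

Note that pagination then has to be done on the deduplicated list, since the engine's own hit count still includes the duplicates.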

And the last one: my test data contains accents. I tried various
analyzer configurations, installed the ICU plugin and set it as a
filter, and set the language to French, but it seems accents are not
removed from the tokenized items.

Did you try the custom rules of the ICU plugin? I read an article
saying it should somehow be possible ... I'll check.

Regards,
Peter.

Karussell1 · December 12, 2011, 9:54am

Hi

Did you try the custom rules of the ICU plugin? I read an article
saying it should somehow be possible ... I'll check.

Hmm, strange: for German umlauts it is done in the filter:

http://web.archiveorange.com/archive/v/xJxT8VzgTaUwuBXnP9gJ

but not for French, I think:

http://grepcode.com/file/repository.grepcode.com/java/eclipse.org/3.7/org.apache.lucene/analysis/2.9.1/org/apache/lucene/analysis/fr/FrenchStemmer.java

I fear you will have to patch it or include another stemmer. Ah, but I
see you already asked in the right place (the French Elasticsearch
group).

Regards,
Peter.

Karussell1 · December 12, 2011, 10:10am

This stemmer should do the work:

https://github.com/apache/lucene-solr/blob/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchLightStemmer.java#L230

but you'll need to include it (and raise an issue?) like I did with a
custom filter:

https://github.com/karussell/Jetwick/blob/master/es/config/elasticsearch.json

{    
    "network" : {
        "host" : "127.0.0.1"
    },
    "index" : {
        "//provided via API number_of_shards": 4,
        "//number_of_replicas": 1,
        "//refresh_interval" : "20s",        
        "analysis" : {
            "//standard tokenizer removes all punctuation chars so avoid it to have # and @":"comment",
            "analyzer" : {                
                "index_analyzer" : {                    
                    "tokenizer" : "whitespace",
                    "filter" : ["jetwickfilter", "lowercase", "snowball"]
                },
                "search_analyzer" : {                                                    
                    "tokenizer" : "whitespace",
                    "filter" : ["jetwickfilter", "lowercase", "snowball"]
                }
            },

This file has been truncated.

using this filter&factory and setting the stemmer.

https://github.com/apache/lucene-solr/blob/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchStemFilter.java
https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/index/analysis/FrenchStemTokenFilterFactory.java

Something like

public class MyFrenchStemTokenFilterFactory extends FrenchStemTokenFilterFactory {

    // The constructor must carry the subclass name, not the parent's.
    @Inject
    public MyFrenchStemTokenFilterFactory(Index index, @IndexSettings Settings indexSettings,
            @Assisted String name, @Assisted Settings settings) {
        super(index, indexSettings, name, settings);
    }

    @Override
    public TokenStream create(TokenStream tokenStream) {
        // FrenchStemFilter.setStemmer() only accepts a FrenchStemmer, so wrap the
        // stream in Lucene's FrenchLightStemFilter, which applies FrenchLightStemmer.
        // Note: unlike the parent factory, this ignores the configured exclusions.
        return new FrenchLightStemFilter(tokenStream);
    }
}

Peter.

Karussell1 · December 12, 2011, 10:13am

OK, sorry for bumping this once more: you should be able to simply use:

light_french

as the filter. I found it in the code:

https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/index/analysis/StemmerTokenFilterFactory.java
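
For illustration only, here is how that might be wired into index settings. This assumes the stemmer token filter takes the stemmer name through a "name" setting, which is how I read the factory above; the filter name "french_light_stem" and analyzer name "french_light" are invented.

```python
import json


def french_analysis_settings(analyzer_name="french_light"):
    """Build index settings with a custom analyzer using the light French stemmer."""
    return {
        "index": {
            "analysis": {
                "filter": {
                    # A "stemmer" filter configured to use the light French stemmer.
                    "french_light_stem": {"type": "stemmer", "name": "light_french"}
                },
                "analyzer": {
                    analyzer_name: {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "french_light_stem"],
                    }
                },
            }
        }
    }


settings = french_analysis_settings()
print(json.dumps(settings, indent=2))
```

The printed JSON would go into the index creation request (or the node config file), and the analyzer name is then referenced from the field mapping.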

Regards,
Peter.


Pascal_Pensa · December 13, 2011, 6:45am

Thanks, I'll try your suggestions and give feedback.

As for uniqueness (or grouping, as it is called in sphinxsearch): we
need it because we have products and videos duplicated across various
sub-catalogs and categories.
We don't merge similar entries, as each has its own keywords, target
URL and so on, depending on the portal it belongs to, and a search can
run within a given portal or across portals.
In a cross-universe search only one result is displayed, sorted by
various factors (freshness, relevance, ...); other similar results are
discarded. Today everything is done by the search engine.

Discarding data on the client side is complicated: we have to fetch
many results to build the navigation bar by removing duplicates and
counting. Imagine we have thousands of results; it would be a pain to
paginate them. We're using Ajax and we prefer search-engine-powered
pagination to limit data transfer and platform load.

Pascal

Karussell1 · December 13, 2011, 10:48am

Not sure I completely followed your use case, but IMO one option in
your case would be to index only one product with an array for the URLs
and categories, and decide (via a middle layer or the client) which one
to display.
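
A small sketch of that idea, done at indexing time: duplicated entries are merged into one document per reference, with the per-portal details kept in an array. All field names here ("reference", "title", "portal", "category", "url", "placements") are hypothetical; adapt them to the real schema.

```python
def merge_by_reference(records):
    """Merge duplicated entries into one document per reference.

    Each merged document keeps the per-portal details in a "placements" list.
    """
    merged = {}
    for rec in records:
        doc = merged.setdefault(
            rec["reference"],
            {"reference": rec["reference"], "title": rec["title"], "placements": []},
        )
        doc["placements"].append(
            {"portal": rec["portal"], "category": rec["category"], "url": rec["url"]}
        )
    return list(merged.values())


def pick_placement(doc, portal=None):
    """Choose which placement to display: the given portal's, else the first one."""
    for placement in doc["placements"]:
        if placement["portal"] == portal:
            return placement
    return doc["placements"][0]


records = [
    {"reference": "A", "title": "Camera", "portal": "shop", "category": "photo", "url": "/shop/a"},
    {"reference": "A", "title": "Camera", "portal": "video", "category": "gear", "url": "/video/a"},
    {"reference": "B", "title": "Tripod", "portal": "shop", "category": "photo", "url": "/shop/b"},
]
docs = merge_by_reference(records)
print(len(docs))  # prints 2
print(pick_placement(docs[0], portal="video")["url"])  # prints /video/a
```

With one document per reference, the engine's hit counts and pagination no longer include duplicates; the trade-off is that per-portal keywords have to live inside the merged document.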

Discarding data on the client side is complicated: we have to fetch
many results to build the navigation bar by removing duplicates and
counting. Imagine we have thousands of results; it would be a pain to
paginate them. We're using Ajax and we prefer search-engine-powered
pagination to limit data transfer and platform load.

OK, by 'client side' I meant more the middle layer (if there is
one).

Regards,
Peter.

Karussell1 · December 13, 2011, 10:49am

Also have a look into parent/child if that could solve your problem:

http://www.elasticsearch.org/guide/reference/query-dsl/top-children-query.html

Pascal_Pensa · December 13, 2011, 6:20pm

Thanks,

I found a way to remove accents: I added the asciifolding filter, since
the French stemmer doesn't remove them.
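
To illustrate what accent folding does to tokens, here is a rough Python approximation. It is not the asciifolding filter itself: it only strips combining marks after Unicode decomposition, whereas the real filter also maps characters such as "ß" or "æ" that have no decomposition. The filter chain at the end is an assumed example of where "asciifolding" would sit in a custom analyzer.

```python
import unicodedata


def fold_accents(text):
    """Strip diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Assumed filter chain for a custom analyzer (names of built-in ES token filters).
FOLDING_FILTERS = ["lowercase", "asciifolding"]

print(fold_accents("élève à Noël"))  # prints eleve a Noel
print(fold_accents("français"))      # prints francais
```

The same filter chain should be used at index time and at search time, so that a query for "eleve" and one for "élève" hit the same tokens.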

Pascal
