How to sort multi-token strings, plus a distinct feature and UTF-8 support?


(Pascal Pensa) #1

Hi,

I'm new to ES and am trying to sort on strings. I always get an error if
the string contains more than one word.

Another question is about dynamic deduplication/distinct/unique results
based on a field. I've searched the ES wiki and this group without
success. Does ES provide a "unique" feature, or something equivalent,
that removes duplicate results for a given field? Something like:

..?q=some+key+words&unique=reference

This would remove any duplicates from the result set based on the
"reference" tag.

And the last one: my test data contains accents. I tried various
analyzer configurations, installed the ICU plugin and set it up as
a filter, and set the language to French, but it seems accents are not
removed from the tokenized items.

I'm currently using sphinxsearch, where accents need to be manually
table-mapped in the configuration file. Is there an equivalent in ES?

Thanks !
Pascal


(Karussell) #2

Hi

I'm new to ES and am trying to sort on strings. I always get an error if
the string contains more than one word.

You'll need to index them with the keyword analyzer.
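
To illustrate the advice above, here is a minimal sketch of the difference between a word-splitting analyzer and the keyword analyzer. It is plain Java that only simulates the two behaviours; the class and method names are hypothetical and this is not Lucene or ES code. The assumption is that sorting wants a single term per field value, which is why a multi-word string breaks once it is split into several tokens.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class SortTokensSketch {

    /** Rough stand-in for a word-splitting analyzer: lowercases and splits on whitespace. */
    static List<String> analyzedTokens(String value) {
        return Arrays.asList(value.trim().toLowerCase(Locale.ROOT).split("\\s+"));
    }

    /** Rough stand-in for the keyword analyzer: the whole value stays a single token. */
    static List<String> keywordTokens(String value) {
        return Collections.singletonList(value);
    }
}
```

With "Red Apple" as input, the first method yields two tokens and the second yields one, so only the second gives a single, unambiguous sort key per document.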

Another question is about dynamic deduplication/distinct/unique based
on a field

Issue 256, which covers a group-by feature, is not yet implemented. You'll
need to do it on the client side.
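
A minimal sketch of what client-side deduplication could look like, assuming the hits have already been fetched in ranked order and are represented as plain maps. The class name, method name and the map representation are hypothetical and not part of any ES client API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DistinctHits {

    /**
     * Keeps the first hit seen for each value of the given field and
     * preserves the incoming (ranked) order of the survivors.
     * Hits that lack the field all share the null key, so only the
     * first of them is kept.
     */
    static List<Map<String, Object>> distinctBy(List<Map<String, Object>> hits, String field) {
        Map<Object, Map<String, Object>> firstByKey = new LinkedHashMap<>();
        for (Map<String, Object> hit : hits) {
            // putIfAbsent keeps the earlier, higher-ranked hit when a key repeats.
            firstByKey.putIfAbsent(hit.get(field), hit);
        }
        return new ArrayList<>(firstByKey.values());
    }
}
```

Calling `DistinctHits.distinctBy(hits, "reference")` would play the role of the `unique=reference` parameter from the question, at the cost of fetching the duplicates first.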

And the last one: my test data contains accents. I tried various
analyzer configurations, installed the ICU plugin and set it up as
a filter, and set the language to French, but it seems accents are not
removed from the tokenized items.

Did you try the custom rules of the ICU plugin? I read an article
saying that it should somehow be possible ... I'll check.

Regards,
Peter.


(Karussell) #3

Hi

Did you try the custom rules of the ICU plugin? I read an article
saying that it should somehow be possible ... I'll check.

Hmm, strange: for German umlauts it is done in the filter:

http://web.archiveorange.com/archive/v/xJxT8VzgTaUwuBXnP9gJ

but not for French, I think:

http://grepcode.com/file/repository.grepcode.com/java/eclipse.org/3.7/org.apache.lucene/analysis/2.9.1/org/apache/lucene/analysis/fr/FrenchStemmer.java

I fear you will have to patch it or include another stemmer. Ah, but I
saw you already asked in the right place :) (the French Elasticsearch
group)

Regards,
Peter.


(Karussell) #4

This stemmer should do the work:

https://github.com/apache/lucene-solr/blob/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchLightStemmer.java#L230

but you'll need to include it (and raise an issue?) like I did with a
custom filter:

using this filter&factory and setting the stemmer.

https://github.com/apache/lucene-solr/blob/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchStemFilter.java
https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/index/analysis/FrenchStemTokenFilterFactory.java

Something like

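import java.util.Collections;
import java.util.Set;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.fr.FrenchLightStemmer;
import org.apache.lucene.analysis.fr.FrenchStemFilter;
import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.common.inject.assistedinject.Assisted;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.index.Index;
import org.elasticsearch.index.analysis.FrenchStemTokenFilterFactory;
import org.elasticsearch.index.settings.IndexSettings;

public class MyFrenchStemTokenFilterFactory extends FrenchStemTokenFilterFactory {

    // Assumption: no stem exclusions are needed.
    private final Set<?> exclusions = Collections.emptySet();

    @Inject
    public MyFrenchStemTokenFilterFactory(Index index, @IndexSettings Settings indexSettings,
                                          @Assisted String name, @Assisted Settings settings) {
        super(index, indexSettings, name, settings);
    }

    @Override
    public TokenStream create(TokenStream tokenStream) {
        // Untested: setStemmer may only accept a FrenchStemmer, in which case
        // FrenchLightStemmer cannot be passed like this.
        FrenchStemFilter filter = new FrenchStemFilter(tokenStream, exclusions);
        filter.setStemmer(new FrenchLightStemmer());
        return filter;
    }
}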

Peter.


(Karussell) #5

OK, sorry for bumping this once more; you should be able to simply use:

light_french

as the filter. I found it in the code:

https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/index/analysis/StemmerTokenFilterFactory.java

Regards,
Peter.

On 12 Dec, 11:10, Karussell tableyourt...@googlemail.com wrote:

This stemmer should do the work:

https://github.com/apache/lucene-solr/blob/trunk/modules/analysis/com...

but you'll need to include it (and raise an issue?) like I did with a
custom filter:

https://github.com/karussell/Jetwick/blob/master/es/config/elasticsea...

using this filter&factory and setting the stemmer.

https://github.com/apache/lucene-solr/blob/trunk/modules/analysis/com...
https://github.com/elasticsearch/elasticsearch/blob/master/src/main/j...


(Pascal Pensa) #6

Thanks, I'll try your suggestions and report back.

As for uniqueness (or grouping, as it is called in sphinxsearch): we need
it because we have products and videos duplicated across various
sub-catalogs and categories.
We don't merge similar entries, because each has its own keywords,
target URL and so on depending on the portal it belongs to, and our
search can run within a given portal or across portals.
In a cross-universe (cross-portal) search only one result is displayed,
sorted by various factors (freshness, relevance, ...), and the other
similar results are discarded. Today everything is done by the search
engine.

Discarding data on the client side is complicated, because we would have
to fetch many results to build the navigation bar by removing duplicates
and counting. With potentially thousands of results, paginating them
would be a pain. We use Ajax and prefer pagination driven by the search
engine, to limit data transfer and platform load.

Pascal


(Karussell) #7

I'm not sure I completely followed your use case, but IMO one option in
your case would be to index only one product, with an array for the URLs
and categories, and decide (via a middle layer or the client) which one
to display.
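
A minimal sketch of that idea, assuming the duplicated entries are merged before indexing. The `Entry` and `Product` classes and their fields are hypothetical stand-ins for the real catalog data.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ProductMerge {

    /** One catalog entry as it exists today: the same reference can appear in several portals. */
    static final class Entry {
        final String reference;
        final String url;
        final String category;

        Entry(String reference, String url, String category) {
            this.reference = reference;
            this.url = url;
            this.category = category;
        }
    }

    /** One merged document per reference, with every URL and category collected into lists. */
    static final class Product {
        final String reference;
        final List<String> urls = new ArrayList<>();
        final List<String> categories = new ArrayList<>();

        Product(String reference) {
            this.reference = reference;
        }
    }

    /** Groups entries by reference, keeping the order in which references first appear. */
    static List<Product> merge(List<Entry> entries) {
        Map<String, Product> byReference = new LinkedHashMap<>();
        for (Entry entry : entries) {
            Product product = byReference.computeIfAbsent(entry.reference, Product::new);
            product.urls.add(entry.url);
            product.categories.add(entry.category);
        }
        return new ArrayList<>(byReference.values());
    }
}
```

Each merged product would then be indexed once, so the engine returns no duplicates and can paginate by itself; choosing which URL to show for a given portal is left to the middle layer or client.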

Discarding data on the client side is complicated, because we would have
to fetch many results to build the navigation bar by removing duplicates
and counting. With potentially thousands of results, paginating them
would be a pain. We use Ajax and prefer pagination driven by the search
engine, to limit data transfer and platform load.

OK, by 'client side' I meant more the middle layer (if there is
one).

Regards,
Peter.


(Karussell) #8

Also have a look at parent/child, to see whether that could solve your problem:

http://www.elasticsearch.org/guide/reference/query-dsl/top-children-query.html


(Pascal Pensa) #9

Thanks,

I found a way to remove accents: I added the asciifolding filter, since
the French stemmer doesn't remove them.
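
For readers wondering what accent folding does to a token, here is a rough approximation in plain Java. It is not the asciifolding filter itself: it only strips combining marks after Unicode decomposition, whereas the real filter is assumed to cover more characters (ligatures, for example). The class and method names are hypothetical.

```java
import java.text.Normalizer;

final class AccentFolding {

    /** Removes accents by decomposing characters and dropping the combining marks. */
    static String stripAccents(String text) {
        // NFD splits an accented letter into its base letter plus combining marks.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // \p{M} matches combining marks; removing them leaves the base letters.
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

Applying the same folding at index time and at query time is what lets an unaccented query match accented text.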

Pascal

