Facet filters with ICU folding?

Guillermo_Arias_del_ · July 15, 2013, 3:37pm

Hi, all!

I have an index with a field "_tokens" which has relevant tokens associated
with a document. This field is configured as follows:

    "_token" : {
        "type" : "multi_field",
        "fields" : {
            "_token" : {
                "type" : "string",
                "index" : "not_analyzed",
                ...
            },
            "folded" : {
                "type" : "string",
                "analyzer" : "folded",
                ...
            },
            "folded_edge_ngram" : {
                "type" : "string",
                "index_analyzer" : "folded_edge_ngram",
                "search_analyzer" : "folded",
                ...
            }
        }
    }
}

The analyzers "folded" and "folded_edge_ngram" both apply ICU folding; the
latter applies edge_ngram as well.
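
For readers who want to reproduce the setup, here is a minimal sketch of what such analyzer definitions might look like, written as a Python dict that serializes to the index settings JSON. The original post does not show the settings, so everything here is an assumption: the "keyword" tokenizer, the gram sizes, and the filter name "folded_edge_ngram_filter" are illustrative only. The "icu_folding" token filter comes from the ICU analysis plugin.

```python
import json

# Hypothetical index settings. The tokenizer choice, the gram sizes and the
# filter name "folded_edge_ngram_filter" are assumptions for illustration;
# they are not taken from the original post.
INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            "filter": {
                "folded_edge_ngram_filter": {
                    "type": "edgeNGram",
                    "min_gram": 1,
                    "max_gram": 20,
                }
            },
            "analyzer": {
                # ICU folding only: "Bär" and "bar" end up as the same term.
                "folded": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["icu_folding"],
                },
                # Fold first, then emit the leading n-grams of the folded term.
                "folded_edge_ngram": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["icu_folding", "folded_edge_ngram_filter"],
                },
            },
        }
    }
}


def settings_json() -> str:
    """Serialize the settings so they can be sent as an index-creation body."""
    return json.dumps(INDEX_SETTINGS, indent=2)


if __name__ == "__main__":
    print(settings_json())
```

Folding before the n-gram step means the stored prefixes are already folded, so a search analyzer that only folds (as in the mapping above) can match them.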

I'm trying to run a search with the following request:

    {
        "size": 0,
        "query": {
            "bool": {
                "must": [
                    {
                        "term": {
                            "_token.folded_edge_ngram": "bar"
                        }
                    }
                ]
            }
        },
        "facets" : {
            "tokens" : {
                "terms" : {
                    "field" : "_token"
                }
            }
        }
    }

It returns all tokens beginning with "bar" with ICU folding, such as "Bär"
or "bar". But it also returns related tokens (remember that there can be
more than one token in "_tokens"), so I want to restrict the facets with
something like:

* "exclude": doesn't work, because it only supports a full term match
* "regex": works to an extent (match beginning, case insensitive), but it
  doesn't do ICU folding
* "scripts": OMG, how does this work?

So, my question is: is there a way to restrict the facets based on a match
with the ICU folding analyzer? Or am I on the wrong track entirely and should
be using something else (more likely)?

P.S.: Afterwards, I also need the opposite. That is: search all documents
containing an (ICU-folded) word and facet on the other terms
(this has to do with autocompletion).

Thanks!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

jprante · July 15, 2013, 9:19pm

Think of facet entries as visual entities. You can run a facet query
on an ICU-folded field, but ICU-folded terms are not really suitable as
visual entities. If you facet on them, you receive "bar" for "bar" and
"bar" for "Bär". So far, so bad.

For this, I always use keyword-analyzed fields for faceting, like you do
with multifielding. So I get two entries for "bar" and "Bär", as in the
original document.

The challenge I have is getting the facet entries sorted by ICU
collations, so I once opened a pull request:
https://github.com/elasticsearch/elasticsearch-analysis-icu/pull/7/

Or do you want to collapse "bar" and "Bär" into one facet entry
intentionally?

Jörg

Guillermo_Arias_del_ · July 16, 2013, 7:19am

Hi, Jörg,

I am trying to perform autocompletion, and I want to do it with ICU
folding. If the user types "bar", I want tokens like "Bär", "bar" and
"barman". But it gets more interesting: I want the user to be able to give
me more than one token. For example:

1. User types: "hello wor"
2. I match "hello" against "_tokens.folded" and match "wor" against
   "_tokens.folded_edge_ngram"
3. ES gives me the documents, for instance: document1._tokens = [ "Hello",
   "World" ], document2._tokens = [ "hello", "word", "blabla" ]
4. I want to exclude "Hello", "hello", and "blabla", and retain "World"
   and "word"
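
The steps above can be sketched as a small request builder. This is only an illustration under stated assumptions: the function name `build_request` is hypothetical, the field names follow this post ("_tokens"), and it uses `match` rather than the `term` query from the first post so that the typed text goes through each field's search analyzer. That substitution is mine, not something confirmed in the thread.

```python
import json
from typing import Any, Dict, List


def build_request(user_input: str, field: str = "_tokens") -> Dict[str, Any]:
    """Build a search body for input such as "hello wor".

    Every word except the last is treated as complete and matched against
    the folded sub-field; the last word is treated as a prefix and matched
    against the folded edge n-gram sub-field.
    """
    words = user_input.split()
    if not words:
        raise ValueError("user_input must contain at least one token")

    *complete, partial = words
    must: List[Dict[str, Any]] = [
        {"match": {f"{field}.folded": word}} for word in complete
    ]
    must.append({"match": {f"{field}.folded_edge_ngram": partial}})

    return {
        "size": 0,
        "query": {"bool": {"must": must}},
        # Facet on the not_analyzed sub-field so entries keep their
        # original spelling ("World", "word", ...).
        "facets": {"tokens": {"terms": {"field": field}}},
    }


if __name__ == "__main__":
    print(json.dumps(build_request("hello wor"), indent=2))
```

The facet still returns every token of the matching documents, which is exactly the problem described in step 4: the request finds the right documents but cannot by itself drop "Hello", "hello" and "blabla" from the facet.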

If it were a search against an unanalyzed field, I could accomplish this
with "exclude" and "regex", but I can't. So now, I am looping through the
results and filtering myself, which means calling _analyze for each
result...
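
To make the client-side filtering concrete, here is a rough sketch that avoids one _analyze call per facet entry by folding locally. Note the swap: it does not use ICU at all. It approximates folding with Python's standard `unicodedata` module (decompose, drop combining marks, casefold), which covers cases like "Bär" but is not equivalent to the ICU folding filter for every character. The function names are hypothetical.

```python
import unicodedata
from typing import Iterable, List


def approx_fold(text: str) -> str:
    """Approximate folding: strip diacritics and ignore case.

    This is NOT ICU folding; it is a stand-in that only handles characters
    which decompose into a base letter plus combining marks.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def filter_facet_terms(terms: Iterable[str], prefix: str) -> List[str]:
    """Keep only the facet terms whose folded form starts with the folded prefix."""
    folded_prefix = approx_fold(prefix)
    return [term for term in terms if approx_fold(term).startswith(folded_prefix)]


if __name__ == "__main__":
    facet_terms = ["Hello", "World", "hello", "word", "blabla"]
    print(filter_facet_terms(facet_terms, "wor"))
```

If exact agreement with the index-time analyzer matters, the local fold would have to be replaced by something that really applies ICU folding; the sketch only shows the shape of the post-filtering step.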

Maybe I should try with another index structure, I don't know.

Guillermo.

jprante · July 16, 2013, 7:35am

I still don't understand what role you want facet filters to play here.

Autocompletion works best on edge n-gram fields. You can combine edge
n-gram and ICU folding. There is no need to bother with exclude, regex, and
_analyze.

But it seems you want autosuggest, not autocomplete. That is, you want
to deliver a list of ranked suggestions for the start of a word or phrase.
Check if the Suggest API can help you:

Jörg
