Alphabetic sorting strategies

Matthew_Painter · November 22, 2013, 7:46pm

Hi all,

We have a field that we would like to sort on (approximately). Its values
are unique, Unicode, and potentially long.

To reduce the amount of memory Elasticsearch requires, we have been
thinking about strategies such as:

- using a multi field and only indexing the first n characters
- mapping the strings to a float

Does anyone have any good suggestions for how to manage this kind of use
case better?
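
For illustration, the second idea (mapping strings to a float) could look roughly like the Python sketch below. The helper name `prefix_to_float` and the default `n = 2` are assumptions made for this example, and the ordering is by raw Unicode code point, not by any locale-aware collation.

```python
def prefix_to_float(text: str, n: int = 2) -> float:
    """Map the first n code points of text to a float between 0 and 1.

    Hypothetical helper: orders by raw code point, not by any locale.
    """
    base = 0x110000 + 1  # every Unicode code point, plus 0 for "no character"
    key = 0
    for i in range(n):
        # Shift code points up by one so a missing character (0) sorts first.
        digit = ord(text[i]) + 1 if i < len(text) else 0
        key = key * base + digit
    # Exact integer arithmetic first, then a single rounded division.
    return key / base ** n


titles = ["zeta", "alpha", "mu", "alphabet"]
print(sorted(titles, key=prefix_to_float))
```

Each code point needs about 20 bits, so a 64-bit double (53-bit mantissa) can only tell apart the first two characters this way, and a 32-bit float only the first one. Strings sharing that prefix, such as "alpha" and "alphabet", get the same value.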

Thanks

Matt

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

jprante · November 22, 2013, 9:15pm

Use the ICU collation provided by the ICU plugin for sorting. The "strength"
level can affect the sort key length. Sorting depends on locale, so I do not
recommend indexing only the first n characters.

Jörg

Matthew_Painter · November 22, 2013, 10:51pm

Indeed, I see that the looser the comparison, the shorter a collation key
can be.

I can't see what issues taking the first n characters would cause, assuming
accents are combined with their letters in the normalized Unicode form. Of
course, a better alternative might be to take only the first n bytes of the
collation key, which should give an approximate ordering with a known
precision. Doing this, and ignoring punctuation in the collator, looks to me
like the best way to get a good-enough ordering.
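
A rough Python sketch of the "first n bytes of the key" idea follows. It does not use ICU: `toy_primary_key` is a hypothetical stand-in that only strips accents, punctuation, and case, so it approximates a primary-strength key without any locale rules. The point it shows is that truncating a byte key can create ties but never reverses the order of two keys.

```python
import unicodedata


def toy_primary_key(text: str) -> bytes:
    """Toy stand-in for a primary-strength collation key (NOT real ICU)."""
    decomposed = unicodedata.normalize("NFKD", text)
    kept = [
        ch for ch in decomposed
        if not unicodedata.combining(ch)                   # drop accents
        and not unicodedata.category(ch).startswith("P")   # drop punctuation
    ]
    return "".join(kept).casefold().encode("utf-8")


def truncated_key(text: str, n: int = 8) -> bytes:
    """Keep only the first n bytes of the key: bounded size, coarser order."""
    return toy_primary_key(text)[:n]


names = ["Zoë", "zoe", "Émile", "emil", "O'Brien", "Obrien-Smith", "Åsa"]
print(sorted(names, key=truncated_key))
```

With a real collator the key bytes would come from the collation library instead, but the truncation step would be the same slice.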

Thanks for the input, of course.

On Friday, November 22, 2013, joergprante@gmail.com wrote:

Use ICU collation of the ICU plugin for sorting. With the "strength"
level, sort key length may be affected. Sorting depends on locale, so I do
not recommend only indexing first n characters.

GitHub - elastic/elasticsearch-analysis-icu: ICU Analysis plugin for Elasticsearch

Jörg

jprante · November 23, 2013, 12:27pm

The length of a sort key can be reduced by several methods, e.g. run-length
encoding. More details are available in the Unicode Collation Algorithm report:
http://www.unicode.org/reports/tr10/#Reducing_Sort_Key_Lengths
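
As a rough illustration of order-preserving run-length encoding, here is a Python sketch in the spirit of that section, not a copy of it. The names `COMMON` and `compress_weights`, the value `0x20`, and the use of tuples instead of packed bytes are all assumptions made for this example. A run of the most common weight becomes a single token: runs followed by a lower weight (or the end of the key) sort lower the shorter they are, and runs followed by a higher weight sort higher the shorter they are.

```python
import itertools

COMMON = 0x20  # assumed "most common" weight on a collation level


def compress_weights(weights, common=COMMON):
    """Collapse runs of `common` into one token while preserving sort order.

    Tokens are tuples whose first element is a class:
      0 = weight below common, 1 = run followed by a lower weight or the end,
      2 = run followed by a higher weight, 3 = weight above common.
    """
    groups = [(w, len(list(g))) for w, g in itertools.groupby(weights)]
    out = []
    for i, (w, count) in enumerate(groups):
        if w < common:
            out.extend([(0, w)] * count)
        elif w > common:
            out.extend([(3, w)] * count)
        else:
            nxt = groups[i + 1][0] if i + 1 < len(groups) else None
            if nxt is None or nxt < common:
                out.append((1, count))    # longer run sorts higher
            else:
                out.append((2, -count))   # longer run sorts lower
    return out


key = [0x20] * 10 + [0x30]
print(len(key), len(compress_weights(key)))
```

A real implementation would pack these tokens into bytes and cap the run length; the tuples here just make the ordering argument easy to check.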

Jörg
