Alphabetic sorting strategies


(Matthew Painter) #1

Hi all,

We have a field that we would like to sort on (approximately). Its values
are unique Unicode strings and potentially long.

In order to decrease the amount of memory required by elasticsearch, we
have been thinking about strategies such as:

  • using a multi field and only indexing the first n characters
  • mapping the strings to a float
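
A minimal sketch of what the "map the strings to a float" idea could look like. This is only an illustration under assumptions, not an Elasticsearch feature: `prefix_to_float` is a hypothetical helper, and it orders by raw code point rather than by any locale's rules. A 64-bit float has a 53-bit significand and one code point needs about 21 bits, so only the first two code points fit exactly.

```python
UNICODE_CODE_POINTS = 0x110000  # size of the Unicode code space


def prefix_to_float(text: str, n: int = 2) -> float:
    """Map the first n code points of text to a float in [0, 1).

    Hypothetical helper: ordering follows raw code point order only.
    With n <= 2 the integer key stays below 2**53, so it is exact.
    """
    key = 0
    for i in range(n):
        # Pad short strings with 0 so "a" sorts before "ab".
        digit = ord(text[i]) if i < len(text) else 0
        key = key * UNICODE_CODE_POINTS + digit
    return key / UNICODE_CODE_POINTS ** n


# Strings that share their first two code points collapse to the same value.
same = prefix_to_float("abc") == prefix_to_float("abd")
ordered = sorted(["banana", "apple", "cherry"], key=prefix_to_float)
```

The trade-off is visible in `same`: everything past the second code point is lost, so the ordering is approximate in a fairly coarse way.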

Does anyone have any good suggestions for how to manage this kind of use
case better?

Thanks :)

Matt

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Jörg Prante) #2

Use the ICU collation provided by the ICU plugin for sorting. The "strength"
level affects the sort key length. Sorting depends on locale, so I do not
recommend indexing only the first n characters.
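
A toy illustration of why a character prefix is not always a prefix of the sort order. In Czech, "ch" is treated as a single letter that sorts after "h". The alphabet table and `toy_czech_key` below are hypothetical simplifications (lowercase ASCII only, no diacritics), not the ICU collator.

```python
# Simplified Czech-like alphabet: the digraph "ch" is one letter after "h".
CZECH_LIKE_ALPHABET = list("abcdefgh") + ["ch"] + list("ijklmnopqrstuvwxyz")
RANK = {letter: i for i, letter in enumerate(CZECH_LIKE_ALPHABET)}


def toy_czech_key(word):
    """Return a list of letter ranks, treating "ch" as a single letter."""
    key, i = [], 0
    while i < len(word):
        if word.startswith("ch", i):
            key.append(RANK["ch"])
            i += 2
        else:
            key.append(RANK[word[i]])
            i += 1
    return key


words = ["dub", "chata", "had", "cibule"]

# Sorting on the whole word respects the digraph.
full_order = sorted(words, key=toy_czech_key)

# Sorting on only the first character sees "c" and misplaces "chata".
prefix_order = sorted(words, key=lambda w: toy_czech_key(w[:1]))
```

Here `full_order` puts "chata" last, while `prefix_order` puts it first. That is an inversion rather than a tie, which is the kind of problem a plain first-n-characters field can introduce.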

Jörg



(Matthew Painter) #3

Indeed, I see that the looser the comparison, the shorter the collation
key can be.

I can't see what issues taking the first n characters would cause, assuming
accents are combined with their letters in the normalized Unicode form. A
better alternative might be to take only the first n bytes of the collation
key, which should give an approximate ordering with a known precision. Doing
this, with the collator set to ignore punctuation, looks to me like the best
way to get a good-enough ordering?
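
A rough sketch of the truncated-key idea. The `crude_sort_key` function is a hypothetical stand-in that ignores case, accents, and punctuation using only the standard library; it is not the ICU collation algorithm. The point it illustrates is that cutting a byte-comparable key to n bytes never reverses the order of two keys, it only turns some pairs into ties.

```python
import unicodedata


def crude_sort_key(text: str) -> bytes:
    """Stand-in for a collation key (NOT ICU): drops case, accents, punctuation."""
    decomposed = unicodedata.normalize("NFD", text.casefold())
    kept = [
        ch
        for ch in decomposed
        if not unicodedata.combining(ch)  # drop combining accents
        and not unicodedata.category(ch).startswith("P")  # drop punctuation
    ]
    return "".join(kept).encode("utf-8")


def truncated_key(text: str, n: int) -> bytes:
    """Keep only the first n bytes of the stand-in key."""
    return crude_sort_key(text)[:n]


titles = ["Zola, Émile", "Emily Brontë", "emile-zola", "Émile Zola"]

# Order by full key, then check the truncated keys are still non-decreasing.
by_full_key = sorted(titles, key=crude_sort_key)
short_keys = [truncated_key(t, 4) for t in by_full_key]
still_ordered = short_keys == sorted(short_keys)
```

With n = 4, the three titles starting with "emil" become indistinguishable, so n controls the precision of the approximate ordering.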

Thanks for the input of course :)

On Friday, November 22, 2013, joergprante@gmail.com wrote:

Use ICU collation of the ICU plugin for sorting. With the "strength"
level, sort key length may be affected. Sorting depends on locale, so I do
not recommend only indexing first n characters.

https://github.com/elasticsearch/elasticsearch-analysis-icu

Jörg



(Jörg Prante) #4

The length of a sort key can be reduced by several methods, e.g. run-length
encoding. More details are available in the Unicode Collation Algorithm report:
http://www.unicode.org/reports/tr10/#Reducing_Sort_Key_Lengths
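
To give a feel for why run-length encoding helps, here is a generic sketch on a made-up sequence of weights. The byte values are invented for illustration, and this naive encoding is not guaranteed to keep keys comparable byte by byte; the report linked above describes how to compress runs while preserving order.

```python
from itertools import groupby


def run_length_encode(weights: bytes) -> list:
    """Collapse each run of equal byte values into a (value, count) pair."""
    return [(value, len(list(run))) for value, run in groupby(weights)]


def run_length_decode(pairs) -> bytes:
    """Expand (value, count) pairs back into the original byte sequence."""
    return bytes(value for value, count in pairs for _ in range(count))


# Hypothetical weights: long runs of one "common" value with one exception.
weights = bytes([0x05] * 12 + [0x8A] + [0x05] * 7)
encoded = run_length_encode(weights)
```

The 20 input bytes collapse into three pairs here, which is the effect that makes long runs of identical weights cheap to store.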

Jörg


