Sorting a string field numerically


(Axsuul) #1

I have a field that is string type. Sometimes it will contain integer values
such as 1, 2, 10, 20. Currently how it sorts those is 1, 10, 2, 20. How do
I go aboutsorting that numerically so that it's 1, 2, 10, 20?

--


(Nick Hoffman) #2

Hah, I asked the exact same question:
https://groups.google.com/forum/#!topic/elasticsearch/gds3sZKA-SI

On Sunday, 28 October 2012 14:45:25 UTC-4, James Hu wrote:

I have a field that is string type. Sometimes it will contain integer values
such as 1, 2, 10, 20. Currently how it sorts those is 1, 10, 2, 20. How
do I go aboutsorting that numerically so that it's 1, 2, 10, 20?

--


(Axsuul) #3

Awesome, thanks for pointing that out! I think leading zeros is a great solution. I'm now using 10 character length with leading zeros and it works well. Maybe you should consider using something conservative like 15 so that it will never surpass that.


(Radu Gheorghe) #4

Hello,

You can use the _source field, if you got it stored, in a script
within a custom_score query. For example, assuming "foo" is the name
of our field:

curl -XPOST localhost:9200/test/test/_search?pretty=true -d '{
"query": {
"custom_score": {
"query": {
"match_all": {}
},
"script": "if (_source.foo is Integer) { _source.foo }
else { _score/5 }"
}
}
}'

This will get your numeric values sorted as numbers. You'd need to
fill out the "else" part of the script with whatever values you want
to give to your non-numeric fields.

Best regards,
Radu

http://sematext.com/ -- ElasticSearch -- Solr -- Lucene

On Sun, Oct 28, 2012 at 9:10 PM, Nick Hoffman nick@deadorange.com wrote:

Hah, I asked the exact same question:
https://groups.google.com/forum/#!topic/elasticsearch/gds3sZKA-SI

On Sunday, 28 October 2012 14:45:25 UTC-4, James Hu wrote:

I have a field that is string type. Sometimes it will contain integer
values such as 1, 2, 10, 20. Currently how it sorts those is 1, 10, 2, 20.
How do I go aboutsorting that numerically so that it's 1, 2, 10, 20?

--

--


(Nick Hoffman) #5

Hi Radu. That's an interesting idea. How would you score fields that aren't
numeric, though?

On Monday, 29 October 2012 05:49:37 UTC-4, Radu Gheorghe wrote:

Hello,

You can use the _source field, if you got it stored, in a script
within a custom_score query. For example, assuming "foo" is the name
of our field:

curl -XPOST localhost:9200/test/test/_search?pretty=true -d '{
"query": {
"custom_score": {
"query": {
"match_all": {}
},
"script": "if (_source.foo is Integer) { _source.foo }
else { _score/5 }"
}
}
}'

This will get your numeric values sorted as numbers. You'd need to
fill out the "else" part of the script with whatever values you want
to give to your non-numeric fields.

Best regards,
Radu

http://sematext.com/ -- ElasticSearch -- Solr -- Lucene

On Sun, Oct 28, 2012 at 9:10 PM, Nick Hoffman <ni...@deadorange.com<javascript:>>
wrote:

Hah, I asked the exact same question:
https://groups.google.com/forum/#!topic/elasticsearch/gds3sZKA-SI

On Sunday, 28 October 2012 14:45:25 UTC-4, James Hu wrote:

I have a field that is string type. Sometimes it will contain integer
values such as 1, 2, 10, 20. Currently how it sorts those is 1, 10, 2,

How do I go aboutsorting that numerically so that it's 1, 2, 10, 20?

--

--


(Igor Motov) #6

I think the idea that Radu outlined might work. But I would propose a few
changes to the implementation. First of all, retrieving and parsing source
for every single result can be very expensive especially if records have
large sources. So, it might be better to replace _source lookup with doc or
_field lookup.

The expression "_source.foo is Integer" would work only if "foo" was an
integer in the source. In other words, if it was indexed as {"foo": 123} it
would work, but if it was indexed as {"foo": "123"} it wouldn't. In order
to handle the latter case, we need to actually parse the string.

To answer, Nick's question, we can simplify handling of non-numeric cases,
by switching from custom_score query to sort script .

So, if we combine all these suggestions, we will get something like this:

{
"query" : {
"match_all": {}
},
"sort": {
"_script":{
"script" : "s = doc['''foo'''].value; n =
org.elasticsearch.common.primitives.Ints.tryParse(s); if (n != null) {
String.format("%010d",n)} else { s }",
"type" : "string"
}
}
}'

That still might be too slow for the large result list. So, an ideal
solution here would be to pad integer values with 0 so all integers have
the same size during indexing. In other words, do the same thing that the
script above is doing but during indexing.

Here is a complete example if somebody wants to play with
it: https://gist.github.com/3973641

On Monday, October 29, 2012 8:59:26 AM UTC-4, Nick Hoffman wrote:

Hi Radu. That's an interesting idea. How would you score fields that
aren't numeric, though?

On Monday, 29 October 2012 05:49:37 UTC-4, Radu Gheorghe wrote:

Hello,

You can use the _source field, if you got it stored, in a script
within a custom_score query. For example, assuming "foo" is the name
of our field:

curl -XPOST localhost:9200/test/test/_search?pretty=true -d '{
"query": {
"custom_score": {
"query": {
"match_all": {}
},
"script": "if (_source.foo is Integer) { _source.foo }
else { _score/5 }"
}
}
}'

This will get your numeric values sorted as numbers. You'd need to
fill out the "else" part of the script with whatever values you want
to give to your non-numeric fields.

Best regards,
Radu

http://sematext.com/ -- ElasticSearch -- Solr -- Lucene

On Sun, Oct 28, 2012 at 9:10 PM, Nick Hoffman ni...@deadorange.com
wrote:

Hah, I asked the exact same question:
https://groups.google.com/forum/#!topic/elasticsearch/gds3sZKA-SI

On Sunday, 28 October 2012 14:45:25 UTC-4, James Hu wrote:

I have a field that is string type. Sometimes it will contain integer
values such as 1, 2, 10, 20. Currently how it sorts those is 1, 10, 2,

How do I go aboutsorting that numerically so that it's 1, 2, 10, 20?

--

--


(rpsandiford) #7

Another index-time possibility is to use a magnitude notation prefix for all the numeric portions.

So - if the number is 1 digit long, prefix it with a '1', if it is two digits, prefix with a '2', and so on. If you think you'll exceed 10 digits anywhere, then use a two-digit prefix (i.e. 01 and 02 respectively for a 1 digit and a 2 digit number). That gets around having long tokens (e.g. a whole bunch of leading zeroes) which might be an issue, and also allows you some more flexibility on how long an integer you might encounter without having to re-index everything to put more leading zeroes on the front if you exceed your initial 'max length' number. (i.e. a 2 digit mag notation prefix will handle up to a 99 digit number)

So, 1, 2, 10, 20 would be indexed as 011, 012, 0210, 0220, which will sort 'alphabetically' (i.e. LTR sort) in the correct numeric sequence.

Bob Sandiford | Principal Engineer | SirsiDynix
P: 800.288.8020 X6943 | Bob.Sandiford@sirsidynix.commailto:Bob.Sandiford@sirsidynix.com
www.sirsidynix.comhttp://www.sirsidynix.com/

Join the conversation: Like us on Facebook!http://www.facebook.com/SirsiDynix Follow us on Twitter!http://twitter.com/SirsiDynix

From: Igor Motov-3 [via ElasticSearch Users] [mailto:ml-node+s115913n4024574h41@n3.nabble.com]
Sent: Monday, October 29, 2012 9:58 AM
To: Bob Sandiford
Subject: Re: Sorting a string field numerically

I think the idea that Radu outlined might work. But I would propose a few changes to the implementation. First of all, retrieving and parsing source for every single result can be very expensive especially if records have large sources. So, it might be better to replace _source lookup with doc or _field lookup.

The expression "_source.foo is Integer" would work only if "foo" was an integer in the source. In other words, if it was indexed as {"foo": 123} it would work, but if it was indexed as {"foo": "123"} it wouldn't. In order to handle the latter case, we need to actually parse the string.

To answer, Nick's question, we can simplify handling of non-numeric cases, by switching from custom_score query to sort script .

So, if we combine all these suggestions, we will get something like this:

{
"query" : {
"match_all": {}
},
"sort": {
"_script":{
"script" : "s = doc['''foo'''].value; n = org.elasticsearch.common.primitives.Ints.tryParse(s); if (n != null) { String.format("%010d",n)} else { s }",
"type" : "string"
}
}
}'

That still might be too slow for the large result list. So, an ideal solution here would be to pad integer values with 0 so all integers have the same size during indexing. In other words, do the same thing that the script above is doing but during indexing.

Here is a complete example if somebody wants to play with it: https://gist.github.com/3973641

On Monday, October 29, 2012 8:59:26 AM UTC-4, Nick Hoffman wrote:
Hi Radu. That's an interesting idea. How would you score fields that aren't numeric, though?

On Monday, 29 October 2012 05:49:37 UTC-4, Radu Gheorghe wrote:
Hello,

You can use the _source field, if you got it stored, in a script
within a custom_score query. For example, assuming "foo" is the name
of our field:

curl -XPOST localhost:9200/test/test/_search?pretty=true -d '{
"query": {
"custom_score": {
"query": {
"match_all": {}
},
"script": "if (_source.foo is Integer) { _source.foo }
else { _score/5 }"
}
}
}'

This will get your numeric values sorted as numbers. You'd need to
fill out the "else" part of the script with whatever values you want
to give to your non-numeric fields.

Best regards,
Radu

http://sematext.com/ -- ElasticSearch -- Solr -- Lucene

On Sun, Oct 28, 2012 at 9:10 PM, Nick Hoffman <[hidden email]</user/SendEmail.jtp?type=node&node=4024574&i=0>> wrote:

Hah, I asked the exact same question:
https://groups.google.com/forum/#!topic/elasticsearch/gds3sZKA-SI

On Sunday, 28 October 2012 14:45:25 UTC-4, James Hu wrote:

I have a field that is string type. Sometimes it will contain integer
values such as 1, 2, 10, 20. Currently how it sorts those is 1, 10, 2, 20.
How do I go aboutsorting that numerically so that it's 1, 2, 10, 20?

--

--


If you reply to this email, your message will be added to the discussion below:
http://elasticsearch-users.115913.n3.nabble.com/Sorting-a-string-field-numerically-tp4024557p4024574.html
To start a new topic under ElasticSearch Users, email ml-node+s115913n115913h50@n3.nabble.commailto:ml-node+s115913n115913h50@n3.nabble.com
To unsubscribe from ElasticSearch Users, click herehttp://elasticsearch-users.115913.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_code&node=115913&code=Ym9iLnNhbmRpZm9yZEBzaXJzaWR5bml4LmNvbXwxMTU5MTN8LTIxMTYxMTI0NTQ=.
NAMLhttp://elasticsearch-users.115913.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html!nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers!nabble%3Aemail.naml-instant_emails!nabble%3Aemail.naml-send_instant_email!nabble%3Aemail.naml


(rpsandiford) #8

P.S. - you may also want to consider decimals. e.g. if "12.5" should sort before "12.12" (for example), then you'll want to include a magnitude notation following a decimal (i.e. you'd have 312.15 and 312.212). However, if "12.5" should sort after "12.12", then you'll want to avoid placing a magnitude notation prefix on digits following the decimal point (i.e. you'd have 312.5 and 312.12...)

Bob Sandiford | Principal Engineer | SirsiDynix
P: 800.288.8020 X6943 | Bob.Sandiford@sirsidynix.commailto:Bob.Sandiford@sirsidynix.com
www.sirsidynix.comhttp://www.sirsidynix.com/

Join the conversation: Like us on Facebook!http://www.facebook.com/SirsiDynix Follow us on Twitter!http://twitter.com/SirsiDynix

From: Bob Sandiford
Sent: Monday, October 29, 2012 10:05 AM
To: 'Igor Motov-3 [via ElasticSearch Users]'
Subject: RE: Sorting a string field numerically

Another index-time possibility is to use a magnitude notation prefix for all the numeric portions.

So - if the number is 1 digit long, prefix it with a '1', if it is two digits, prefix with a '2', and so on. If you think you'll exceed 10 digits anywhere, then use a two-digit prefix (i.e. 01 and 02 respectively for a 1 digit and a 2 digit number). That gets around having long tokens (e.g. a whole bunch of leading zeroes) which might be an issue, and also allows you some more flexibility on how long an integer you might encounter without having to re-index everything to put more leading zeroes on the front if you exceed your initial 'max length' number. (i.e. a 2 digit mag notation prefix will handle up to a 99 digit number)

So, 1, 2, 10, 20 would be indexed as 011, 012, 0210, 0220, which will sort 'alphabetically' (i.e. LTR sort) in the correct numeric sequence.

Bob Sandiford | Principal Engineer | SirsiDynix
P: 800.288.8020 X6943 | Bob.Sandiford@sirsidynix.commailto:Bob.Sandiford@sirsidynix.com
www.sirsidynix.comhttp://www.sirsidynix.com/

Join the conversation: Like us on Facebook!http://www.facebook.com/SirsiDynix Follow us on Twitter!http://twitter.com/SirsiDynix

From: Igor Motov-3 [via ElasticSearch Users] [mailto:ml-node+s115913n4024574h41@n3.nabble.com]mailto:[mailto:ml-node+s115913n4024574h41@n3.nabble.com]
Sent: Monday, October 29, 2012 9:58 AM
To: Bob Sandiford
Subject: Re: Sorting a string field numerically

I think the idea that Radu outlined might work. But I would propose a few changes to the implementation. First of all, retrieving and parsing source for every single result can be very expensive especially if records have large sources. So, it might be better to replace _source lookup with doc or _field lookup.

The expression "_source.foo is Integer" would work only if "foo" was an integer in the source. In other words, if it was indexed as {"foo": 123} it would work, but if it was indexed as {"foo": "123"} it wouldn't. In order to handle the latter case, we need to actually parse the string.

To answer, Nick's question, we can simplify handling of non-numeric cases, by switching from custom_score query to sort script .

So, if we combine all these suggestions, we will get something like this:

{
"query" : {
"match_all": {}
},
"sort": {
"_script":{
"script" : "s = doc['''foo'''].value; n = org.elasticsearch.common.primitives.Ints.tryParse(s); if (n != null) { String.format("%010d",n)} else { s }",
"type" : "string"
}
}
}'

That still might be too slow for the large result list. So, an ideal solution here would be to pad integer values with 0 so all integers have the same size during indexing. In other words, do the same thing that the script above is doing but during indexing.

Here is a complete example if somebody wants to play with it: https://gist.github.com/3973641

On Monday, October 29, 2012 8:59:26 AM UTC-4, Nick Hoffman wrote:
Hi Radu. That's an interesting idea. How would you score fields that aren't numeric, though?

On Monday, 29 October 2012 05:49:37 UTC-4, Radu Gheorghe wrote:
Hello,

You can use the _source field, if you got it stored, in a script
within a custom_score query. For example, assuming "foo" is the name
of our field:

curl -XPOST localhost:9200/test/test/_search?pretty=true -d '{
"query": {
"custom_score": {
"query": {
"match_all": {}
},
"script": "if (_source.foo is Integer) { _source.foo }
else { _score/5 }"
}
}
}'

This will get your numeric values sorted as numbers. You'd need to
fill out the "else" part of the script with whatever values you want
to give to your non-numeric fields.

Best regards,
Radu

http://sematext.com/ -- ElasticSearch -- Solr -- Lucene

On Sun, Oct 28, 2012 at 9:10 PM, Nick Hoffman <[hidden email]</user/SendEmail.jtp?type=node&node=4024574&i=0>> wrote:

Hah, I asked the exact same question:
https://groups.google.com/forum/#!topic/elasticsearch/gds3sZKA-SI

On Sunday, 28 October 2012 14:45:25 UTC-4, James Hu wrote:

I have a field that is string type. Sometimes it will contain integer
values such as 1, 2, 10, 20. Currently how it sorts those is 1, 10, 2, 20.
How do I go aboutsorting that numerically so that it's 1, 2, 10, 20?

--

--


If you reply to this email, your message will be added to the discussion below:
http://elasticsearch-users.115913.n3.nabble.com/Sorting-a-string-field-numerically-tp4024557p4024574.html
To start a new topic under ElasticSearch Users, email ml-node+s115913n115913h50@n3.nabble.commailto:ml-node+s115913n115913h50@n3.nabble.com
To unsubscribe from ElasticSearch Users, click herehttp://elasticsearch-users.115913.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_code&node=115913&code=Ym9iLnNhbmRpZm9yZEBzaXJzaWR5bml4LmNvbXwxMTU5MTN8LTIxMTYxMTI0NTQ=.
NAMLhttp://elasticsearch-users.115913.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html!nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers!nabble%3Aemail.naml-instant_emails!nabble%3Aemail.naml-send_instant_email!nabble%3Aemail.naml


(Nick Hoffman) #9

Hi Igor. Your suggestion to do a field lookup is definitely an improvement.
How would the rest of your solution work if the field's value contains
numeric and non-numeric characters? E.g.
{ "foo" : "hello 30 world" }
{ "foo" : "hello 2 world" }
{ "foo" : "44 bar" }

Thanks,
Nick

On Monday, 29 October 2012 09:57:51 UTC-4, Igor Motov wrote:

I think the idea that Radu outlined might work. But I would propose a few
changes to the implementation. First of all, retrieving and parsing source
for every single result can be very expensive especially if records have
large sources. So, it might be better to replace _source lookup with doc or
_field lookup.

The expression "_source.foo is Integer" would work only if "foo" was an
integer in the source. In other words, if it was indexed as {"foo": 123} it
would work, but if it was indexed as {"foo": "123"} it wouldn't. In order
to handle the latter case, we need to actually parse the string.

To answer, Nick's question, we can simplify handling of non-numeric cases,
by switching from custom_score query to sort script .

So, if we combine all these suggestions, we will get something like this:

{
"query" : {
"match_all": {}
},
"sort": {
"_script":{
"script" : "s = doc['''foo'''].value; n =
org.elasticsearch.common.primitives.Ints.tryParse(s); if (n != null) {
String.format("%010d",n)} else { s }",
"type" : "string"
}
}
}'

That still might be too slow for the large result list. So, an ideal
solution here would be to pad integer values with 0 so all integers have
the same size during indexing. In other words, do the same thing that the
script above is doing but during indexing.

Here is a complete example if somebody wants to play with it:
https://gist.github.com/3973641

On Monday, October 29, 2012 8:59:26 AM UTC-4, Nick Hoffman wrote:

Hi Radu. That's an interesting idea. How would you score fields that
aren't numeric, though?

On Monday, 29 October 2012 05:49:37 UTC-4, Radu Gheorghe wrote:

Hello,

You can use the _source field, if you got it stored, in a script
within a custom_score query. For example, assuming "foo" is the name
of our field:

curl -XPOST localhost:9200/test/test/_search?pretty=true -d '{
"query": {
"custom_score": {
"query": {
"match_all": {}
},
"script": "if (_source.foo is Integer) { _source.foo }
else { _score/5 }"
}
}
}'

This will get your numeric values sorted as numbers. You'd need to
fill out the "else" part of the script with whatever values you want
to give to your non-numeric fields.

Best regards,
Radu

http://sematext.com/ -- ElasticSearch -- Solr -- Lucene

On Sun, Oct 28, 2012 at 9:10 PM, Nick Hoffman ni...@deadorange.com
wrote:

Hah, I asked the exact same question:
https://groups.google.com/forum/#!topic/elasticsearch/gds3sZKA-SI

On Sunday, 28 October 2012 14:45:25 UTC-4, James Hu wrote:

I have a field that is string type. Sometimes it will contain integer
values such as 1, 2, 10, 20. Currently how it sorts those is 1, 10,
2, 20.

How do I go aboutsorting that numerically so that it's 1, 2, 10, 20?

--

--


(Igor Motov) #10

In this case, you would have to split the value of the foo field into
words, check if each word is an integer and replace it with a number with
leading zeros if it is.

On Monday, October 29, 2012 3:22:47 PM UTC-4, Nick Hoffman wrote:

Hi Igor. Your suggestion to do a field lookup is definitely an
improvement. How would the rest of your solution work if the field's value
contains numeric and non-numeric characters? E.g.
{ "foo" : "hello 30 world" }
{ "foo" : "hello 2 world" }
{ "foo" : "44 bar" }

Thanks,
Nick

On Monday, 29 October 2012 09:57:51 UTC-4, Igor Motov wrote:

I think the idea that Radu outlined might work. But I would propose a few
changes to the implementation. First of all, retrieving and parsing source
for every single result can be very expensive especially if records have
large sources. So, it might be better to replace _source lookup with doc or
_field lookup.

The expression "_source.foo is Integer" would work only if "foo" was an
integer in the source. In other words, if it was indexed as {"foo": 123} it
would work, but if it was indexed as {"foo": "123"} it wouldn't. In order
to handle the latter case, we need to actually parse the string.

To answer, Nick's question, we can simplify handling of non-numeric
cases, by switching from custom_score query to sort script .

So, if we combine all these suggestions, we will get something like this:

{
"query" : {
"match_all": {}
},
"sort": {
"_script":{
"script" : "s = doc['''foo'''].value; n =
org.elasticsearch.common.primitives.Ints.tryParse(s); if (n != null) {
String.format("%010d",n)} else { s }",
"type" : "string"
}
}
}'

That still might be too slow for the large result list. So, an ideal
solution here would be to pad integer values with 0 so all integers have
the same size during indexing. In other words, do the same thing that the
script above is doing but during indexing.

Here is a complete example if somebody wants to play with it:
https://gist.github.com/3973641

On Monday, October 29, 2012 8:59:26 AM UTC-4, Nick Hoffman wrote:

Hi Radu. That's an interesting idea. How would you score fields that
aren't numeric, though?

On Monday, 29 October 2012 05:49:37 UTC-4, Radu Gheorghe wrote:

Hello,

You can use the _source field, if you got it stored, in a script
within a custom_score query. For example, assuming "foo" is the name
of our field:

curl -XPOST localhost:9200/test/test/_search?pretty=true -d '{
"query": {
"custom_score": {
"query": {
"match_all": {}
},
"script": "if (_source.foo is Integer) { _source.foo }
else { _score/5 }"
}
}
}'

This will get your numeric values sorted as numbers. You'd need to
fill out the "else" part of the script with whatever values you want
to give to your non-numeric fields.

Best regards,
Radu

http://sematext.com/ -- ElasticSearch -- Solr -- Lucene

On Sun, Oct 28, 2012 at 9:10 PM, Nick Hoffman ni...@deadorange.com
wrote:

Hah, I asked the exact same question:
https://groups.google.com/forum/#!topic/elasticsearch/gds3sZKA-SI

On Sunday, 28 October 2012 14:45:25 UTC-4, James Hu wrote:

I have a field that is string type. Sometimes it will contain
integer

values such as 1, 2, 10, 20. Currently how it sorts those is 1, 10,
2, 20.

How do I go aboutsorting that numerically so that it's 1, 2, 10, 20?

--

--


(Nick Hoffman) #11

Ah, that's right. That should work quite well. The only question is how
inefficient this will be. I searched for documentation on the performance
of sort scripts, but couldn't find anything. Do you know what sort of
negative impact this would have on performance?

On Monday, 29 October 2012 15:31:11 UTC-4, Igor Motov wrote:

In this case, you would have to split the value of the foo field into
words, check if each word is an integer and replace it with a number with
leading zeros if it is.

On Monday, October 29, 2012 3:22:47 PM UTC-4, Nick Hoffman wrote:

Hi Igor. Your suggestion to do a field lookup is definitely an
improvement. How would the rest of your solution work if the field's value
contains numeric and non-numeric characters? E.g.
{ "foo" : "hello 30 world" }
{ "foo" : "hello 2 world" }
{ "foo" : "44 bar" }

Thanks,
Nick

On Monday, 29 October 2012 09:57:51 UTC-4, Igor Motov wrote:

I think the idea that Radu outlined might work. But I would propose a
few changes to the implementation. First of all, retrieving and parsing
source for every single result can be very expensive especially if records
have large sources. So, it might be better to replace _source lookup with
doc or _field lookup.

The expression "_source.foo is Integer" would work only if "foo" was an
integer in the source. In other words, if it was indexed as {"foo": 123} it
would work, but if it was indexed as {"foo": "123"} it wouldn't. In order
to handle the latter case, we need to actually parse the string.

To answer, Nick's question, we can simplify handling of non-numeric
cases, by switching from custom_score query to sort script .

So, if we combine all these suggestions, we will get something like this:

{
"query" : {
"match_all": {}
},
"sort": {
"_script":{
"script" : "s = doc['''foo'''].value; n =
org.elasticsearch.common.primitives.Ints.tryParse(s); if (n != null) {
String.format("%010d",n)} else { s }",
"type" : "string"
}
}
}'

That still might be too slow for the large result list. So, an ideal
solution here would be to pad integer values with 0 so all integers have
the same size during indexing. In other words, do the same thing that the
script above is doing but during indexing.

Here is a complete example if somebody wants to play with it:
https://gist.github.com/3973641

On Monday, October 29, 2012 8:59:26 AM UTC-4, Nick Hoffman wrote:

Hi Radu. That's an interesting idea. How would you score fields that
aren't numeric, though?

On Monday, 29 October 2012 05:49:37 UTC-4, Radu Gheorghe wrote:

Hello,

You can use the _source field, if you got it stored, in a script
within a custom_score query. For example, assuming "foo" is the name
of our field:

curl -XPOST localhost:9200/test/test/_search?pretty=true -d '{
"query": {
"custom_score": {
"query": {
"match_all": {}
},
"script": "if (_source.foo is Integer) { _source.foo }
else { _score/5 }"
}
}
}'

This will get your numeric values sorted as numbers. You'd need to
fill out the "else" part of the script with whatever values you want
to give to your non-numeric fields.

Best regards,
Radu

http://sematext.com/ -- ElasticSearch -- Solr -- Lucene

On Sun, Oct 28, 2012 at 9:10 PM, Nick Hoffman ni...@deadorange.com
wrote:

Hah, I asked the exact same question:
https://groups.google.com/forum/#!topic/elasticsearch/gds3sZKA-SI

On Sunday, 28 October 2012 14:45:25 UTC-4, James Hu wrote:

I have a field that is string type. Sometimes it will contain
integer

values such as 1, 2, 10, 20. Currently how it sorts those is 1, 10,
2, 20.

How do I go aboutsorting that numerically so that it's 1, 2, 10,
20?

--

--


(Clinton Gormley) #12

On Mon, 2012-10-29 at 12:36 -0700, Nick Hoffman wrote:

Ah, that's right. That should work quite well. The only question is
how inefficient this will be. I searched for documentation on the
performance of sort scripts, but couldn't find anything. Do you know
what sort of negative impact this would have on performance?

Doing these things live will never perform as well as precalculating
them and storing the extra data. You either pay once at index time, or
every time you search.

Personally I would add a field "sort_order" which would precalculate a
value which could be used for a simple string sort.

For instance, "Nexus 7" and "Nexus 10" might be indexed as "nexus
000007" and "nexus 000010", so a simple string sort would give you the
right order

clint

--


(Igor Motov) #13

I agree with Clinton, it will be fairly inefficient especially if you have
large result sets since this script will be executed for every single
record in your results. The best way to do it is to implement the same
logic on the client or as a custom analyzer.

On Monday, October 29, 2012 3:41:35 PM UTC-4, Clinton Gormley wrote:

On Mon, 2012-10-29 at 12:36 -0700, Nick Hoffman wrote:

Ah, that's right. That should work quite well. The only question is
how inefficient this will be. I searched for documentation on the
performance of sort scripts, but couldn't find anything. Do you know
what sort of negative impact this would have on performance?

Doing these things live will never perform as well as precalculating
them and storing the extra data. You either pay once at index time, or
every time you search.

Personally I would add a field "sort_order" which would precalculate a
value which could be used for a simple string sort.

For instance, "Nexus 7" and "Nexus 10" might be indexed as "nexus
000007" and "nexus 000010", so a simple string sort would give you the
right order

clint

--


(rpsandiford) #14

(Sorry if this ends up being repeated - I originally did a 'reply' from my email and it didn't seem to show up :()

Another index-time possibility is to use a magnitude notation prefix for all the numeric portions.

So – if the number is 1 digit long, prefix it with a ‘1’, if it is two digits, prefix with a ‘2’, and so on. If you think you’ll exceed 10 digits anywhere, then use a two-digit prefix (i.e. 01 and 02 respectively for a 1 digit and a 2 digit number). That gets around having long tokens (e.g. a whole bunch of leading zeroes) which might be an issue, and also allows you some more flexibility on how long an integer you might encounter without having to re-index everything to put more leading zeroes on the front if you exceed your initial ‘max length’ number. (i.e. a 2 digit mag notation prefix will handle up to a 99 digit number)

So, 1, 2, 10, 20 would be indexed as 011, 012, 0210, 0220, which will sort ‘alphabetically’ (i.e. LTR sort) in the correct numeric sequence.

P.S. – you may also want to consider decimals. e.g. if “12.5” should sort before “12.12” (for example), then you’ll want to include a magnitude notation following a decimal (i.e. you’d have 312.15 and 312.212). However, if “12.5” should sort after “12.12”, then you’ll want to avoid placing a magnitude notation prefix on digits following the decimal point (i.e. you’d have 312.5 and 312.12…)

Bob Sandiford | Principal Engineer | SirsiDynix

P: 800.288.8020 X6943 | [hidden email]

www.sirsidynix.com

Join the conversation: Like us on Facebook! Follow us on Twitter!


(Jörg Prante) #15

Hey,

yes, Clinton is of course right, it's more performant to let the Lucene
analyzer create sort keys in advance.

An old technique from the era of the first commercial RDMS (back in the
70/80ties, when RAM was tight) is encoding numbers to be sorted with
different lengths with a prefix containing the length. This encoding
eliminates leading zeroes and allows binary sort.

I just hacked a plugin together with a Lucene token filter. An
implementation of a natural sort key can be found
at https://github.com/jprante/elasticsearch-analysis-naturalsort

Cheers,

Jörg

On Sunday, October 28, 2012 8:33:58 PM UTC+1, James Hu wrote:

Awesome, thanks for pointing that out! I think leading zeros is a great
solution. I'm now using 10 character length with leading zeros and it
works
well. Maybe you should consider using something conservative like 15 so
that
it will never surpass that.

--
View this message in context:
http://elasticsearch-users.115913.n3.nabble.com/Sorting-a-string-field-numerically-tp4024557p4024561.html
Sent from the ElasticSearch Users mailing list archive at Nabble.com.

--


(system) #16