How can I sort strings with numbers alphabetically?

Hey guys, one quick question.

I have a set of example records:

Test
123 sesi
501 Alva
rout

and I want those to be sorted in asc/desc order in case insensitive and alphabeticall order, just like this:

501 Alva
rout
123 sesi
Test

so as you can see the first letter is dictating the order from A-Z (or other way around).

How can I achieve that? Right now what I was able to achieve is case insensitive sorting through the normalizer so it sorts like this:

123 sesi
501 Ava
rout
Test

but the problem is that numbers are being sorted first and then the letters. Any ideas how can I solve this? I'm sharing with you guys my current configuration.

      "normalizer":  {
        "my_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase", "asciifolding"]
        }
      }

and then I'm using is as:

  "mappings": {
    "dynamic_templates": [
            {
        "data_key": {
          "mapping": {
            "type": "keyword",
            "normalizer": "my_normalizer",
            "index_options": "docs",
            "copy_to": "all_search_fields"
          },
          "match": "*_*_key"
        }
      },
      {
        "data_string": {
          "mapping": {
            "norms": false,
            "type": "text",
            "fields": {
              "raw": {
                "type": "keyword",
                "index_options": "docs",
                "normalizer": "my_normalizer"
              }
            },
            "index_options": "positions",
            "copy_to": "all_search_fields"
          },
          "match": "*_*_string"
        }
      }
      ]

You can to use char_filter to remove number in text and then add it in your normalizer.

Like this:

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": [
            "my_char_filter"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": """(\d+ )""",
          "replace": ""
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "fields": {
          "raw": {
            "type": "keyword",
            "index_options": "docs",
            "normalizer": "my_normalizer"
          }
        }
      }
    }
  }
}


POST my-index-000001/_bulk
{"index":{}}
{"description":"Test"}
{"index":{}}
{"description":"123 sesi"}
{"index":{}}
{"description":"501 Alva"}
{"index":{}}
{"description":"rout"}

POST my-index-000001/_search?filter_path=hits.hits._source
{
 "sort": [
   {
     "description.raw": {
       "order": "asc"
     }
   }
 ]
}

Reponse:

{
  "hits": {
    "hits": [
      {
        "_source": {
          "description": "501 Alva"
        }
      },
      {
        "_source": {
          "description": "rout"
        }
      },
      {
        "_source": {
          "description": "123 sesi"
        }
      },
      {
        "_source": {
          "description": "Test"
        }
      }
    ]
  }
}
1 Like

Thanks for this hint!

Though now after applying that filter I'm facing this exception

Elasticsearch exception [type=illegal_argument_exception, reason=unknown setting [index.char_filter.my_char_filter.type] please check that any required plugins are installed, or check the breaking changes documentation for removed settings]

any idea what could be the reason?

what is your version ES?

Ok, so my mistake, sorry. Apparently I've placed the char_filter outside analysis. Now I've placed it inside and the application is building.

But, the sorting still works exactly the same as before.

@RabBit_BR - are you sure that the pattern should be "pattern": """(\d+ )""" ? Those three quotation marks are raising my concerns.

ElasticsearchParseException[Failed to parse content to map]; nested: JsonParseException[Unexpected character ('"' (code 34)): was expecting comma to separate Object entries

edit: Im using image elasticsearch:7.16.1

Yes. But you can change to:

"pattern": "(\\d+ )"

Ok, so I went for

      "normalizer":  {
        "my_normalizer": {
          "type": "custom",
          "char_filter": ["my_char_filter"],
          "filter": ["lowercase", "asciifolding"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d+ )",
          "replace": ""
        }
      }

and of course added it to my type mapping.

Now it works like a charm, thanks @RabBit_BR , I owe you a large beer :beers:!

1 Like

@RabBit_BR I have one more question, sorry to bother. I noticed one more thing.

Let's say I have a set of data:

Bstring
4 alpha
3 alpha

and I sort it with my number-ignoring and case insensitive normalizer then the output is

4 alpha
3 alpha
Bstring

but I think it would be amazing to add one more thing. Is it possible to achieve this output?

3 alpha
4 alpha
Bstring

So the real question is - if you have two equal string that start with different numbers, can we add numerical sorting to those?

you can try something. You will have another field that indexes as a keyword.
In luck you will have a tiebreaker which is this new field.

Would be like this:

{
   "sort": [
     {
       "description.raw": {
         "order": "asc"
       }
     },
     {
       "description.keyword": {
         "order": "asc"
       }
     }
   ]
  }

All example:

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": [
            "my_char_filter"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": """(\d+ )""",
          "replace": ""
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          },
          "raw": {
            "type": "keyword",
            "index_options": "docs",
            "normalizer": "my_normalizer"
          }
        }
      }
    }
  }
}


POST my-index-000001/_bulk
{"index":{}}
{"description":"Test"}
{"index":{}}
{"description":"123 sesi"}
{"index":{}}
{"description":"501 Alva"}
{"index":{}}
{"description":"rout"}
{"index":{}}
{"description":"Bstring"}
{"index":{}}
{"description":"4 alpha"}
{"index":{}}
{"description":"3 alpha"}

POST my-index-000001/_search?filter_path=hits.hits._source
{
 "sort": [
   {
     "description.raw": {
       "order": "asc"
     }
   },
   {
     "description.keyword": {
       "order": "asc"
     }
   }
 ]
}

Response

{
  "hits": {
    "hits": [
      {
        "_source": {
          "description": "3 alpha"
        }
      },
      {
        "_source": {
          "description": "4 alpha"
        }
      },
      {
        "_source": {
          "description": "501 Alva"
        }
      },
      {
        "_source": {
          "description": "Bstring"
        }
      },
      {
        "_source": {
          "description": "rout"
        }
      },
      {
        "_source": {
          "description": "123 sesi"
        }
      },
      {
        "_source": {
          "description": "Test"
        }
      }
    ]
  }
}
1 Like

I was thinking about something similar, but I'm using dynamic_templates, can both solutions be mixed together?

My settings looks like this

"settings": {
    "analysis": {
      "normalizer":  {
        "my_normalizer": {
          "type": "custom",
          "char_filter": ["my_char_filter"],
          "filter": ["lowercase", "asciifolding"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d+ )",
          "replace": ""
        }
      }
    }
}

a then the mappings

  "mappings": {
    "dynamic_templates": [
      {
        "data_key": {
          "mapping": {
            "type": "keyword",
            "normalizer": "my_normalizer",
            "index_options": "docs",
            "copy_to": "all_search_fields"
          },
          "match": "*_*_key"
        }
      },
      {
        "data_string": {
          "mapping": {
            "norms": false,
            "type": "text",
            "fields": {
              "raw": {
                "type": "keyword",
                "index_options": "docs",
                "normalizer": "my_normalizer"
              }
            },
            "index_options": "positions",
            "copy_to": "all_search_fields"
          },
          "match": "*_*_string"
        }
      }
    ]
}

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.