How to aggregate on the first character the right way

Hi all,

Could you please share the right way to build an aggregation on the first character of a field, in the most efficient way possible?

I saw an idea on Stack Overflow with the code below, but it is still not recommended because it requires enabling fielddata.


PUT foo
{
  "mappings": {
    "bar" : {
      "properties": {
        "name" : {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }, 
  "settings":  {
    "index": {
      "analysis": {
        "analyzer" : {
          "my_analyzer" : {
            "type" : "custom",
            "tokenizer" : "keyword",
            "filter" : [ "my_filter", "lowercase" ]
          }
        },
        "filter": {
          "my_filter": {
            "type": "truncate",
            "length": 1
          }
        }
      }
    }
  }
}
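
As I understand it, to actually aggregate on that name field you would also have to enable fielddata on it, something like this (just a sketch against the same pre-7.x mapping type bar):

PUT foo/_mapping/bar
{
  "properties": {
    "name": {
      "type": "text",
      "analyzer": "my_analyzer",
      "fielddata": true
    }
  }
}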

That Stack Overflow question is quite old. You can use a normalizer on a keyword field that removes all but the first character. That way you won't have to enable fielddata.

PUT foo
{
  "mappings": {
    "_doc": {
      "properties": {
        "name": {
          "type": "text",
          "fields": {
            "first_char": {
              "type": "keyword",
              "normalizer": "my_normalizer"
            }
          }
        }
      }
    }
  },
  "settings": {
    "index": {
      "analysis": {
        "normalizer": {
          "my_normalizer": {
            "type": "custom",
            "char_filter": [
              "my_filter"
            ]
          }
        },
        "char_filter": {
          "my_filter": {
            "type": "pattern_replace",
            "pattern": "(^.{0,1})(.*)",
            "replacement": "$1"
          }
        }
      }
    }
  }
}

PUT foo/_doc/1
{
  "name": "Bar"
}

GET foo/_search
{
  "size": 0,
  "aggs": {
    "common_first_characters": {
      "terms": {
        "field": "name.first_char",
        "size": 10
      }
    }
  }
}
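
For reference, running that search against the document above should return a single bucket for the first character, along these lines (response trimmed to just the aggregation):

{
  "aggregations": {
    "common_first_characters": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "B",
          "doc_count": 1
        }
      ]
    }
  }
}

Note that the normalizer above does not lowercase, so "Bar" lands in a "B" bucket; add the built-in lowercase filter to the normalizer if you want case-insensitive grouping.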

Thanks @abdon, the solution you provided works perfectly.
I am new to Elasticsearch, so it gets quite confusing once I start combining different criteria into one. :grin:

Hi @abdon, could you please advise me on Unicode characters? When I deal with them, the first character is not recognized correctly, and I also want to group local characters into default Latin-based groups:

  • 0-9 -> will go into the '#' group
  • É, Ê -> will go into the 'E' group.

Some sample documents:

{
  "name": "Én vios bar"
}
{
  "name": "Đông tail"
}

Thank you.

To replace the digits 0-9 with a #, you could use a second character filter.

The process of converting characters like É and Đ to their ASCII equivalents E and D is called folding, which you can achieve with the ASCII folding token filter.
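
As a quick check, you can see the folding in isolation with the _analyze API (asciifolding is a built-in filter, so this works without any index setup):

GET _analyze
{
  "tokenizer": "keyword",
  "filter": [ "asciifolding" ],
  "text": "Đông"
}

This should return a single token with the text Dong.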

Putting all of that together would result in something like this:

PUT foo
{
  "mappings": {
    "_doc": {
      "properties": {
        "name": {
          "type": "text",
          "fields": {
            "first_char": {
              "type": "keyword",
              "normalizer": "my_normalizer"
            }
          }
        }
      }
    }
  },
  "settings": {
    "index": {
      "analysis": {
        "normalizer": {
          "my_normalizer": {
            "type": "custom",
            "char_filter": [
              "my_filter1",
              "my_filter2"
            ],
            "filter": "asciifolding"
          }
        },
        "char_filter": {
          "my_filter1": {
            "type": "pattern_replace",
            "pattern": "(^.{0,1})(.*)",
            "replacement": "$1"
          },
          "my_filter2": {
            "type": "pattern_replace",
            "pattern": "(^[0-9])",
            "replacement": "#"
          }
        }
      }
    }
  }
}

PUT foo/_doc/1
{
  "name": "Én vios bar"
}

PUT foo/_doc/2
{
  "name": "Đông tail"
}

PUT foo/_doc/3
{
  "name": "123"
}


GET foo/_search
{
  "size": 0,
  "aggs": {
    "common_first_characters": {
      "terms": {
        "field": "name.first_char",
        "size": 10
      }
    }
  }
}
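
With the three documents above, the aggregation should come back with one bucket per group, along these lines (response trimmed to just the aggregation):

{
  "aggregations": {
    "common_first_characters": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        { "key": "#", "doc_count": 1 },
        { "key": "D", "doc_count": 1 },
        { "key": "E", "doc_count": 1 }
      ]
    }
  }
}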
