How to aggregate on the first character the right way

Hi all,

Could you please share the right way to build an aggregation on the first character of a field, in the most efficient way possible?

I saw an idea on Stack Overflow with the code below, but it is still not recommended because it requires enabling fielddata.


PUT foo
{
  "mappings": {
    "bar" : {
      "properties": {
        "name" : {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }, 
  "settings":  {
    "index": {
      "analysis": {
        "analyzer" : {
          "my_analyzer" : {
            "type" : "custom",
            "tokenizer" : "keyword",
            "filter" : [ "my_filter", "lowercase" ]
          }
        },
        "filter": {
          "my_filter": {
            "type": "truncate",
            "length": 1
          }
        }
      }
    }
  }
}
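
As I understand it, to actually aggregate on that name field you would also have to enable fielddata on it, something like this (just a sketch against the same pre-7.x mapping type bar):

PUT foo/_mapping/bar
{
  "properties": {
    "name": {
      "type": "text",
      "analyzer": "my_analyzer",
      "fielddata": true
    }
  }
}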

That Stack Overflow question is quite old. You can use a normalizer on a keyword field that removes all but the first character. That way you won't have to enable fielddata.

PUT foo
{
  "mappings": {
    "_doc": {
      "properties": {
        "name": {
          "type": "text",
          "fields": {
            "first_char": {
              "type": "keyword",
              "normalizer": "my_normalizer"
            }
          }
        }
      }
    }
  },
  "settings": {
    "index": {
      "analysis": {
        "normalizer": {
          "my_normalizer": {
            "type": "custom",
            "char_filter": [
              "my_filter"
            ]
          }
        },
        "char_filter": {
          "my_filter": {
            "type": "pattern_replace",
            "pattern": "(^.{0,1})(.*)",
            "replacement": "$1"
          }
        }
      }
    }
  }
}

PUT foo/_doc/1
{
  "name": "Bar"
}

GET foo/_search
{
  "size": 0,
  "aggs": {
    "common_first_characters": {
      "terms": {
        "field": "name.first_char",
        "size": 10
      }
    }
  }
}
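
For reference, running that search against the document above should return a single bucket for the first character, along these lines (response trimmed to just the aggregation):

{
  "aggregations": {
    "common_first_characters": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "B",
          "doc_count": 1
        }
      ]
    }
  }
}

Note that the normalizer above does not lowercase, so "Bar" lands in a "B" bucket; add the built-in lowercase filter to the normalizer if you want case-insensitive grouping.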

Thanks @abdon, the solution you provided works perfectly.
I am new to Elasticsearch, so it gets quite confusing once I start combining different criteria into one. :grin:

Hi @abdon, could you please advise me on Unicode characters? When I deal with them, the first character is not recognized correctly, and I also want to group local characters into default Latin-based groups:

  • 0-9 -> will go into the '#' group
  • É, Ê -> will go into the 'E' group.

Some sample documents:

{
  "name": "Én vios bar"
}
{
  "name": "Đông tail"
}

Thank you.

To replace the digits 0-9 with a #, you could use a second character filter.

The process of converting characters like É and Đ to their ASCII equivalents E and D is called folding, which you can achieve with the ASCII folding token filter.
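
As a quick check, you can see the folding in isolation with the _analyze API (asciifolding is a built-in filter, so this works without any index setup):

GET _analyze
{
  "tokenizer": "keyword",
  "filter": [ "asciifolding" ],
  "text": "Đông"
}

This should return a single token with the text Dong.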

Putting all of that together would result in something like this:

PUT foo
{
  "mappings": {
    "_doc": {
      "properties": {
        "name": {
          "type": "text",
          "fields": {
            "first_char": {
              "type": "keyword",
              "normalizer": "my_normalizer"
            }
          }
        }
      }
    }
  },
  "settings": {
    "index": {
      "analysis": {
        "normalizer": {
          "my_normalizer": {
            "type": "custom",
            "char_filter": [
              "my_filter1",
              "my_filter2"
            ],
            "filter": "asciifolding"
          }
        },
        "char_filter": {
          "my_filter1": {
            "type": "pattern_replace",
            "pattern": "(^.{0,1})(.*)",
            "replacement": "$1"
          },
          "my_filter2": {
            "type": "pattern_replace",
            "pattern": "(^[0-9])",
            "replacement": "#"
          }
        }
      }
    }
  }
}

PUT foo/_doc/1
{
  "name": "Én vios bar"
}

PUT foo/_doc/2
{
  "name": "Đông tail"
}

PUT foo/_doc/3
{
  "name": "123"
}


GET foo/_search
{
  "size": 0,
  "aggs": {
    "common_first_characters": {
      "terms": {
        "field": "name.first_char",
        "size": 10
      }
    }
  }
}
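
With the three documents above, the aggregation should come back with one bucket per group, along these lines (response trimmed to just the aggregation):

{
  "aggregations": {
    "common_first_characters": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        { "key": "#", "doc_count": 1 },
        { "key": "D", "doc_count": 1 },
        { "key": "E", "doc_count": 1 }
      ]
    }
  }
}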
