Distinct collect first letters from field

Good day.

I want to create navigation in index by first letters like in vocabulary.

Ideally I need aggregation which will provide to me a set of all first letters from one field with first document id related to each letter.
Otherwise will be enough just a set of first letters collected from one field.

But I cannot find aggregation which can do this for me.

Any suggestions will be helpful. Thanks.

You can extract the first letter using script in terms aggregation, but I think the most effective way would be to actually index this first letter as a separate field, using a normalizer. Here is a couple of examples to get you started.

DELETE test

PUT test
{
  "settings": {
    "analysis": {
      "char_filter": {
        "first_letter_filter": {
          "type": "pattern_replace",
          "pattern": "(.).*",
          "replacement": "$1"
        }
      },
      "normalizer": {
        "first_letter_normilizer": {
          "type": "custom",
          "char_filter": ["first_letter_filter"],
          "filter": [
            "uppercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "berry": {
          "type": "keyword",
          "fields": {
            "first": {
              "type": "keyword",
              "normalizer": "first_letter_normilizer"
            }
          }
        }
      }
    }
  }
}

PUT test/doc/_bulk?refresh=true
{"index": {"_id": "1"}}
{"berry": "Blackcurrant"}
{"index": {"_id": "2"}}
{"berry": "Chili pepper"}
{"index": {"_id": "3"}}
{"berry": "Cranberry"}
{"index": {"_id": "4"}}
{"berry": "Eggplant"}
{"index": {"_id": "5"}}
{"berry": "Gooseberry"}
{"index": {"_id": "6"}}
{"berry": "Grape"}
{"index": {"_id": "8"}}
{"berry": "Guava"}
{"index": {"_id": "9"}}
{"berry": "Kiwifruit"}
{"index": {"_id": "10"}}
{"berry": "Lucuma"}
{"index": {"_id": "11"}}
{"berry": "Pomegranate"}
{"index": {"_id": "12"}}
{"berry": "Redcurrant"}
{"index": {"_id": "13"}}
{"berry": "Tomato"}


GET test/doc/_search
{
  "size": 0,
  "aggs": {
    "alphabet": {
      "terms": {
        "field": "berry.first",
        "size": 26,
        "order": {
          "_key": "asc"
        }
      }
    }
  }
}

GET test/doc/_search
{
  "size": 0,
  "aggs": {
    "alphabet": {
      "terms": {
        "field": "berry",
        "size": 26,
        "script" : {
            "source" : "_value.substring(0,1).toUpperCase()",
            "lang" : "painless"
        },
        "order": {
          "_key": "asc"
        }
      }
    }
  }
}

Thanks Igor.

I think it's doable for me.

But I have another question: may be I can store that values in another type which will hold for example only 26 records? And on each index update I can intercept new values and update only one record in that type. In that case I don't need every time to search through all index to find 26 letters.

Of course it can be done outside of ES, but it will significantly increase the time of processing data, because I need first insert to index new value, wait for index update, execute search and find new first record for that letter (I need elasticsearch sorting because I'm trying to move all collation logic to ES from application, and sorting depends on settings of icu_collation_keyword).

Without support for transactions, implementing something like this in a reliable fashion would be quite difficult.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.