How to compute token counts efficiently?


(Luca Weihs) #1

To compute a custom script score function I'm working on, I need to retrieve the token counts for a collection of fields in my documents. I know that it's difficult to retrieve the field length norm value (or at least I couldn't find a way to do so), and that the recommended alternative is to use the token_count type, as in the following example mapping:

"doc": {
    "properties": {
      "title": {
        "type": "string",
        "analyzer" : "my_analyzer",
        "fields": {
          "token_count": {
            "type" : "token_count",
            "store" : "yes",
            "analyzer" : "my_analyzer"
          }
        }
      }
    }
}
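
Since the same sub-field has to be repeated for almost every field, the mapping can be generated rather than written by hand. What follows is only a minimal sketch in Python: the helper name `with_token_count` and the second field `body` are hypothetical and used purely for illustration, and the sketch only builds the mapping as a dictionary. It does not talk to Elasticsearch and does not change how often the text is analyzed.

```python
import json


def with_token_count(field_names, analyzer):
    """Build a mapping in which every string field gets a token_count sub-field.

    Hypothetical helper: it mirrors the example mapping above, using the
    same analyzer for the field and for its token_count sub-field.
    """
    properties = {}
    for name in field_names:
        properties[name] = {
            "type": "string",
            "analyzer": analyzer,
            "fields": {
                "token_count": {
                    "type": "token_count",
                    "store": "yes",
                    "analyzer": analyzer,
                }
            },
        }
    return {"doc": {"properties": properties}}


# "body" is an assumed extra field, not one from the original mapping.
mapping = with_token_count(["title", "body"], "my_analyzer")
print(json.dumps(mapping, indent=2))
```

For the `title` field this produces the same structure as the hand-written mapping above.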

This solution works fairly well for me, with one downside: because the token_count field has to be reanalyzed, it seems to dramatically slow down indexing (I'm doing this for almost every field and there are many fields, so indexing appears to take about twice as long). Is there any way to make this process more efficient? In particular, I always use the same analyzer for the token_count sub-field as for its parent field. Can the two not share information in some way, rather than doing the same work twice?

