Return substring of analyzed, non-stored text field


(Johnny Park) #1

Hi. I'd like to get the first 100 characters of an analyzed, non-stored text field. (Imagine I'm doing full-text search on documents, and I want to return the documents with just snippets.) I read https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-scripting-fields.html and considered the following options:

  1. doc values: I don't think I can use doc values because it's an analyzed text field. (Enabling fielddata seems expensive and doesn't seem to fit this use case.)
  2. stored fields: I'm currently not storing the text field.
  3. _source: this option seems OK.

The following search request body works, where "text" is the name of the text field.

{
   "script_fields": {
      "substring": {
         "script": {
            "lang": "painless",
            "inline": "params._source.text.substring(0, 100)"
         }
      }
   }
}

Is this the right approach to this scenario? Or should I consider enabling fielddata so that I can use doc-values? Thank you.
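
One caveat with the script above: substring(0, 100) throws an exception when the text is shorter than 100 characters, because Painless strings are Java strings. A minimal sketch of a guarded version follows. It assumes "text" holds a single string, and it returns an empty string when the field is missing, which is only one possible choice. It keeps the "inline" key used in this thread; later Elasticsearch releases renamed that key to "source".

```json
{
   "script_fields": {
      "substring": {
         "script": {
            "lang": "painless",
            "inline": "def t = params._source.text; if (t == null) { return ''; } return t.length() > 100 ? t.substring(0, 100) : t;"
         }
      }
   }
}
```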


(Zachary Tong) #2

Hm, so some caveats here.

As you mentioned, doc values won't work with analyzed text fields. They're only available for keyword (not analyzed) fields. The advantage of using keyword is that you can configure normalizers to do some basic analysis, such as lowercasing, and you get the speed benefits of having the data in doc values.

If you go down this route, the best option is probably to add a multi-field of type keyword, with the inverted index disabled ("index": false) and doc values kept enabled ("doc_values": true). As a downside, you'll pay the price of indexing another field with doc values (more disk usage, slower indexing, etc.).
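
As a rough sketch of that mapping, the fragment below shows only the "properties" part for the "text" field from the question. The sub-field name "raw" is a hypothetical choice, and the surrounding index or type wrapper depends on the Elasticsearch version, so it is left out here.

```json
{
   "properties": {
      "text": {
         "type": "text",
         "fields": {
            "raw": {
               "type": "keyword",
               "index": false,
               "doc_values": true
            }
         }
      }
   }
}
```

With a mapping along these lines, a script could read the value through doc['text.raw'].value instead of params._source.text. Whether that pays off for long document bodies is something to test rather than assume.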

OTOH, the approach you're using with _source will work just fine... except it doesn't retrieve the analyzed text either. The _source is stored verbatim from the original JSON document that you sent to Elasticsearch, so it represents the text as it was before analysis.

If that's OK with you, the _source option is likely the best so long as your search requests only retrieve reasonably small result sets (10, 100, etc. but not 10,000).

Could you perhaps use/abuse the Highlighters to return the desired fragments? Highlighters perform analysis before returning the fragment, so they may be better suited if you need the analyzed text.
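
A hedged sketch of what that might look like: the request below asks for a single fragment of roughly 100 characters from "text", and uses no_match_size so that the beginning of the field comes back even when nothing in it matches. The match query and its search terms are placeholders, not something from this thread.

```json
{
   "query": {
      "match": { "text": "example search terms" }
   },
   "highlight": {
      "fields": {
         "text": {
            "fragment_size": 100,
            "number_of_fragments": 1,
            "no_match_size": 100
         }
      }
   }
}
```

Note that when the query does match, the fragment is taken from around the match rather than from the start of the field, so this fits snippets better than fixed "first 100 characters" output.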


(Johnny Park) #3

Hi Zachary. Thanks for the detailed response! I am actually using highlighters as well. I should clarify the above use case is really to generate document titles where the original source doesn't have one. Once the user clicks on one of the "titles", then the full document is shown with highlighting and other nested data.

Your note about multi-field is interesting and I'll keep it in mind to see if I can apply it somewhere else in the future. Thanks for the tip.

My question is answered, but a more "root cause" question is: is there a good way to generate document titles? I'm not expecting one from Elasticsearch, but just wondering if other users have come across this problem. Thanks readers.

