Return substring of analyzed, non-stored text field

jdpark · April 21, 2017, 3:50pm

Hi. I'd like to get the first 100 characters of an analyzed, non-stored text field. (Imagine I'm doing fulltext search on documents, and I want to return the documents with just snippets.) I read https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-scripting-fields.html. I considered the following options:

doc-values: I don't think I can use doc-values because it's an analyzed text field. (Enabling fielddata seems expensive and doesn't seem to fit this use case.)
stored fields: I'm currently not storing the text field.
_source: this option seems OK.

The following GET works where "text" is the text field.

{
   "script_fields": {
      "substring": {
         "script": {
            "lang": "painless",
            "inline": "params._source.text.substring(0, 100)"
         }
      }
   }
}

Is this the right approach to this scenario? Or should I consider enabling fielddata so that I can use doc-values? Thank you.

polyfractal · April 28, 2017, 1:54pm

Hm, so some caveats here.

As you mentioned, doc values won't work with analyzed text fields... they're only available for keyword (not analyzed) fields. The advantage of using keyword is that you can configure normalizers to do some basic analysis, such as lowercasing. And you get the speed benefits of having the data in doc values.

If you go down this route, the best option is probably to add a multi-field which is set to keyword, with the inverted index disabled ("index": false; this was spelled "index": "no" before 5.0) and doc values kept enabled ("doc_values": true). The downside is that you'll pay the price of indexing another field with doc values (more on-disk usage, slower indexing, etc.).
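
For reference, a rough sketch of what that multi-field could look like. This is only the `properties` part of a mapping, and the sub-field name `raw` is made up for illustration. It assumes 5.x or later, where the keyword type takes a boolean `index`, so `false` plays the role of "no" here.

```json
{
   "properties": {
      "text": {
         "type": "text",
         "fields": {
            "raw": {
               "type": "keyword",
               "index": false,
               "doc_values": true
            }
         }
      }
   }
}
```

A script field could then read the untouched value with `doc['text.raw'].value` instead of going through _source. Like any mapping change, this only applies to documents indexed after the sub-field is added.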

OTOH, the approach you're using with _source will work just fine... except it is not retrieving the analyzed text either. The _source is stored verbatim from the original JSON document that you sent to Elasticsearch, so it represents the text as it was before analysis.

If that's OK with you, the _source option is likely the best so long as your search requests only retrieve reasonably small result sets (10, 100, etc. but not 10,000).
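
One thing to watch with the script in the original question: `substring(0, 100)` throws if the text is shorter than 100 characters (Painless hands this to Java's `String.substring`). A minimal sketch of a length-safe variant, assuming `text` is always present and holds a single string:

```json
{
   "script_fields": {
      "substring": {
         "script": {
            "lang": "painless",
            "inline": "def t = params._source.text; return t.length() > 100 ? t.substring(0, 100) : t;"
         }
      }
   }
}
```

On 6.0 and later the script key is `source` rather than `inline`.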

Could you perhaps use/abuse the Highlighters to return the desired fragments? Highlighters perform analysis before returning the fragment, so they may be better suited if you need the analyzed text.
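
If you try the highlighter route, something along these lines might do it. This is a sketch, not something tested against your index: `no_match_size` asks the highlighter to return the start of the field when there is nothing to highlight in it, and the `match` query on "some search terms" is just a placeholder for your real query.

```json
{
   "query": {
      "match": { "text": "some search terms" }
   },
   "highlight": {
      "fields": {
         "text": {
            "fragment_size": 100,
            "number_of_fragments": 1,
            "no_match_size": 100
         }
      }
   }
}
```

Because the field is not stored, the highlighter reads the text from _source, so _source has to stay enabled. When the field does match, you get a fragment around the match (wrapped in `<em>` tags by default) rather than the first 100 characters.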

jdpark · May 8, 2017, 5:12pm

Hi Zachary. Thanks for the detailed response! I am actually using highlighters as well. I should clarify the above use case is really to generate document titles where the original source doesn't have one. Once the user clicks on one of the "titles", then the full document is shown with highlighting and other nested data.

Your note about multi-field is interesting and I'll keep it in mind to see if I can apply it somewhere else in the future. Thanks for the tip.

My question is answered, but a more "root cause" question is: is there a good way to generate document titles? I'm not expecting one from Elasticsearch, but just wondering if other users have come across this problem. Thanks readers.

system · June 5, 2017, 5:21pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
