During indexing, can I create a custom lucene stored field that later can be returned from a query?

Is it possible to write a TokenFilter or some other plugin code that, during indexing of a document, can create a custom lucene stored field for this document?

And, if this is possible, can this new custom field be returned in the _search results using the "fields" parameter?

If so, any pointers on how to implement this would be much appreciated!

You can easily create a new field based on another field using copy_to:
https://www.elastic.co/guide/en/elasticsearch/reference/current/copy-to.html

You can apply different analysis (stored or not stored for example) to
these derived fields and they can be retrieved separately with the fields
parameter.
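
A minimal sketch of such a mapping and search request, written as Python dictionaries so the JSON bodies are easy to inspect. The field names `body` and `body_copy` and the `whitespace` analyzer are hypothetical choices for illustration. The field type and the name of the search parameter depend on the Elasticsearch version (older releases use `string` and `fields`, newer ones `text` and `stored_fields`), so both are left as arguments. Keep in mind that a stored field returns the value that was copied into it, not the tokens produced by analysis.

```python
import json


def build_mapping(text_type="text"):
    """Return mapping properties where `body` is copied into a stored field.

    `body` and `body_copy` are hypothetical field names.
    """
    return {
        "properties": {
            "body": {"type": text_type, "copy_to": "body_copy"},
            "body_copy": {
                "type": text_type,
                "analyzer": "whitespace",  # assumed; any configured analyzer
                "store": True,             # keep a retrievable copy in Lucene
            },
        }
    }


def build_search(query_text, stored_param="stored_fields"):
    """Return a search body that asks for the stored copy by name."""
    return {
        "query": {"match": {"body_copy": query_text}},
        stored_param: ["body_copy"],  # pass "fields" for older versions
    }


mapping = build_mapping()
search = build_search("lucene")
print(json.dumps(mapping, indent=2))
print(json.dumps(search, indent=2))
```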

I see what you are saying, but I cannot see how it can achieve what I want.

What I want is to run some custom analysis during indexing, store the results of this analysis, and then return those results at search time.

I could be mistaken, but using copy_to, wouldn't the data returned in the new field be the copy of the first field, and not the results of any analysis done on it?

Thanks for your help on this!

I do not think there is an out-of-the-box way to achieve what you want. You can
use the analyze API to analyze the text before it is indexed. If you use
Java, you can recreate the analysis service locally so that you do not even
have to execute an API request over the wire.
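
A rough sketch of the client-side variant, assuming the analyze API is called before indexing and its tokens are written into an extra field of the document. The field name `body_tokens` and the helper names are hypothetical, and the HTTP call itself is left out: the sample response below is a canned value in the shape the `_analyze` endpoint returns.

```python
def build_analyze_request(text, analyzer="standard"):
    """Body for a request to the _analyze endpoint."""
    return {"analyzer": analyzer, "text": text}


def tokens_from_response(response):
    """Pull the token strings out of an _analyze response."""
    return [t["token"] for t in response.get("tokens", [])]


def enrich(doc, target_field, analyze_response):
    """Return a copy of doc with the analysis result in a new field."""
    enriched = dict(doc)
    enriched[target_field] = tokens_from_response(analyze_response)
    return enriched


# Canned response; a real client would send build_analyze_request(...) instead.
sample_response = {
    "tokens": [
        {"token": "quick", "start_offset": 0, "end_offset": 5,
         "type": "<ALPHANUM>", "position": 0},
        {"token": "fox", "start_offset": 6, "end_offset": 9,
         "type": "<ALPHANUM>", "position": 1},
    ]
}

doc = {"body": "Quick fox"}
request_body = build_analyze_request(doc["body"])
enriched = enrich(doc, "body_tokens", sample_response)  # hypothetical field name
print(enriched)
```

Because the tokens end up in the document itself, they come back with `_source` like any other field.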

A custom plugin should be able to do exactly the same, just on the
server side: get the field, use the analysis service, index the result.

Can you elaborate on what kind of result you want to store?

"Storing results of analysis" is what Lucene does by default: it creates a token stream. "Returning the result" is done by what is known as matching queries against the index and retrieving document content from _source.

Note that for ES to work properly, you need to ensure an unmodified _source is kept throughout the indexing process. So you are correct: the best method to add hidden data to the index is by creating extra fields with the attribute store: true, which may be created dynamically (or not).

The mechanism ES provides to process field content is called field mapper. That means, you can create a plugin that controls a custom field type.

Instead of indexing just a string or number, you can index whatever you want from the input, under whatever field name you want. copy_to is just a built-in mechanism for selecting a field name, so there is no need to reinvent the wheel.

For example, the attachments field mapper is part of the Elasticsearch distribution. It can be studied to see how metadata is indexed alongside binary files: https://github.com/elastic/elasticsearch/tree/master/plugins/mapper-attachments

Another example: during field mapping, it is possible to detect the language and index just the language code instead of the original text. See my langdetect plugin at https://github.com/jprante/elasticsearch-langdetect/

@jprante Thanks for the detailed reply.

This is all related to the general idea of doing any kind of custom document processing during indexing and saving the results of this processing in a manner such that it can be retrieved via a _search. And, sometimes these results also participate in queries via a custom QueryParser.

I have looked through a number of mapper implementations, and while I could do this, the complexity and lack of documentation are a concern, particularly around maintenance when upgrading or back-porting to older versions of ES we may need to support.

Having now read quite a lot about document processing in ES, I think my best course of action is to do this processing before indexing the documents. When the processing is brief, I may do it in a custom endpoint in ES. Otherwise I will do it external to ES.
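
A small sketch of what the external route could look like, assuming the processed documents are sent through the bulk API. The `process` function is a stand-in for whatever custom processing is needed (here a hypothetical word count stored in `body_word_count`), and the index name is made up. Older ES versions also expect a `_type` in the action line, which is omitted here.

```python
import json


def process(doc):
    """Hypothetical custom processing: add a word count for the `body` field."""
    out = dict(doc)
    out["body_word_count"] = len(doc.get("body", "").split())
    return out


def to_bulk(index, docs):
    """Serialize (id, doc) pairs as a _bulk body: action line, then source line."""
    lines = []
    for doc_id, doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(process(doc)))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


bulk_body = to_bulk("docs", [("1", {"body": "a b c"}), ("2", {"body": "hello"})])
print(bulk_body)
```

Since the derived values are ordinary fields in `_source`, they are returned by a _search without any plugin code and survive upgrades unchanged.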

Thanks again for your help.

In your langdetect plugin, is the indexed language code unavailable to be retrieved via GET until after a refresh is done, as mentioned in https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-get.html#generated-fields?

Does retrieving query result document fields have the same limitation?

If so, how would I force a refresh so the language code will always be retrievable via GET or as a field returned in a query?

@jprante
In your langdetect plugin, is the indexed language code unavailable to be retrieved via GET until after a refresh is done, as mentioned in https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-get.html#generated-fields?

Does retrieving query result document fields have the same limitation?

If so, how would I force a refresh so the language code will always be retrievable via GET or as a field returned in a query?

(Sorry for the duplicate post, but I forgot to mention you explicitly so you probably didn't see the prior post.)