We have a process that ingests documents (like PDFs), runs a "script" processor to split the text from those documents into multiple chunks, and then runs an inference processor for ELSER on each of those chunks. The results are dumped into a single field, "ml_text", holding an array of the tokens produced by ELSER, like:
...
"_source":{
"passages":[
{ "text": "abc"},
{ "text": "def"}
],
"ml_text": [
{"tokens": <tokens from ELSER on passages[0].text >, "model_id": ".elser_model_1"},
{"tokens": <tokens from ELSER on passages[1].text >, "model_id": ".elser_model_1"}
]
...
The question is: is it possible to run a text_expansion query on "ml_text" for each of the chunks and return the matching chunk text from the "passages" array?
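This is possible with a text_expansion query wrapped in a nested query, provided "ml_text" is mapped as a nested field. A minimal mapping sketch follows; the rank_features type for the tokens is an assumption based on the document shape above, so adjust it to your actual mapping:

"mappings": {
  "properties": {
    "passages": {
      "type": "nested",
      "properties": { "text": { "type": "text" } }
    },
    "ml_text": {
      "type": "nested",
      "properties": {
        "tokens": { "type": "rank_features" },
        "model_id": { "type": "keyword" }
      }
    }
  }
}

The search then looks something like this (my-index and the query text are placeholders):

GET my-index/_search
{
  "query": {
    "nested": {
      "path": "ml_text",
      "query": {
        "text_expansion": {
          "ml_text.tokens": {
            "model_id": ".elser_model_1",
            "model_text": "your search query here"
          }
        }
      },
      "inner_hits": {}
    }
  }
}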
Note the inner_hits property, which ensures the matching nested chunks are returned along with the documents, in order of decreasing relevance per doc (each with its own relevance score). So within a returned document, inner_hits.ml_text.hits.hits[0] will contain the most relevant chunk in that doc, inner_hits.ml_text.hits.hits[1] the second most relevant, and so on.
In order to correlate the chunks with the actual passages, you can follow either of these two strategies (both sketched below):
- Embed the passage text in the chunks; this way it's directly accessible from inner_hits.
- Postprocess the documents: for each doc, get the ordinal of the top chunk (inner_hits.ml_text.hits.hits[0]._nested.offset) and look up the passage at the same position in the "passages" array of that doc.
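For the first strategy, a minimal sketch of the idea (assuming the "passages" and "ml_text" arrays stay parallel, as they do in the pipeline described above) is to append a script processor to the ingest pipeline that copies each passage's text next to its tokens:

{
  "script": {
    "description": "Copy each passage's text into the corresponding ml_text chunk",
    "source": "for (int i = 0; i < ctx.ml_text.size(); i++) { ctx.ml_text[i].text = ctx.passages[i].text; }"
  }
}

With that in place, each inner hit's _source contains the chunk text directly, so no extra lookup is needed at query time.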
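For the second strategy, note that the offset lives under the _nested key of each inner hit. An illustrative response fragment (the score and offset values are made up):

"inner_hits": {
  "ml_text": {
    "hits": {
      "hits": [
        {
          "_nested": { "field": "ml_text", "offset": 1 },
          "_score": 12.3,
          "_source": { "model_id": ".elser_model_1" }
        }
      ]
    }
  }
}

Here offset 1 means the top chunk is ml_text[1], so the matching text is passages[1].text in the same document.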