Is it possible to run a text_expansion query on a field that has an array of tokens generated by ELSER?

Hi, we are using Elastic 8.10.4.

We have a process that ingests documents (e.g. PDFs), runs a "script" processor to split the extracted text into multiple chunks, and then runs an inference processor for ELSER on each of those chunks. The results end up in a single field, "ml_text", holding an array of the token sets produced by ELSER, like:

...
"_source": {
  "passages": [
    { "text": "abc" },
    { "text": "def" }
  ],
  "ml_text": [
    { "tokens": <tokens from ELSER on passages[0].text>, "model_id": ".elser_model_1" },
    { "tokens": <tokens from ELSER on passages[1].text>, "model_id": ".elser_model_1" }
  ]
}
...
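
For context, the ingest pipeline is conceptually something like the following (a heavily simplified sketch: the pipeline name, the "text" source field, and the naive sentence split are just placeholders, and in our real pipeline a final script processor collects the per-passage results into the top-level "ml_text" array shown above):

PUT _ingest/pipeline/chunking_pipeline
{
  "processors": [
    {
      "script": {
        "description": "Naive placeholder: split the extracted text into passages",
        "lang": "painless",
        "source": "ctx.passages = []; for (String s : ctx.text.splitOnToken('.')) { ctx.passages.add(['text': s]) }"
      }
    },
    {
      "foreach": {
        "field": "passages",
        "processor": {
          "inference": {
            "model_id": ".elser_model_1",
            "target_field": "_ingest._value.ml",
            "field_map": {
              "_ingest._value.text": "text_field"
            },
            "inference_config": {
              "text_expansion": {
                "results_field": "tokens"
              }
            }
          }
        }
      }
    }
  ]
}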

The question is: is it possible to run a text_expansion query on "ml_text" for each of the chunks and return the matching chunk text from the "passages" array?

Hello Sneha!

Yes, this is possible. You need to set up your index so that the chunks are stored in a nested field (a mapping sketch follows the query below), and then run a text_expansion query inside a nested query:

GET chunked_index/_search
{
  "query": {
    "nested": {
      "path": "ml_text",
      "query": { 
        "text_expansion": {
          "ml_text.tokens": {
            "model_id": ".elser_model_1",
            "model_text": "some text"
          }
        }
      },
      "inner_hits": {}
    }
  }
}
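
For reference, "the chunks in a nested field" means a mapping roughly along these lines (a sketch only, assuming the index and field names from your example; the token maps ELSER produces are typically mapped as rank_features):

PUT chunked_index
{
  "mappings": {
    "properties": {
      "ml_text": {
        "type": "nested",
        "properties": {
          "tokens": { "type": "rank_features" },
          "text": { "type": "text" }
        }
      }
    }
  }
}

The "text" subfield is optional and only matters for the first correlation strategy described below.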

Note the inner_hits property in the query: it ensures that the matching nested chunks are returned along with the documents, ordered by decreasing relevance within each document (each with its own relevance score). So within a returned document, inner_hits.ml_text.hits.hits[0] contains the most relevant chunk of that doc, inner_hits.ml_text.hits.hits[1] the second most relevant, and so on.

In order to correlate the chunks with the actual passages, you can follow either of these two strategies:

  1. Embed the passage text in the chunks, so that it is directly accessible from inner_hits;
  2. Postprocess the documents: for each doc, read the ordinal of the top chunk (inner_hits.ml_text.hits.hits[0]._nested.offset, shown in the abbreviated response below) and look up the passage at that position in the "passages" array of the same doc.
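
To illustrate where that offset lives, a single returned hit looks roughly like this (heavily abbreviated, values elided):

"hits": {
  "hits": [
    {
      "_id": "...",
      "inner_hits": {
        "ml_text": {
          "hits": {
            "hits": [
              {
                "_nested": { "field": "ml_text", "offset": 1 },
                "_score": ...,
                "_source": { ... }
              }
            ]
          }
        }
      }
    }
  ]
}

In this example the top chunk is ml_text[1], so the matching passage is passages[1].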

Hope this helps!


Hi Adam,

Thank you so much for your help!

Re: 1. Embed the passage text in the chunks, so that it is directly accessible from inner_hits;

Do you mean that instead of storing the texts in "passages", we store the text field within each object of "ml_text"?

Yes, correct - store each passage text right next to the corresponding vector embedding.
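
For example, if each ml_text object also carries the chunk's text (as in the mapping sketch above, where the "text" subfield is an assumption about your setup), you can ask inner_hits to return just that text and skip the large token maps. A sketch:

GET chunked_index/_search
{
  "_source": false,
  "query": {
    "nested": {
      "path": "ml_text",
      "query": {
        "text_expansion": {
          "ml_text.tokens": {
            "model_id": ".elser_model_1",
            "model_text": "some text"
          }
        }
      },
      "inner_hits": {
        "size": 1,
        "_source": {
          "includes": ["ml_text.text"]
        }
      }
    }
  }
}

Here "size": 1 keeps only the best-matching chunk per document (the default is 3), and the "_source" filter means only the chunk text comes back, not the token weights.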

