Sentence vector comparisons. Without titles / less direct responses

I had a few questions after reading this article

This seems to assume the documents in the corpus have been summarized by virtue of having a "title" or a "question" that the document is relevant to.

  1. What if I wanted to use embedding-based search on a corpus where documents do not have titles? Do I then compare the query vector to the whole document vector? Or do I compare the query vector to the vector of each sentence in each document? Would these sentence vector comparisons still be effective with documents that are at least a paragraph long (and not concise questions / titles)?

  2. Furthermore, what if the document is relevant to the query, but no particular sentence in the document answers the query on its own? For example, the query is "Queen Elizabeth birth day". The document is "Queen Elizabeth is the queen of England .... few irrelevant sentences ... SHE was born in xxxx". This document answers the query. However, it takes two sentences together to answer it: one gives the birth date, and another, placed far away, tells us who the person is whose birthday is given.

Hello @abeerunscore96
Sorry for the late reply!

A caveat: as Elasticsearch engineers, we are not data scientists or NLP experts. We build tools that we hope can help data science people like yourself. The questions you raise are very interesting, but they are not about Elasticsearch vector functionality, so I don't have precise answers and can only suggest the following:

  1. Indeed, the Universal Sentence Encoder was designed to encode texts up to the length of a short paragraph and may not work well on long texts. You would need to test how long your texts can be before the results degrade. As for comparing the query vector with the vector of each sentence in a document, I can see that working. Before sentence embeddings were introduced, people used word embeddings to find similar sentences, so I can see how sentence embeddings could be used in the same way to find similar longer paragraphs.

  2. I assume your domain here is question answering. I would think that, in addition to vector models, you would need to apply other techniques to find an answer (e.g. entity recognition, parsing an answer from the text).
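
To make the per-sentence comparison from point 1 concrete, here is a minimal Python sketch. It is only an illustration under loud assumptions: `toy_embed` is a hypothetical stand-in (plain word counts) for a real sentence encoder such as the Universal Sentence Encoder, the sentence splitter is a naive regex, and the function names are made up for this example. None of it is Elasticsearch functionality. A document is scored by its best-matching sentence.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for a real sentence encoder.

    Returns a sparse word-count vector. A real encoder would return a
    dense vector, and cosine() would then operate on arrays instead.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def split_sentences(document):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", document)
    return [p.strip() for p in parts if p.strip()]


def score_document(query, document, embed=toy_embed):
    """Return (best_score, best_sentence) over the document's sentences."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(s)), s) for s in split_sentences(document)]
    return max(scored, key=lambda pair: pair[0], default=(0.0, ""))


query = "Queen Elizabeth birth day"
doc = "Queen Elizabeth is the queen of England. She was born in xxxx."
score, sentence = score_document(query, doc)
print(round(score, 3), "|", sentence)
```

With this toy embedding, the best match is the first sentence, while "She was born in xxxx." shares no words with the query and scores zero on its own. A real sentence encoder would probably be less brittle than word counts, but the example still shows why point 2 is hard for a purely per-sentence comparison.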
