Text similarity search with vector fields

@Julie_Tibshirani's post on Text similarity search with vector fields and the accompanying repo have been really helpful in jump-starting a QnA project I am working on. Thank you!

I have now migrated it to TensorFlow v2 and I modified it to suit our needs.

Is there a way I could improve my search functionality further by looking at the question summary (which is a description of what the question covers) and the answer content as well?

I also have some alternative (re-worded) versions of the same questions. I have been thinking of creating additional dense_vectors, adding them in separate fields (title_vector_1, title_vector_2, title_vector_3), and modifying the query so that it computes the cosine similarity on each one and then takes the max value. Is that the best way to go about it? Is there a way to not hard code separate vector fields and use a list or array type? What is the best way forward?

Also, the aforementioned post uses the universal-sentence-encoder v2 and I use the next version (v3). Would there be a benefit in using universal-sentence-encoder-multilingual-qa instead? I have tried it, but I am not sure how to make use of the extra embeddings and the product generated by np.inner() in the example. Any ideas?

Or could we perhaps retrain the model with the alternative versions of the questions? How would we do it?

Hello Leonardo, I added some thoughts about your questions below. I haven’t done extensive research in this area, so it would be great to hear from other community members about their experiences and suggestions.

is there a way to not hard code separate vector fields and use a list or array type?

It’s not currently possible to index multiple vectors into one field. So for now, you would need to create separate vector fields as you suggest. I can certainly see why you would want to be able to store multiple vectors in one field -- if you’d like, you could file a GitHub issue with this enhancement request (along with a description of your use case).
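In the meantime, the separate-fields approach could look roughly like the sketch below. It builds a script_score query body that takes the maximum cosine similarity across several vector fields. The helper name build_max_cosine_query is hypothetical, the field names are the ones from the question, and the `doc['field']` form follows the blog post; the way the field is passed to cosineSimilarity has changed between Elasticsearch releases, so please check the documentation for your version. It also assumes every document has all of the vector fields populated.

```python
def build_max_cosine_query(query_vector, vector_fields):
    """Build a script_score query body that scores each document by its
    best-matching vector field (hypothetical helper, for illustration)."""
    # One cosineSimilarity call per field, in the form used in the blog post.
    terms = [
        f"cosineSimilarity(params.query_vector, doc['{field}'])"
        for field in vector_fields
    ]
    # Math.max takes two arguments, so nest the calls to cover all fields.
    expr = terms[0]
    for term in terms[1:]:
        expr = f"Math.max({expr}, {term})"
    return {
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    # Add 1.0 so the score cannot be negative, as in the blog post.
                    "source": f"{expr} + 1.0",
                    "params": {"query_vector": list(query_vector)},
                },
            }
        }
    }


body = build_max_cosine_query(
    [0.1, 0.2, 0.3],
    ["title_vector_1", "title_vector_2", "title_vector_3"],
)
print(body["query"]["script_score"]["script"]["source"])
```

The resulting dictionary can be passed as the search body to whichever client you are using. Since the script is generated from a list of field names, adding another re-worded version only means adding one more field to the mapping and to the list.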

Also, the aforementioned post uses the universal-sentence-encoder v2 and I use the next version (v3). Would there be a benefit in using universal-sentence-encoder-multilingual-qa instead?

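There are a few different 'universal sentence encoder' models available, including the universal-sentence-encoder-multilingual and universal-sentence-encoder-multilingual-qa models discussed below.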

The universal-sentence-encoder-multilingual model could be an interesting drop-in replacement for the universal-sentence-encoder model used in the blog post. It would support cross-lingual retrieval, where the user’s question could be in a different language from the indexed questions.

The model universal-sentence-encoder-multilingual-qa is trained in quite a different way from the others -- it has one encoder for questions and another for answers, and the question vectors are encouraged to be close to relevant answer vectors in terms of distance. The model is designed for the 'question answering' task: given a question, we want to find a passage or text span in the index that best answers it. This is a different set-up from the 'find similar questions' task described in the blog post, since the documents there do not contain question-answer pairs. One approach would be to use the QA encoder in addition to the question similarity encoder, in some complementary way.
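To make the role of np.inner() a bit more concrete, here is a minimal sketch of one possible 'complementary' combination. It uses small made-up vectors in place of real embeddings; in practice the question vector and the answer vectors would come from the model's two encoders. The function name rank_answers, the weight parameter and the simple weighted sum are assumptions for illustration only, not something taken from the model's documentation.

```python
import numpy as np


def rank_answers(question_vec, answer_vecs, similar_question_scores=None, weight=0.5):
    """Rank indexed answers for one question (hypothetical helper).

    np.inner gives one dot product per answer vector: a higher value means
    the answer embedding lies closer to the question embedding.
    """
    qa_scores = np.inner(question_vec, answer_vecs)  # shape: (n_answers,)
    if similar_question_scores is None:
        combined = qa_scores
    else:
        # Assumed blend: weighted sum of the QA score and a separate
        # question-to-question similarity score for the same documents.
        combined = weight * qa_scores + (1 - weight) * np.asarray(similar_question_scores)
    order = np.argsort(-combined)  # best match first
    return order, combined


# Toy 2-dimensional vectors standing in for real embeddings.
question = np.array([1.0, 0.0])
answers = np.array([[0.6, 0.8], [1.0, 0.0], [0.0, 1.0]])

order, scores = rank_answers(question, answers)
print(order, scores)

order, scores = rank_answers(question, answers, similar_question_scores=[1.0, 0.0, 0.0])
print(order, scores)
```

With only the QA scores, the second answer ranks first; once the question similarity scores are blended in, the first document moves to the top. The two scores may be on different scales with real models, so the weight would need tuning on your own data.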
