Elasticsearch semantic_text search

Hi,

I'm starting to learn Elasticsearch because I have a specific requirement, and I'm trying to do a proof of concept where someone can semantically search document contents (PDF, DOCX, etc.).

I have used FSCrawler to help with reading the files, and assigned the following mapping to the created index:

client.indices.create(index="first_index",mappings=
  {
    "dynamic": "true",
    "properties": {
       "semantic_text_field": {
                "type": "semantic_text",
                 "inference_id": "my-elser-endpoint"
              }
	}
  }
)

I have configured the embedding endpoint as follows:

client.inference.put(
    inference_id="my-elser-endpoint",
    task_type="sparse_embedding",
    inference_config={
        "service": "elser",
        "service_settings": {
            "num_allocations": 1,
            "num_threads": 1
        }
    }
)
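(As a side note, to double-check that the endpoint was created before indexing anything, something like the following should work, assuming a client and server version that expose the inference APIs:)

# Optional sanity check: list the inference endpoint that was just created.
print(client.inference.get(inference_id="my-elser-endpoint"))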

Then I used an ingest pipeline to assign semantic_text_field from the FSCrawler content field:

client.ingest.put_pipeline(
    id="ml_pipeline",
    processors=[
        {
            "script": {
                "description": "Copy the FSCrawler content field into the semantic_text field",
                "lang": "painless",
                "source": """
                    ctx.semantic_text_field = ctx.content;
                """
            }
        }
    ]
)
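(To confirm the processor behaves as expected before running a crawl, the pipeline can be simulated against a fake document; a minimal sketch, using the same content field the script above reads from:)

# Simulate the pipeline on a sample document to verify that "content"
# is copied into "semantic_text_field".
resp = client.ingest.simulate(
    id="ml_pipeline",
    docs=[{"_source": {"content": "First sentence. Second sentence."}}]
)
print(resp["docs"][0]["doc"]["_source"]["semantic_text_field"])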

After running FSCrawler and looking at the results, I can see that semantic_text_field is assigned the document contents, and I can see multiple chunks with embeddings being created.

The problem I'm having is when I submit a query with the following syntax:

resp = client.search(index="first_index",query={
    "semantic": {
      "field": "semantic_text_field",
      "query": "...."
    }
  }
)

I'm getting hits on whole documents, not on the chunks. I thought the chunks would be indexed! How can I get hits on particular chunks rather than the full document text? Do I need to re-index each chunk independently?

I appreciate any help and guidance on this, as I have spent a lot of time getting to this point.
Thanks

Can someone please help me with this?

The other thread implies that you're on 7.x

The semantic_text field type isn't available in 7.x

Thanks for your reply.
I did actually upgrade to the latest version, 8.16.
Can you help me with my question, please? I have made some progress using inner_hits and sparse_vector to isolate the results of the vector field on each chunk; however, can the same be done with the semantic_text field?

So basically, to get it to work using sparse_vector, I had to update the FSCrawler mapping as follows to allow nested fields, which is what inner_hits needs:

client.indices.create(index="chunk_index",settings={"index.mapping.total_fields.limit": 2500},mappings=
  {
   
    "dynamic": "true",
    "properties": {
      "passages": {
        "type":"nested",
        "properties": {
          "vector": {
            "type":"nested",
            "properties": {
              "predicted_value": {
                "type": "sparse_vector"
               

              }
            }
          }
        }
      },
      "path":{
          "type":"nested"
      }
    }
  }
)

My search query looks like this:

resp = client.search(
    index="chunk_index",
    query={
        "nested": {
            "path": "passages",
            "query": {
                "nested": {
                    "path": "passages.vector",
                    "query": {
                        "sparse_vector": {
                            "field": "passages.vector.predicted_value",
                            "inference_id": "my-elser-endpoint",
                            "query": "new web service"
                        }
                    }
                }
            },
            "inner_hits": {}
        }
    }
)
from colorama import Back, Fore, Style

if len(resp["hits"]["hits"]) == 0:
    print("Your search returned no results.")
else:
    for hit in resp["hits"]["hits"]:
        id = hit["_id"]
        file_score = hit["_score"]
        filename = hit["_source"]["file"]["filename"]
        pretty_output = f"\nID: {id}\nScore: {file_score}\nFile: {filename}"
        print(Fore.YELLOW + pretty_output)
        for inner_hit in hit["inner_hits"]["passages"]["hits"]["hits"]:
            passage_score = inner_hit["_score"]
            passage_text = inner_hit["_source"]["text"]
            print(Fore.CYAN + f"\nPassage Score: {passage_score}\nPassage Text:")
            print(Fore.BLACK + Back.WHITE + f"\t\t{passage_text}")
            print(Style.RESET_ALL)
            print("---------------------------------------------------------------------------------------------------------")

The ingest pipeline uses the chunking script from this blog.
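For context, that pipeline has roughly this shape (a simplified sketch rather than the exact blog script; it assumes the extracted text is in content, the chunks land in a passages array matching the mapping above, and .elser_model_2 is the deployed ELSER model ID, so adjust the names to your deployment):

# Simplified chunking pipeline: a Painless script splits "content" into
# sentence-sized passages, then a foreach/inference pair embeds each passage
# into passages.vector.predicted_value (matching the nested mapping above).
client.ingest.put_pipeline(
    id="chunker_pipeline",
    processors=[
        {
            "script": {
                "description": "Split content into sentence-sized passages",
                "lang": "painless",
                "source": """
                    ctx['passages'] = new ArrayList();
                    for (String sentence : ctx['content'].splitOnToken('. ')) {
                        Map passage = new HashMap();
                        passage.put('text', sentence);
                        ctx['passages'].add(passage);
                    }
                """
            }
        },
        {
            "foreach": {
                "field": "passages",
                "processor": {
                    "inference": {
                        "model_id": ".elser_model_2",
                        "input_output": [
                            {
                                "input_field": "_ingest._value.text",
                                "output_field": "_ingest._value.vector.predicted_value"
                            }
                        ]
                    }
                }
            }
        }
    ]
)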

However, I would like to take advantage of the built-in chunking of the semantic_text type, but it doesn't support nested fields, so I can't use inner_hits. Can it be done?


Would you like to share your findings within the FSCrawler documentation? That's something I'd like to have. And maybe change the default mapping of FSCrawler so that semantic search can be turned on with a simple setting...

Hi ,

Thanks for creating such a tool. There are a couple of things I would like to mention, if I may, that could be improvements:

1- The ability to apply multi-threading. I read in one post that it's a single-threaded app, and to avoid that you have to run multiple instances, which means different jobs and folders to partition the load and the data.

2- The ability for the crawler to chunk documents itself, so that it's done without having to call a pipeline.

As far as taking advantage of the semantic_text field, which does the chunking and embedding for you: with the mapping FSCrawler produces, you can't query those chunks independently of the whole document, because the semantic_text field currently doesn't support a nested structure. That means you can't run a nested query and get inner_hits on the chunks, like I did above with the sparse_vector type, where I have to do the chunking myself. Unless there is a way to do that, the only solution is for FSCrawler to somehow support uploading individual chunks into a different index and then querying those chunks independently (see the sketch below).
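(Something along these lines is what I have in mind; a rough sketch with a hypothetical chunk_docs index, a naive sentence splitter, and the my-elser-endpoint created earlier, not something FSCrawler supports today:)

from elasticsearch import helpers

# Hypothetical chunk-per-document index: every chunk becomes its own document
# with a semantic_text field, so a plain semantic query returns chunk-level hits.
client.indices.create(
    index="chunk_docs",
    mappings={
        "properties": {
            "parent_id": {"type": "keyword"},  # link back to the source file
            "chunk_text": {
                "type": "semantic_text",
                "inference_id": "my-elser-endpoint"
            }
        }
    }
)

def index_chunks(parent_id, content):
    # Naive sentence-level chunking; FSCrawler would need to do something
    # similar internally for this to work out of the box.
    chunks = [c.strip() for c in content.split(". ") if c.strip()]
    helpers.bulk(
        client,
        (
            {"_index": "chunk_docs",
             "_source": {"parent_id": parent_id, "chunk_text": chunk}}
            for chunk in chunks
        ),
    )

# Chunk-level hits come back directly, no nested query or inner_hits needed.
resp = client.search(
    index="chunk_docs",
    query={"semantic": {"field": "chunk_text", "query": "new web service"}}
)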


Another thing I forgot to list is the ability to read tables out of PDFs. Right now everything is read as a stream of text where the data in each row is not aligned, so nothing intelligent can be made of the content. Thanks