Elasticsearch semantic_text search

Hi,

I'm starting to learn Elasticsearch because I have a specific requirement, and I'm trying to do a proof of concept where someone can semantically search document contents (PDF, DOCX, etc.).

I have used FSCrawler to help with reading the files, and assigned the following mapping to the created index:

client.indices.create(index="first_index",mappings=
  {
    "dynamic": "true",
    "properties": {
       "semantic_text_field": {
                "type": "semantic_text",
                 "inference_id": "my-elser-endpoint"
              }
	}
  }
)

I have configured the embedding endpoint as follows:

client.inference.put(
    inference_id="my-elser-endpoint",
    task_type="sparse_embedding",
    inference_config={
        "service": "elser",
        "service_settings": {
            "num_allocations": 1,
            "num_threads": 1
        }
    }
)
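(As a side note, to double-check that the endpoint was created before indexing anything, something like the following should work, assuming a client and server version that expose the inference APIs:)

# Optional sanity check: list the inference endpoint that was just created.
print(client.inference.get(inference_id="my-elser-endpoint"))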

Then I used an ingest pipeline to assign semantic_text_field from the FSCrawler content field:

client.ingest.put_pipeline(
    id="ml_pipeline",
    processors=[
        {
            "script": {
                "description": "Copy the FSCrawler content field into the semantic_text field",
                "lang": "painless",
                "source": """
                    ctx.semantic_text_field = ctx.content;
                """
            }
        }
    ]
)
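(To confirm the processor behaves as expected before running a crawl, the pipeline can be simulated against a fake document; a minimal sketch, using the same content field the script above reads from:)

# Simulate the pipeline on a sample document to verify that "content"
# is copied into "semantic_text_field".
resp = client.ingest.simulate(
    id="ml_pipeline",
    docs=[{"_source": {"content": "First sentence. Second sentence."}}]
)
print(resp["docs"][0]["doc"]["_source"]["semantic_text_field"])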

After running FSCrawler and looking at the results, I can see that semantic_text_field is assigned the document contents, and I can see multiple chunks with embeddings being created.

The problem I'm having is when I submit a query with the following syntax:

resp = client.search(index="first_index",query={
    "semantic": {
      "field": "semantic_text_field",
      "query": "...."
    }
  }
)

I'm getting hits on whole documents, not on the chunks. I thought the chunks would be indexed! How can I get hits on particular chunks rather than the full document text? Do I need to re-index each chunk independently?

I appreciate any help and guidance on this, as I have spent a lot of time getting to this point.
Thanks

Can someone please help me with this?

The other thread implies that you're on 7.x

The semantic_text field type isn't available in 7.x

Thanks for your reply.
I did actually upgrade to the latest version, 8.16.
Can you help me with my question, please? I have made some progress using inner_hits and sparse_vector to isolate the results of the vector field on each chunk; however, can the same be done with the semantic_text field?

So basically, to get it to work using sparse_vector, I had to update the FSCrawler mapping as follows to allow nested fields, which is what inner_hits needs:

client.indices.create(index="chunk_index",settings={"index.mapping.total_fields.limit": 2500},mappings=
  {
   
    "dynamic": "true",
    "properties": {
      "passages": {
        "type":"nested",
        "properties": {
          "vector": {
            "type":"nested",
            "properties": {
              "predicted_value": {
                "type": "sparse_vector"
               

              }
            }
          }
        }
      },
      "path":{
          "type":"nested"
      }
    }
  }
)

My search query looks like this:

resp = client.search(
    index="chunk_index",
    query={
        "nested": {
            "path": "passages",
            "query": {
                "nested": {
                    "path": "passages.vector",
                    "query": {
                        "sparse_vector": {
                            "field": "passages.vector.predicted_value",
                            "inference_id": "my-elser-endpoint",
                            "query": "new web service"
                        }
                    }
                }
            },
            "inner_hits": {}
        }
    }
)
from colorama import Back, Fore, Style

if len(resp["hits"]["hits"]) == 0:
    print("Your search returned no results.")
else:
    for hit in resp["hits"]["hits"]:
        id = hit["_id"]
        file_score = hit["_score"]
        filename = hit["_source"]["file"]["filename"]
        pretty_output = f"\nID: {id}\nScore: {file_score}\nFile: {filename}"
        print(Fore.YELLOW + pretty_output)
        for inner_hit in hit["inner_hits"]["passages"]["hits"]["hits"]:
            passage_score = inner_hit["_score"]
            passage_text = inner_hit["_source"]["text"]
            print(Fore.CYAN + f"\nPassage Score: {passage_score}\nPassage Text:")
            print(Fore.BLACK + Back.WHITE + f"\t\t{passage_text}")
            print(Style.RESET_ALL)
            print("---------------------------------------------------------------------------------------------------------")

The ingest pipeline uses the chunking script from this blog.
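For context, that pipeline has roughly this shape (a simplified sketch rather than the exact blog script; it assumes the extracted text is in content, the chunks land in a passages array matching the mapping above, and .elser_model_2 is the deployed ELSER model ID, so adjust the names to your deployment):

# Simplified chunking pipeline: a Painless script splits "content" into
# sentence-sized passages, then a foreach/inference pair embeds each passage
# into passages.vector.predicted_value (matching the nested mapping above).
client.ingest.put_pipeline(
    id="chunker_pipeline",
    processors=[
        {
            "script": {
                "description": "Split content into sentence-sized passages",
                "lang": "painless",
                "source": """
                    ctx['passages'] = new ArrayList();
                    for (String sentence : ctx['content'].splitOnToken('. ')) {
                        Map passage = new HashMap();
                        passage.put('text', sentence);
                        ctx['passages'].add(passage);
                    }
                """
            }
        },
        {
            "foreach": {
                "field": "passages",
                "processor": {
                    "inference": {
                        "model_id": ".elser_model_2",
                        "input_output": [
                            {
                                "input_field": "_ingest._value.text",
                                "output_field": "_ingest._value.vector.predicted_value"
                            }
                        ]
                    }
                }
            }
        }
    ]
)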

However, I would like to take advantage of the built-in chunking of the semantic_text type, but it doesn't support nested fields, so I can't use inner_hits. Can it be done?


Would you like to share your findings within the FSCrawler documentation? That's something I'd like to have. And maybe change the default mapping of FSCrawler so that semantic search can be turned on with a simple setting...

Hi ,

Thanks for creating such a tool. There are a couple of things I would like to mention, if I may, that could be improvements:

1- The ability to apply multi-threading. I read in one post that it's a single-threaded app, and to avoid that you have to run multiple instances, which means different jobs and folders to partition the load and the data.

2- The ability for the crawler to chunk documents itself, so that it's done without having to call a pipeline.

As far as taking advantage of the semantic_text field, which does the chunking and embedding for you: with the mapping FSCrawler produces, you can't query those chunks independently of the whole document, because the semantic_text field currently doesn't support a nested structure. That means you can't run a nested query and get inner_hits on the chunks, like I did above with the sparse_vector type, where I have to do the chunking myself. Unless there is a way to do that, the only solution is for FSCrawler to somehow support uploading individual chunks into a different index and then querying those chunks independently (see the sketch below).
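(Something along these lines is what I have in mind; a rough sketch with a hypothetical chunk_docs index, a naive sentence splitter, and the my-elser-endpoint created earlier, not something FSCrawler supports today:)

from elasticsearch import helpers

# Hypothetical chunk-per-document index: every chunk becomes its own document
# with a semantic_text field, so a plain semantic query returns chunk-level hits.
client.indices.create(
    index="chunk_docs",
    mappings={
        "properties": {
            "parent_id": {"type": "keyword"},  # link back to the source file
            "chunk_text": {
                "type": "semantic_text",
                "inference_id": "my-elser-endpoint"
            }
        }
    }
)

def index_chunks(parent_id, content):
    # Naive sentence-level chunking; FSCrawler would need to do something
    # similar internally for this to work out of the box.
    chunks = [c.strip() for c in content.split(". ") if c.strip()]
    helpers.bulk(
        client,
        (
            {"_index": "chunk_docs",
             "_source": {"parent_id": parent_id, "chunk_text": chunk}}
            for chunk in chunks
        ),
    )

# Chunk-level hits come back directly, no nested query or inner_hits needed.
resp = client.search(
    index="chunk_docs",
    query={"semantic": {"field": "chunk_text", "query": "new web service"}}
)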


Another thing I forgot to list is the ability to read tables out of PDFs. Right now everything is read as a stream of text where the data in each row is not aligned, so nothing intelligent can be made of the content. Thanks