Hi,
I'm starting to learn Elasticsearch because I have a specific requirement: I'm trying to build a proof of concept where someone can semantically search document contents (PDF, DOCX, etc.).
I used FSCrawler to read the files, and I assigned the following mapping to the created index:
client.indices.create(
    index="first_index",
    mappings={
        "dynamic": "true",
        "properties": {
            "semantic_text_field": {
                "type": "semantic_text",
                "inference_id": "my-elser-endpoint"
            }
        }
    }
)
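As a sanity check on my side, I confirmed the mapping was applied:

# Confirm semantic_text_field is mapped as semantic_text
print(client.indices.get_mapping(index="first_index"))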
I have configured the embedding endpoint as follows:
client.inference.put(
    inference_id="my-elser-endpoint",
    task_type="sparse_embedding",
    inference_config={
        "service": "elser",
        "service_settings": {
            "num_allocations": 1,
            "num_threads": 1
        }
    }
)
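To verify the endpoint exists before ingesting anything, I fetched it back (again just a sanity check):

# Verify the ELSER endpoint was created
print(client.inference.get(inference_id="my-elser-endpoint"))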
Then I used an ingest pipeline to copy the FSCrawler content field into my semantic_text_field:
client.ingest.put_pipeline(
    id="ml_pipeline",
    processors=[
        {
            "script": {
                "description": "Copy the FSCrawler content field into the semantic_text field",
                "lang": "painless",
                "source": """
                    ctx.semantic_text_field = ctx.content;
                """
            }
        }
    ]
)
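Before running the crawler I also simulated the pipeline with a made-up document (the sample content below is hypothetical) to make sure the field gets copied:

# Simulate the pipeline on a hypothetical document to verify the copy step
resp = client.ingest.simulate(
    id="ml_pipeline",
    docs=[{"_source": {"content": "A sample sentence. Another sentence."}}]
)
print(resp)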
After running FSCrawler, I can see that semantic_text_field is assigned the document contents, and I can see multiple chunks with embeddings being created.
The problem I'm having is when I submit a query with the following syntax:
resp = client.search(
    index="first_index",
    query={
        "semantic": {
            "field": "semantic_text_field",
            "query": "...."
        }
    }
)
I'm getting hits on whole documents, not on individual chunks. I thought the chunks would be indexed! How can I get hits on particular chunks instead of the full document text? Do I need to re-index each chunk independently?
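The closest thing I found in the docs is the "semantic" highlighter, which is supposed to return the best-matching chunks of a semantic_text field. I tried something like the sketch below (the fragment count is my guess), but I'd like to confirm whether this is the intended way to get per-chunk results:

resp = client.search(
    index="first_index",
    query={
        "semantic": {
            "field": "semantic_text_field",
            "query": "...."
        }
    },
    highlight={
        "fields": {
            "semantic_text_field": {
                "type": "semantic",        # semantic highlighter for semantic_text fields
                "number_of_fragments": 2   # return the top 2 matching chunks per hit
            }
        }
    }
)
# Matching chunks should appear under hit["highlight"]["semantic_text_field"]
for hit in resp["hits"]["hits"]:
    print(hit.get("highlight", {}).get("semantic_text_field"))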
I'd appreciate any help and guidance with this, as I've spent a lot of time getting to this point.
Thanks