Recommendations for a chatbot search system on a heterogeneous data set

Hello, everyone,
We would welcome suggestions on how to use Elasticsearch to build a chatbot-like search system. The data amounts to several tens of terabytes and consists of both documents (especially PDFs and emails) and text values associated with entity properties.

We currently use Apache Jackrabbit as our content repository, and the data is already indexed through Lucene. However, we have two separate applications and therefore two separate content repositories. We would like to move the data from both repositories into one place where it can be queried with AI-style searches.

Could you point us to Elasticsearch documentation that covers a scenario like the one described?

Thanks for your help

Your first challenge will be getting the data into Elasticsearch and indexing it in a way that makes it searchable.
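
As a rough illustration of the ingestion step, here is a minimal sketch that builds a request body in the newline-delimited JSON format expected by the Elasticsearch `_bulk` API. The index name `content`, the field names (`source_repo`, `kind`, `text`) and the sample documents are assumptions made up for this example; extracting text from the PDFs and emails in Jackrabbit is not shown.

```python
import json


def build_bulk_body(docs, index_name="content"):
    """Return an NDJSON string suitable for a POST to the _bulk endpoint."""
    lines = []
    for doc in docs:
        # Action line: which index and document ID to write to.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc["id"]}}))
        # Source line: the document fields, without the ID.
        lines.append(json.dumps({k: v for k, v in doc.items() if k != "id"}))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical documents, one from each of the two repositories.
docs = [
    {"id": "repoA-1", "source_repo": "repoA", "kind": "pdf",
     "text": "Contract terms and conditions"},
    {"id": "repoB-7", "source_repo": "repoB", "kind": "entity_property",
     "text": "Customer since 2019"},
]
bulk_body = build_bulk_body(docs)
print(bulk_body)
```

Keeping a field such as `source_repo` on every document is one way to merge both repositories into a single index while still being able to filter by origin.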

I would suggest starting by learning about semantic search and chunking, then running queries through either ELSER or a dense vector model to retrieve good documents for an LLM to consume. Once that works, you can build a query pipeline that summarizes the results. LangChain could be a good route for this; the "Question Answering with Langchain and OpenAI" notebook is a good example of this workflow.
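
To give an idea of what chunking and dense vector retrieval look like, here is a minimal sketch. It splits text into overlapping word windows and builds the body of a kNN search request. The chunk sizes, the field names `text_embedding` and `text`, and the helper names are assumptions for illustration; in practice the query vector would come from the same embedding model used at index time, which is not shown here.

```python
def chunk_words(text, size=200, overlap=50):
    """Split text into chunks of `size` words, each sharing `overlap` words with the previous one."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        # Stop once the current window reaches the end of the text.
        if start + size >= len(words):
            break
    return chunks


def knn_search_body(query_vector, field="text_embedding", k=5, num_candidates=50):
    """Build the body of a kNN search request against a dense_vector field."""
    return {
        "knn": {
            "field": field,  # hypothetical dense_vector field name
            "query_vector": list(query_vector),
            "k": k,
            "num_candidates": num_candidates,
        },
        "_source": ["text"],
    }


sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, size=4, overlap=1)
print(chunks)
print(knn_search_body([0.1, 0.2, 0.3]))
```

The retrieved chunks, rather than whole PDFs or emails, are what you would pass to the LLM as context.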

Additionally, this article is an introduction to Retrieval Augmented Generation (RAG).
Then, for some more prototyping experience, this Chatbot Tutorial is another good resource.

Thanks a lot, Rodney,

I've started reading, but I can't find whether Elasticsearch can return data based on the roles/permissions of the user who ran the search. Is it possible to enrich ingested data with metadata describing who can access it, or something similar?

Thanks again

Yes, Elasticsearch can do that with Document Level Security.
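
As a minimal sketch of how that could look: a role definition can carry a `query` that restricts which documents its users are able to read. The example below builds such a body for the create-role API (`PUT /_security/role/<name>`). The index pattern `content-*`, the metadata field `allowed_groups` and the group name `finance` are assumptions for illustration; the idea is that each ingested document carries a field listing who may see it.

```python
import json


def dls_role(index_pattern, group):
    """Build a role body that grants read access only to documents tagged with `group`."""
    return {
        "indices": [
            {
                "names": [index_pattern],
                "privileges": ["read"],
                # Document level security: only documents matching this query are visible.
                "query": {"term": {"allowed_groups": group}},  # hypothetical metadata field
            }
        ]
    }


role = dls_role("content-*", "finance")
role_json = json.dumps(role, indent=2)
print(role_json)
```

Users mapped to this role would only get hits whose `allowed_groups` field contains `finance`, whatever query they run.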
