Recommendations for a chatbot search system on a heterogeneous data set

rgambelli · April 17, 2024, 2:01pm

Hello, everyone,
we would welcome suggestions on how to use Elasticsearch to build a chatbot-like search system. The data set is several tens of terabytes and consists of both documents (mostly PDFs and emails) and text values associated with entity properties.

We currently use Apache Jackrabbit as our content repository, and the data is already indexed through Lucene. However, we have two separate applications and therefore two separate content repositories. We would like to move the data from both repositories into one place where it can be queried with AI-style searches.

Could you point us to Elasticsearch documentation that covers a scenario like the one described?

Thanks for your help


Rodney_Norris · April 17, 2024, 2:55pm

Your first challenge will be getting the data into Elasticsearch and indexing it in a way that makes it searchable.

I would suggest starting by learning about semantic search and chunking, then running queries through either ELSER or a dense vector model to retrieve good documents for an LLM to consume. Once that is done, you can build a query pipeline to summarize the results. LangChain could be a good route for this; the Question Answering with Langchain and OpenAI notebook is a good example of this workflow.

Additionally, this article is an introduction to Retrieval Augmented Generation (RAG).
Then, for some more prototyping experience, this Chatbot Tutorial is another good resource.
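
To make the chunking and dense vector steps a bit more concrete, here is a minimal sketch in Python. It only splits text into overlapping word windows and builds the body of a kNN `_search` request; it does not call a cluster or an embedding model. The field names `embedding`, `text` and `source_id`, the chunk sizes, and both function names are assumptions for illustration, not something taken from your setup.

```python
from typing import Dict, List, Sequence


def chunk_text(text: str, max_words: int = 200, overlap: int = 40) -> List[str]:
    """Split text into overlapping word windows (hypothetical helper).

    Each chunk holds at most max_words words and shares `overlap` words
    with the previous chunk, so a passage cut at a boundary still appears
    whole in at least one chunk.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def build_knn_search_body(
    query_vector: Sequence[float],
    k: int = 5,
    num_candidates: int = 50,
    vector_field: str = "embedding",  # assumed name of a dense_vector field
) -> Dict:
    """Build the body of a kNN search request for the _search API."""
    return {
        "knn": {
            "field": vector_field,
            "query_vector": list(query_vector),
            "k": k,
            "num_candidates": num_candidates,
        },
        # assumed field names holding the chunk text and the parent document id
        "_source": ["text", "source_id"],
    }
```

Each chunk would be embedded and indexed as its own document, and the top `k` chunks returned by the search are what you would pass to the LLM as context.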

rgambelli · April 18, 2024, 2:34pm

Many thanks, Rodney,

I've started reading, but I can't find whether Elasticsearch can return data based on the roles/permissions of the user who ran the search. Is it possible to enrich ingested data with metadata describing who can access it, or something similar?

Thanks again

Rodney_Norris · April 18, 2024, 4:02pm

Yes, Elasticsearch can do that with Document Level Security.
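
As a rough sketch of how that could look: a role definition can carry a `query` per index entry, and users holding the role only see documents matching it. The Python below just builds such a role body as a dict (the kind of JSON you would send to the create role API); it does not talk to a cluster. The `allowed_groups` field, the index name and the function name are assumptions for illustration. You would need to write that metadata onto each document at ingest time.

```python
from typing import Dict, List


def build_dls_role(
    index_names: List[str],
    allowed_groups: List[str],
    group_field: str = "allowed_groups",  # assumed metadata field on each document
) -> Dict:
    """Build a role body that restricts reads to documents whose
    group_field contains at least one of allowed_groups."""
    return {
        "indices": [
            {
                "names": list(index_names),
                "privileges": ["read"],
                # Document Level Security: only documents matching this
                # query are visible to users who hold the role.
                "query": {"terms": {group_field: list(allowed_groups)}},
            }
        ]
    }


# Example: a role for users in the (hypothetical) finance and legal groups
role_body = build_dls_role(["docs"], ["finance", "legal"])
```

With something like this in place, the same search request returns different hits depending on the roles of the user who runs it.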

