Hello, everyone,
We would welcome suggestions on how to leverage Elasticsearch to build a chatbot-like search system. The data amounts to several tens of terabytes and consists of both documents (especially PDFs and emails) and text values associated with entity properties.
We currently use Apache Jackrabbit as our content repository, and the data is already indexed through Lucene. However, we have two separate applications and therefore two separate content repositories. We would like to move the data from both repositories into one place where it can be queried with AI-like searches.
Could you point us to Elasticsearch documentation that covers a scenario like the one described?
Your first challenge will be getting the data into Elasticsearch and indexing it in a way that's searchable.
I would suggest starting by learning about semantic search and chunking, then running queries through either ELSER or a dense vector model so that you retrieve good documents for an LLM to consume. Once that works, you can build a query pipeline that summarizes the results. LangChain could be a good route for this; the "Question Answering with Langchain and OpenAI" notebook is a good example of this workflow.
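To make the chunking step a bit more concrete, here is a minimal sketch in plain Python. It splits extracted text into overlapping word windows before indexing. The function name `chunk_words` and the default `size` and `overlap` values are only assumptions for illustration, not something Elasticsearch or LangChain prescribes; in practice you would tune them to the token limit of whichever model you pick.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of at most `size` words.

    Consecutive chunks share `overlap` words, so a sentence that straddles
    a boundary still appears whole in at least one chunk.
    (Hypothetical helper; sizes are illustrative defaults.)
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")

    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        # Stop once the current window has reached the end of the text.
        if start + size >= len(words):
            break
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for chunk in chunk_words(sample, size=4, overlap=1):
        print(chunk)
```

Each chunk would then be indexed as its own document (or nested passage), keeping a reference back to the source PDF or email so the LLM answer can cite where it came from.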
I've started reading up on this, but I can't find whether Elasticsearch can return data based on the roles/permissions of the user who runs the search. Is it possible to enrich the ingested data with metadata representing who can access it, or something similar?
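As a rough illustration of the kind of metadata approach being asked about, the sketch below stores the roles allowed to see a document alongside its content and adds a `terms` filter to the search body. The field names `content` and `allowed_roles` and both helper functions are hypothetical, and the filter would have to be applied server-side by the application, never by the end user. Elasticsearch also offers a document level security feature that may be worth checking for your license and version.

```python
def build_doc(text: str, allowed_roles: list[str]) -> dict:
    """Document body carrying access metadata (field names are assumptions)."""
    return {"content": text, "allowed_roles": sorted(set(allowed_roles))}


def build_query(question: str, user_roles: list[str]) -> dict:
    """Search body that only matches documents sharing a role with the user."""
    return {
        "query": {
            "bool": {
                # Full-text relevance on the document content.
                "must": [{"match": {"content": question}}],
                # Non-scoring filter: at least one of the user's roles
                # must appear in the document's allowed_roles field.
                "filter": [{"terms": {"allowed_roles": list(user_roles)}}],
            }
        }
    }


if __name__ == "__main__":
    print(build_doc("Quarterly report", ["finance", "audit"]))
    print(build_query("quarterly revenue", ["finance"]))
```

For the filter to behave as intended, `allowed_roles` would need to be mapped as a `keyword` field so that role names are matched exactly.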