Hi, I am very interested in combining GPT with Elasticsearch for enterprise data (link: ChatGPT and Elasticsearch: OpenAI meets private data — Elastic Search Labs). We have used Azure Cognitive Search + OpenAI GPT but faced many implementation issues. For instance, all our documents in Microsoft Word have to be converted to PDF before chunking can take place, and we have to use chunking overlap to ensure context linkage between pages.
Before we dive deeper into the Elasticsearch solution, may I ask whether Elasticsearch has similar limitations?
For instance, all our documents in Microsoft Words have to be converted into PDF before chunking
That's odd. In the Elastic ecosystem, it's more common to convert documents to plain text first. The attachment processor can do that for you in an ingest pipeline, or you can use tools like Pandoc or Apache Tika outside the stack.
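For reference, here is a minimal sketch of such a pipeline. The pipeline name `word-to-text` and the source field `data` are illustrative choices, not required names; the attachment processor itself is real and is backed by Apache Tika, so it can read Word files directly without a PDF conversion step. Documents are sent with the binary content base64-encoded in the `data` field:

```
PUT _ingest/pipeline/word-to-text
{
  "description": "Extract plain text from base64-encoded office documents",
  "processors": [
    {
      "attachment": {
        "field": "data",
        "target_field": "attachment"
      }
    }
  ]
}
```

After ingesting through this pipeline, the extracted text lands in `attachment.content`, which you can then chunk or feed to a semantic text model.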
we have to use chunking overlap to ensure context linkage between pages.
This is more of a problem-space pattern than a stack-enforced requirement. If your chunks don't overlap but a query needs context that spans two adjacent chunks, you won't get a hit. That's a tradeoff you'll have to weigh regardless of the technology you use: whether the cost of the extra inference is worth the improved relevance. You have lots of knobs to turn here: how big to make the chunks, how much they should overlap, how many chunks to produce at most, and even whether to chunk large text content at all, or to first summarize it with an LLM and apply the semantic text model to just the summary.
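To make those knobs concrete, here is a minimal sliding-window chunker sketch. The function name and the character-based sizes are illustrative, not an Elasticsearch API; in practice you'd likely chunk by tokens or sentences, but the overlap logic is the same:

```python
def chunk_text(text, chunk_size=200, overlap=50, max_chunks=None):
    """Split text into fixed-size chunks with overlap (sizes in characters).

    chunk_size, overlap, and max_chunks are the tuning knobs discussed
    above: larger overlap preserves more cross-chunk context at the cost
    of more inference calls on redundant text.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # each window starts this far after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if max_chunks is not None and len(chunks) >= max_chunks:
            break
        if start + chunk_size >= len(text):
            break  # this chunk already reached the end of the text
    return chunks
```

Note that the tail of each chunk repeats as the head of the next, so a query whose answer straddles a chunk boundary still has at least one chunk containing both halves, provided the answer span is shorter than the overlap.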