Hi, I am very interested in combining the usage of GPT + Elasticsearch for enterprise data (link: ChatGPT and Elasticsearch: OpenAI meets private data — Elastic Search Labs). We have used Azure Cognitive Search + OpenAI GPT but faced many implementation issues. For instance, all our documents in Microsoft Word have to be converted into PDF before chunking can take place, and we have to use chunking overlap to ensure context linkage between pages.
Before we dive deeper into the Elasticsearch solution, can I ask whether Elasticsearch has similar limitations?
Hi @Pey!
I am very interested in combining the usage of GPT + Elasticsearch for enterprise data
Fantastic! This is a use case we're very focused on right now. It looks like you've found one of our relevant blogs. You may also want to read Chunking Large Documents via Ingest pipelines plus nested vectors equals easy passage search — Elastic Search Labs, which offers some ideas on how to approach chunking with Elastic Stack components. Privacy-first AI search using LangChain and Elasticsearch — Elastic Search Labs also provides guidance on using tools like LangChain to chunk your data.
For instance, all our documents in Microsoft Word have to be converted into PDF before chunking
That's odd. In our ecosystem, it's more common to convert documents to plain text first. The Attachment Processor can do that for you in an ingest pipeline, or you can use tools like pandoc or Apache Tika to do it outside our ecosystem.
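As a rough sketch of what that looks like, here is an ingest pipeline definition using the Attachment Processor to pull plain text out of a base64-encoded document. The pipeline id and field names (`data`, `attachment`) are illustrative choices, not requirements:

```python
# Illustrative ingest pipeline: the attachment processor extracts plain text
# from a base64-encoded binary document (e.g. a Word file) at index time.
pipeline = {
    "description": "Extract plain text from base64-encoded documents",
    "processors": [
        {
            "attachment": {
                "field": "data",               # base64-encoded source document
                "target_field": "attachment",  # extracted text lands in attachment.content
                "remove_binary": True,         # drop the raw base64 after extraction
            }
        }
    ],
}

# With the official Python client you would register it along these lines
# (cluster URL and pipeline id are placeholders):
# from elasticsearch import Elasticsearch
# es = Elasticsearch("http://localhost:9200")
# es.ingest.put_pipeline(id="extract-text", **pipeline)
```

Documents indexed with `?pipeline=extract-text` would then carry the extracted text in `attachment.content`, ready for chunking or embedding.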
we have to use chunking overlap to ensure context linkage between pages.
This is more of a problem-space pattern than a stack-enforced task. If you don't have overlap in your chunks but your query needs context from two non-overlapping chunks, you won't get a hit. It's a tradeoff you'll have to consider regardless of the tech you use: whether the cost of the extra inference is worth the improved relevance. You have lots of knobs you can turn here, like how big of chunks to make, how much overlap they should have, how many chunks to make at most, and even whether to chunk large text content at all, or to first summarize the large content with an LLM and then apply the semantic text model to just the summary.
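To make those knobs concrete, here is a minimal character-based chunker with the three parameters mentioned above (chunk size, overlap, max chunks). The defaults are illustrative, not recommendations; in practice you might split on tokens or sentences instead:

```python
def chunk_text(text, chunk_size=400, overlap=50, max_chunks=None):
    """Split text into fixed-size character chunks with overlap.

    A simple sketch of the overlap tradeoff: each chunk repeats the last
    `overlap` characters of the previous one, so context spanning a chunk
    boundary still lands inside a single chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap  # how far the window advances each time
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if max_chunks is not None and len(chunks) >= max_chunks:
            break
        if start + chunk_size >= len(text):
            break  # last window already reached the end of the text
    return chunks

# Example: 4-character chunks with 2 characters of overlap
# chunk_text("abcdefghij", chunk_size=4, overlap=2)
# → ["abcd", "cdef", "efgh", "ghij"]
```

Larger overlap improves the odds that a query's context fits in one chunk, at the price of more chunks and therefore more embedding inference.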
Hopefully this is helpful.
This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.