I recently came across this article that describes a strategy for chunking large documents by breaking them up at the sentence level. Is it possible to create a script that uses the HTML so chunking can occur first at the header level? The ideal chunking logic would be like this:
- Split the document at all headers (h1, h2, h3,...)
- If header chunk is below max word count or token count, recombine headers until the count is closest to, but not exceeding max count
- If header chunk is above max word count, split again at the paragraph level.
- If the paragraph chunk is below max word count or token count, recombine paragraphs until the count is closest to, but not exceeding max count.
- If the paragraph chunk is above max word count, split again at the sentence level using the script in the above referenced post.
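For reference, here is a rough Python sketch of the logic above, not a finished implementation. It assumes word counts rather than token counts, uses only the standard library, and substitutes a naive regex sentence splitter for the script in the referenced post. The names `SectionParser`, `pack`, and `chunk_html` are made up for illustration.

```python
import re
from html.parser import HTMLParser

HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class SectionParser(HTMLParser):
    """Groups text blocks into sections, starting a new section at each header."""

    def __init__(self):
        super().__init__()
        self.sections = [[]]  # first section holds any text before the first header
        self._buf = []

    def _flush(self):
        # Collapse whitespace and store the buffered text as one block.
        text = " ".join("".join(self._buf).split())
        self._buf = []
        if text:
            self.sections[-1].append(text)

    def handle_starttag(self, tag, attrs):
        if tag in HEADER_TAGS:
            self._flush()
            self.sections.append([])
        elif tag == "p":
            self._flush()

    def handle_endtag(self, tag):
        if tag in HEADER_TAGS or tag == "p":
            self._flush()

    def handle_data(self, data):
        self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def word_count(text):
    return len(text.split())


def split_sentences(text):
    # Naive stand-in (assumption): split after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def pack(pieces, max_words):
    """Greedily merge adjacent pieces without exceeding max_words per chunk."""
    chunks, current, count = [], [], 0
    for piece in pieces:
        n = word_count(piece)
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(piece)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def chunk_html(html, max_words):
    parser = SectionParser()
    parser.feed(html)
    parser.close()

    chunks, small = [], []
    for blocks in parser.sections:
        if not blocks:
            continue
        section = " ".join(blocks)
        if word_count(section) <= max_words:
            small.append(section)  # small header chunk: recombine later
            continue
        # Oversized header chunk: emit pending small sections, then go deeper.
        chunks.extend(pack(small, max_words))
        small = []
        para_pieces = []
        for block in blocks:
            if word_count(block) <= max_words:
                para_pieces.append(block)
            else:
                # Oversized paragraph: fall back to sentence level.
                para_pieces.extend(pack(split_sentences(block), max_words))
        chunks.extend(pack(para_pieces, max_words))
    chunks.extend(pack(small, max_words))
    return chunks
```

Calling `chunk_html(html_string, max_words=200)` returns a list of text chunks. Two caveats with this sketch: a single sentence longer than `max_words` is kept whole rather than split further, and the greedy packing only merges adjacent pieces, so it approximates "closest to, but not exceeding" rather than guaranteeing the best fit.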
I do most of this in a Python script, but having it all baked into a single ingest pipeline would be great. The less code I have to deal with, the better.