Chunking large documents using HTML during ingest?

I recently came across this article that describes a strategy for chunking large documents by breaking them up at the sentence level. Is it possible to create a script that uses HTML so chunking can occur first at the header level? The ideal chunking logic would be like this...

  1. Split the document at all headers (h1, h2, h3, ...)
  2. If a header chunk is below the max word or token count, recombine it with adjacent header chunks until the count is as close as possible to, but not exceeding, the max count.
  3. If a header chunk is above the max word count, split it again at the paragraph (<p>) level.
  4. If a paragraph chunk is below the max word or token count, recombine it with adjacent paragraph chunks until the count is as close as possible to, but not exceeding, the max count.
  5. If a paragraph chunk is still above the max word count, split it again at the sentence level using the script from the post referenced above.
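For what it's worth, here's a rough sketch of how the steps above could look in plain Python, using only the standard library. The `MAX_WORDS` value, the `SectionSplitter` class, and the naive regex sentence splitter are all my own assumptions (in practice you'd swap in the sentence-splitting script from the referenced post and a real tokenizer for token counts):

```python
import re
from html.parser import HTMLParser

MAX_WORDS = 200  # assumed limit; tune to your real word/token budget


class SectionSplitter(HTMLParser):
    """Collect text grouped into header (h1-h6) sections, split by <p> within each."""

    def __init__(self):
        super().__init__()
        self.sections = [[]]  # each section is a list of paragraph strings
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            self._flush()
            self.sections.append([])  # step 1: new chunk at every header
        elif tag == "p":
            self._flush()

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        self._buf.append(data)

    def close(self):
        self._flush()
        super().close()

    def _flush(self):
        text = " ".join("".join(self._buf).split())
        self._buf = []
        if text:
            self.sections[-1].append(text)


def word_count(text):
    return len(text.split())


def split_sentences(text):
    # naive placeholder; substitute the sentence splitter from the referenced post
    return re.split(r"(?<=[.!?])\s+", text)


def pack(pieces, max_words):
    """Steps 2/4: greedily recombine pieces up to max_words without exceeding it."""
    chunks, current, count = [], [], 0
    for piece in pieces:
        n = word_count(piece)
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(piece)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def chunk_html(html, max_words=MAX_WORDS):
    parser = SectionSplitter()
    parser.feed(html)
    parser.close()
    chunks = []
    for paragraphs in parser.sections:
        section_text = " ".join(paragraphs)
        if not section_text:
            continue
        if word_count(section_text) <= max_words:
            chunks.append(section_text)  # header chunk fits as-is
        else:
            for para in paragraphs:  # step 3: fall back to paragraph level
                if word_count(para) <= max_words:
                    chunks.append(para)
                else:  # step 5: fall back to sentence level
                    chunks.extend(pack(split_sentences(para), max_words))
    # steps 2/4: recombine undersized chunks toward the max
    return pack(chunks, max_words)
```

One caveat: the final `pack()` pass recombines greedily in document order, which is "closest without exceeding" only in a local sense, not a globally optimal packing. That's usually fine for retrieval chunking.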

I currently do most of this in a Python script, but having it all baked into a single ingest pipeline would be great. The less code I have to maintain, the better.
