I have an index which contains the content of a presentation, with one document per slide.
I am searching for relevant slides with the common terms query, which works quite well.
Now I would like to group search results by their proximity (slide number) to each other, forming groups of sequential or near-sequential slides, and have each group scored as a whole. This should prioritise runs of sequential matches, as they are more likely to be related to the actual search query.
Ideally I would be given a list of the slide numbers in each group, but the first slide in the group would be sufficient.
I am open to different ways of solving this problem, be it with preprocessing, aggregation, or even post-processing the results. Rebuilding indexes is not an issue.
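To make the post-processing option concrete, here is a rough sketch of the grouping I have in mind. It assumes the hits have already been pulled out of the search response as (slide number, score) pairs; the function name, the `max_gap` parameter, and summing the scores to get the group score are all just my own illustration, not anything the search engine provides.

```python
def group_adjacent_hits(hits, max_gap=1):
    """Group hits whose slide numbers are within max_gap of each other.

    hits: iterable of (slide_number, score) pairs (assumed input shape).
    Returns groups ordered by total score, highest first.
    """
    groups = []
    for slide, score in sorted(hits):
        # Extend the current group if this slide is close enough to its last slide.
        if groups and slide - groups[-1]["slides"][-1] <= max_gap:
            groups[-1]["slides"].append(slide)
            groups[-1]["score"] += score
        else:
            groups.append({"slides": [slide], "score": score})
    # Group score here is a plain sum, which favours longer runs of matches.
    return sorted(groups, key=lambda g: g["score"], reverse=True)


hits = [(7, 1.0), (3, 2.0), (4, 1.5), (12, 3.0)]
for group in group_adjacent_hits(hits):
    print(group["slides"], group["score"])
```

Setting `max_gap=2` would also accept "near sequential" groups with one non-matching slide in between. The drawback is that only the hits actually returned by the query get grouped, so the result size has to be large enough to catch the neighbouring slides.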
I have considered taking an n-gram style approach to creating the documents in the index.
For example, with a window of three, slide 1 would produce three documents:
- One would have the contents of just slide 1
- One would have the contents of slides 1 and 2
- One would have the contents of slides 1, 2 and 3
This seems like it would work well, but it is a little bit messy.
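As a sketch of what the preprocessing would look like, assuming the slides are available as a list of strings in presentation order. The field names `first_slide`, `slide_numbers` and `content` are hypothetical, chosen only for this example.

```python
def build_window_docs(slides, max_window=3):
    """Build one document per (starting slide, window size) combination.

    slides: list of slide texts, where index 0 is slide 1 (assumed input shape).
    """
    docs = []
    for start in range(len(slides)):
        for size in range(1, max_window + 1):
            end = start + size
            if end > len(slides):
                break  # the window would run past the last slide
            docs.append({
                "first_slide": start + 1,
                "slide_numbers": list(range(start + 1, end + 1)),
                "content": "\n".join(slides[start:end]),
            })
    return docs


docs = build_window_docs(["intro", "problem", "solution", "summary"])
for doc in docs[:3]:
    print(doc["slide_numbers"])
```

Each document carries the slide numbers it covers, so a hit on a multi-slide document directly gives the group and its first slide. The messy part is that the index grows by roughly a factor of `max_window`, and overlapping windows will show up as near-duplicate hits that still need de-duplicating.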
Does anyone have any better suggestions?