Index OpenXML-structured documents with Elasticsearch

We have a set of structured documents whose structure is heavily inspired by the OpenXML data model. Briefly, a document is made up of an ordered set of paragraphs; each paragraph has an id and an ordered set of runs; and each run has textual content and some metadata.

For instance, the following sample document contains two paragraphs, "Lorem ipsum" and "dolor sit amet".

    {
        "id": 1,
        "title": "De finibus",
        "paragraphs": [
            {
                "id": 1,
                "runs": [
                    {"text": "Lorem i", "metadata": {}},
                    {"text": "psu", "metadata": {"bold": true}},
                    {"text": "m", "metadata": {}}
                ]
            },
            {
                "id": 2,
                "runs": [
                    {"text": "dolor sit amet", "metadata": {}}
                ]
            }
        ]
    }

We want to index these documents in Elasticsearch in such a way that it can answer the following queries:

  1. Query: dolor sit

    Expected answer: in the document with title="De finibus", in the paragraph with id=2, from the 1st character of the 1st run to the 9th character of the 1st run

  2. Query: ipsum

    Expected answer: in the document with title="De finibus", in the paragraph with id=1, from the 7th character of the 1st run to the 1st character of the 3rd run

  3. Query: ipsum dolor

    Expected answer: in the document with title="De finibus", from the 7th character of the 1st run of the paragraph with id=1 to the 5th character of the 1st run of the paragraph with id=2
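
To make the expected answers concrete, here is a minimal Python sketch that reproduces them with a plain substring search. It is not Elasticsearch and says nothing about how the index should look; it only illustrates the offset arithmetic. The names `PARAGRAPHS`, `flatten`, `position` and `locate` are hypothetical, and joining paragraphs with a single space is an assumption made so that a query can span two paragraphs.

```python
from bisect import bisect_right

# Sample data from the post: (paragraph id, run texts in order).
PARAGRAPHS = [
    (1, ["Lorem i", "psu", "m"]),
    (2, ["dolor sit amet"]),
]


def flatten(paragraphs, separator=" "):
    """Join all runs into one string and record where each run starts.

    Returns (text, starts, owners): starts[k] is the 0-based offset in text
    of the k-th non-empty run, and owners[k] is (paragraph_id, run_number),
    with run_number counted from 1.
    """
    pieces, starts, owners = [], [], []
    offset = 0
    for i, (paragraph_id, runs) in enumerate(paragraphs):
        if i > 0:
            # Assumption: one separator between paragraphs; it belongs to no run.
            pieces.append(separator)
            offset += len(separator)
        for run_number, run_text in enumerate(runs, start=1):
            if not run_text:
                continue  # empty runs cannot contain a match boundary
            starts.append(offset)
            owners.append((paragraph_id, run_number))
            pieces.append(run_text)
            offset += len(run_text)
    return "".join(pieces), starts, owners


def position(offset, starts, owners):
    """Map a 0-based offset in the flat text to (paragraph id, run number, 1-based char).

    Only meaningful for offsets that fall inside a run, not on a separator.
    """
    k = bisect_right(starts, offset) - 1
    paragraph_id, run_number = owners[k]
    return paragraph_id, run_number, offset - starts[k] + 1


def locate(query, paragraphs):
    """Return (start, end) positions of the first occurrence of query, or None."""
    text, starts, owners = flatten(paragraphs)
    begin = text.find(query)
    if begin < 0:
        return None
    end = begin + len(query) - 1  # inclusive end, as in the expected answers
    return position(begin, starts, owners), position(end, starts, owners)
```

Each position is a `(paragraph id, run number, character)` triple with the run number and character counted from 1, so `locate("ipsum", PARAGRAPHS)` corresponds to "from the 7th character of the 1st run to the 1st character of the 3rd run" in paragraph 1.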

I am familiar with the nested field type in Elasticsearch, and it may satisfy the first query. But how should we map our documents so that consecutive runs and paragraphs are connected and Elasticsearch can answer the latter two queries?
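
One direction that might be worth trying, sketched here under assumptions and not verified against a cluster: instead of nesting runs, index the concatenated text of the whole document in a single `text` field, and keep a table of run start offsets next to it purely as stored data. The field names `text` and `run_offsets`, the `<hit>` tags and the helper `search_body` are all hypothetical; the bodies are written as Python dicts so they can be passed to whichever client is in use.

```python
# Hypothetical index body: field names are illustrative, not a confirmed recipe.
INDEX_BODY = {
    "mappings": {
        "properties": {
            "title": {"type": "keyword"},
            # All runs of all paragraphs concatenated, paragraphs separated by one space.
            "text": {"type": "text", "term_vector": "with_positions_offsets"},
            # Kept in _source only; "enabled": False asks Elasticsearch not to index it.
            "run_offsets": {"type": "object", "enabled": False},
        }
    }
}

# What the sample document from the post might look like once flattened.
# "start" is the 0-based offset of the run inside "text".
INDEXED_SAMPLE = {
    "title": "De finibus",
    "text": "Lorem ipsum dolor sit amet",
    "run_offsets": [
        {"paragraph": 1, "run": 1, "start": 0},
        {"paragraph": 1, "run": 2, "start": 7},
        {"paragraph": 1, "run": 3, "start": 10},
        {"paragraph": 2, "run": 1, "start": 12},
    ],
}


def search_body(query):
    """Build a phrase query that also asks for the whole field back, highlighted."""
    return {
        "query": {"match_phrase": {"text": query}},
        "highlight": {
            "pre_tags": ["<hit>"],
            "post_tags": ["</hit>"],
            # number_of_fragments 0 returns the full field content, not snippets.
            "fields": {"text": {"number_of_fragments": 0}},
        },
    }
```

With this layout the run and paragraph boundaries no longer matter at query time, so a phrase such as "ipsum dolor" can match across them. The client would still have to turn the highlighted fragment into character offsets and look those up in `run_offsets`, and highlighters typically mark individual terms, so adjacent marks may need merging.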

NOTE: I also asked this question on Stack Overflow (see here).
