Index openxml structured documents with Elasticsearch

vhd.afr · November 5, 2020, 5:53pm

We have a set of structured documents whose structure is heavily inspired by the OpenXML data model. Briefly, each document is made up of an ordered set of paragraphs; each paragraph has an id and an ordered set of runs; and each run has textual content and some metadata.

For instance, the following sample document contains two paragraphs, "Lorem ipsum" and "dolor sit amet".

    {
        "title": "De finibus",
        "paragraphs": [
            {
                "id": 1,
                "runs": [
                    {"text": "Lorem i", "metadata": {}},
                    {"text": "psu", "metadata": {"bold": true}},
                    {"text": "m", "metadata": {}}
                ]
            },
            {
                "id": 2,
                "runs": [
                    {"text": "dolor sit amet", "metadata": {}}
                ]
            }
        ]
    }
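To make the structure concrete, here is a minimal Python sketch of the same model. The class names (`Run`, `Paragraph`, `Document`) are hypothetical and only for illustration; they are not part of OpenXML or Elasticsearch. It assumes the title belongs to the document and the first paragraph has id 1, as the expected answers further down imply. Joining the runs of each paragraph recovers the two paragraph texts.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Run:
    text: str
    metadata: dict = field(default_factory=dict)


@dataclass
class Paragraph:
    id: int
    runs: List[Run]

    def text(self) -> str:
        # A paragraph's visible text is its runs concatenated in order.
        return "".join(run.text for run in self.runs)


@dataclass
class Document:
    title: str
    paragraphs: List[Paragraph]


# The sample document (assumption: title on the document, id 1 on the first paragraph).
doc = Document(
    title="De finibus",
    paragraphs=[
        Paragraph(id=1, runs=[Run("Lorem i"), Run("psu", {"bold": True}), Run("m")]),
        Paragraph(id=2, runs=[Run("dolor sit amet")]),
    ],
)

paragraph_texts = [p.text() for p in doc.paragraphs]
```

Note that the word "ipsum" is split across three runs, which is why searching run by run is not enough.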

We want to index these documents in Elasticsearch in such a way that it can answer the following queries:

Query: dolor sit

Expected answer: in the document with title="De finibus", in the paragraph with id=2, from the 1st character of the 1st run to the 9th character of the 1st run

Query: ipsum

Expected answer: in the document with title="De finibus", in the paragraph with id=1, from the 7th character of the 1st run to the 1st character of the 3rd run

Query: ipsum dolor

Expected answer: in the document with title="De finibus", from the 7th character of the 1st run of the paragraph with id=1 to the 5th character of the 1st run of the paragraph with id=2
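To spell out how these positions are meant to be read, here is a rough Python sketch that reproduces the three expected answers outside Elasticsearch. It is not an Elasticsearch mapping or query, only an illustration of the offset arithmetic. The helper names (`flatten`, `to_position`, `locate`) are hypothetical, and it assumes that paragraphs are joined by a single space, that run and character numbers are 1-based, and that the first paragraph has id 1.

```python
from bisect import bisect_right

# Sample document as a plain dict (assumption: first paragraph has id 1).
DOC = {
    "title": "De finibus",
    "paragraphs": [
        {"id": 1, "runs": [{"text": "Lorem i"}, {"text": "psu"}, {"text": "m"}]},
        {"id": 2, "runs": [{"text": "dolor sit amet"}]},
    ],
}


def flatten(doc, sep=" "):
    """Concatenate all runs and record the flat offset where each run starts."""
    parts, starts, owners = [], [], []
    pos = 0
    for p_index, para in enumerate(doc["paragraphs"]):
        if p_index > 0:  # assumption: one separator between paragraphs
            parts.append(sep)
            pos += len(sep)
        for run_no, run in enumerate(para["runs"], start=1):
            starts.append(pos)
            owners.append((para["id"], run_no))
            parts.append(run["text"])
            pos += len(run["text"])
    return "".join(parts), starts, owners


def to_position(offset, starts, owners):
    """Map a 0-based flat offset to (paragraph id, run number, 1-based character)."""
    i = bisect_right(starts, offset) - 1
    para_id, run_no = owners[i]
    return para_id, run_no, offset - starts[i] + 1


def locate(doc, query):
    """Return (start, end) positions of the first match, or None if absent."""
    text, starts, owners = flatten(doc)
    begin = text.find(query)
    if begin < 0:
        return None
    end = begin + len(query) - 1  # offset of the last matched character
    return to_position(begin, starts, owners), to_position(end, starts, owners)
```

For example, `locate(DOC, "ipsum")` gives `((1, 1, 7), (1, 3, 1))`, i.e. from the 7th character of run 1 to the 1st character of run 3 in paragraph 1. The open question is how to get the equivalent information out of an Elasticsearch index.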

I am familiar with the nested field type in Elasticsearch. It may satisfy the first query. But how should we map our documents so that consecutive runs and paragraphs are connected and Elasticsearch can answer the latter two queries?

NOTE: I also asked this question on Stack Overflow (see here).
