Splitting Content

I'm looking for some input on how to solve a specific issue. I have an idea already, but I'm worried I'm getting a bit off track by just adding more and more analyzers to a single field.

Our index contains documents in XML format:

<document>
    <text>
      Lorem ipsum dolor sit amet consectetur adipisicing elit.
    </text>
    <footnotes>
      Lorem ipsum dolor sit amet consectetur adipisicing elit.
    </footnotes>
</document>

These documents can be in one of four languages (or a mixture of several) and are analyzed with language-specific analyzers:

"content": {
	"type": "text",
	"analyzer": "html_exact",
	"fields": {
		"de": {
			"type": "text",
			"analyzer": "our_german_analyzer"
		},
		"en": {
			"type": "text",
			"analyzer": "our_english_analyzer"
		},
		"fr": {
			"type": "text",
			"analyzer": "our_french_analyzer"
		},
		"it": {
			"type": "text",
			"analyzer": "our_italian_analyzer"
		}
	}
}

I've cut out some parts and configurations which I think are unimportant (other than maybe to show my potential overuse of subfields with different analyzers). I have now received a requirement that some customers would like to optionally exclude the footnotes from their searches.

The way I see it, I can either:

  • Split the document apart before indexing and index the data in two completely separate fields. This would probably create some overhead for our highlighting (especially because we have some custom logic there).

  • Create two pattern tokenizers that extract the text and the footnotes respectively. I could then create dedicated content subfields for text and footnotes:

    "content": {
      "type": "text",
      "analyzer": "html_exact",
      "fields": {
        "de_text": {
          "type": "text",
          "analyzer": "our_german_text_analyzer"
        },
        "de_footnotes": {
          "type": "text",
          "analyzer": "our_german_footnote_analyzer"
        },
        "en_text": {
          "type": "text",
          "analyzer": "our_english_text_analyzer"
        },
        "en_footnotes": {
          "type": "text",
          "analyzer": "our_english_footnote_analyzer"
        },
        "fr_text": {
          "type": "text",
          "analyzer": "our_french_text_analyzer"
        },
        "fr_footnotes": {
          "type": "text",
          "analyzer": "our_french_footnote_analyzer"
        },
        "it_text": {
          "type": "text",
          "analyzer": "our_italian_text_analyzer"
        },
        "it_footnotes": {
          "type": "text",
          "analyzer": "our_italian_footnote_analyzer"
        }
      }
    }
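
To make the first option more concrete, here is a minimal Python sketch of what the pre-indexing split could look like. It assumes the documents are well-formed XML with the <text> and <footnotes> elements shown above. The target field names content_text and content_footnotes are hypothetical, and the sketch drops any inline markup, which may not be acceptable if highlighting has to map back to the original source.

```python
import xml.etree.ElementTree as ET


def split_document(xml_source: str) -> dict:
    """Return the body text and the footnotes of one document as separate strings."""
    root = ET.fromstring(xml_source)

    def section_text(tag: str) -> str:
        # Collect the text of every <tag> element and collapse runs of whitespace.
        parts = []
        for element in root.iter(tag):
            parts.append(" ".join("".join(element.itertext()).split()))
        return " ".join(part for part in parts if part)

    # Hypothetical field names; they are not part of the mapping shown above.
    return {
        "content_text": section_text("text"),
        "content_footnotes": section_text("footnotes"),
    }


sample = """<document>
    <text>
      Lorem ipsum dolor sit amet.
    </text>
    <footnotes>
      Consectetur adipisicing elit.
    </footnotes>
</document>"""

fields = split_document(sample)
print(fields)
```

The resulting dictionary could be merged into the document that gets sent to the index, with each field then carrying its own set of language subfields.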
    

I'm leaning toward the second option, since with the fast vector highlighter (fvh) and position offsets any highlighting unification would be done automatically, but I'm afraid I'm going off the deep end with custom analyzers. Does anyone have experience or feedback in this direction? Is there any advantage to one of these solutions, or is there another approach I haven't thought of?
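
For reference, this is roughly how the footnote toggle might look at query time with the second option. It is only a sketch: it assumes the subfield naming scheme content.<language>_text and content.<language>_footnotes from the mapping above, and build_query and include_footnotes are hypothetical names. The body it returns uses the standard multi_match query.

```python
LANGUAGES = ("de", "en", "fr", "it")


def build_query(search_terms: str, include_footnotes: bool = True) -> dict:
    """Build a multi_match search body over the per-language subfields."""
    # Leaving out the *_footnotes subfields is what disables footnote search.
    sections = ["text", "footnotes"] if include_footnotes else ["text"]
    fields = [
        f"content.{lang}_{section}"
        for lang in LANGUAGES
        for section in sections
    ]
    return {"query": {"multi_match": {"query": search_terms, "fields": fields}}}


body = build_query("lorem ipsum", include_footnotes=False)
print(body["query"]["multi_match"]["fields"])
```

With either option the toggle comes down to which fields are listed in the query, so the difference between the two is mostly in indexing and highlighting, not in the search request itself.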
