Splitting Content

NashSLX · October 22, 2019, 3:08pm

I'm looking for some input on how to solve a specific issue. I have an idea already, but I'm worried I'm getting a bit off track by just adding more and more analyzers to a single field.

Our index contains documents in XML format:

<document>
    <text>
      Lorem ipsum dolor sit amet consectetur adipisicing elit.
    </text>
    <footnotes>
      Lorem ipsum dolor sit amet consectetur adipisicing elit.
    </footnotes>
</document>

These documents can be in one of four languages (or a mixture of several) and are analyzed with language-specific analyzers.

"content": {
  "type": "text",
  "analyzer": "html_exact",
  "fields": {
    "de": {
      "type": "text",
      "analyzer": "our_german_analyzer"
    },
    "en": {
      "type": "text",
      "analyzer": "our_english_analyzer"
    },
    "fr": {
      "type": "text",
      "analyzer": "our_french_analyzer"
    },
    "it": {
      "type": "text",
      "analyzer": "our_italian_analyzer"
    }
  }
}

I've cut out some parts and configurations that I think are unimportant (other than maybe to show my potential overuse of subfields with different analyzers). I have now received a requirement that some customers would like to optionally disable searching through the footnotes.

The way I see it, I can either:

1. Split the document apart before indexing and index the data in two completely separate fields. This would probably create some overhead for our highlighting (especially because we have some custom logic there).

2. Create two pattern tokenizers that filter out text and footnotes respectively. I could then create specific content subfields for text and footnotes:

"content": {
  "type": "text",
  "analyzer": "html_exact",
  "fields": {
    "de_text": {
      "type": "text",
      "analyzer": "our_german_text_analyzer"
    },
    "de_footnotes": {
      "type": "text",
      "analyzer": "our_german_footnote_analyzer"
    },
    "en_text": {
      "type": "text",
      "analyzer": "our_english_text_analyzer"
    },
    "en_footnotes": {
      "type": "text",
      "analyzer": "our_english_footnote_analyzer"
    },
    "fr_text": {
      "type": "text",
      "analyzer": "our_french_text_analyzer"
    },
    "fr_footnotes": {
      "type": "text",
      "analyzer": "our_french_footnote_analyzer"
    },
    "it_text": {
      "type": "text",
      "analyzer": "our_italian_text_analyzer"
    },
    "it_footnotes": {
      "type": "text",
      "analyzer": "our_italian_footnote_analyzer"
    }
  }
}
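
To make the second option more concrete, here is a rough sketch of what the index settings for the German pair of analyzers might look like. It is only an illustration under several assumptions: it swaps the pattern tokenizers for `pattern_replace` char filters (so the tokenizer and token filters can stay the same as in the existing language analyzers), the char filter names `drop_footnotes` and `drop_text` are hypothetical, and the `standard` tokenizer with `lowercase` and `german_normalization` is only a stand-in for whatever `our_german_analyzer` really does.

```json
{
  "settings": {
    "analysis": {
      "char_filter": {
        "drop_footnotes": {
          "type": "pattern_replace",
          "pattern": "(?s)<footnotes\\b[^>]*>.*?</footnotes\\s*>",
          "replacement": ""
        },
        "drop_text": {
          "type": "pattern_replace",
          "pattern": "(?s)<text\\b[^>]*>.*?</text\\s*>",
          "replacement": ""
        }
      },
      "analyzer": {
        "our_german_text_analyzer": {
          "type": "custom",
          "char_filter": ["drop_footnotes", "html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "german_normalization"]
        },
        "our_german_footnote_analyzer": {
          "type": "custom",
          "char_filter": ["drop_text", "html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "german_normalization"]
        }
      }
    }
  }
}
```

The char filters run in the listed order, so the unwanted element is removed first and `html_strip` then drops the remaining tags. One caveat: the Elasticsearch documentation warns that a `pattern_replace` replacement which changes the length of the text can lead to incorrect highlighting, so the offsets would need to be checked against real documents before relying on this.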

I'm leaning toward the second option, since with the fast vector highlighter (fvh) and position offsets, any unification of highlights would be done automatically, but I'm afraid I'm going off the deep end with custom analyzers. Does anyone have experience or feedback in this direction? Is there any advantage to one of these solutions, or is there another approach I haven't thought of?
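
For illustration, a search that skips the footnotes under the second option might look roughly like the request body below. This is a sketch, not a tested query: it assumes the subfields from the second mapping exist, that `content` and every subfield listed in `matched_fields` are indexed with `"term_vector": "with_positions_offsets"` (required by fvh, and omitted from the mappings shown here), and the query string is a placeholder.

```json
{
  "query": {
    "multi_match": {
      "query": "lorem ipsum",
      "type": "most_fields",
      "fields": [
        "content.de_text",
        "content.en_text",
        "content.fr_text",
        "content.it_text"
      ]
    }
  },
  "highlight": {
    "fields": {
      "content": {
        "type": "fvh",
        "matched_fields": [
          "content.de_text",
          "content.en_text",
          "content.fr_text",
          "content.it_text"
        ]
      }
    }
  }
}
```

Re-enabling footnote search for a customer would then only mean adding the `*_footnotes` subfields to both lists, with no change to the indexed documents.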

