I use the word "document" here in the sense of "Lucene Document" or LDoc, i.e. the thing which gets put into the index, analysed, etc.
I'm parsing and then indexing a whole load of .docx and .docm text files in a directory tree. To do that I'm dividing them up into blocks of 10 paragraphs (overlapping). Each 10-paragraph block constitutes an LDoc. I'm creating the index using the _bulk endpoint.
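To make the overlapping-block idea concrete, here is a minimal sketch of the splitting step. The function name `overlapping_blocks` and the `step` parameter are assumptions for illustration only; the amount of overlap isn't fixed by anything above, so it is left to the caller.

```rust
/// Split `paragraphs` into blocks of at most `size` paragraphs, starting a new
/// block every `step` paragraphs. With `step < size`, consecutive blocks overlap.
/// (Hypothetical helper: names and the `step` parameter are illustrative.)
fn overlapping_blocks(paragraphs: &[&str], size: usize, step: usize) -> Vec<String> {
    assert!(size > 0 && step > 0, "size and step must be non-zero");
    let mut blocks = Vec::new();
    let mut start = 0;
    while start < paragraphs.len() {
        // The last block may be shorter than `size`.
        let end = (start + size).min(paragraphs.len());
        // Each block becomes the text of one LDoc.
        blocks.push(paragraphs[start..end].join("\n"));
        if end == paragraphs.len() {
            break; // the tail is already covered, so stop here
        }
        start += step;
    }
    blocks
}
```

Calling `overlapping_blocks(&paras, 10, 5)` would give 10-paragraph blocks in which each block shares 5 paragraphs with the previous one.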
There's quite a lot of non-English text here. In a later stage I shall attempt to use a language analyser module to try and identify non-English languages using Latin script. At the moment I'm scratching my head over how to handle LDocs where the string to be entered contains Greek script.
So one such LDoc text is like this:
"After the loyal things happened pledge was taken, said Klearkhos"
As soon as the pledge was taken, Clearchus spoke:
--ἄγε δή, ὦ Ἀριαῖε, ἐπείπερ ὁ αὐτὸς ὑμῖν στόλος ἐστὶ καὶ ἡμῖν, εἰπὲ τίνα γνώμην ἔχεις περὶ τῆς πορείας, πότερον ἄπιμεν ἥνπερ ἤλθομεν ἢ ἄλλην τινὰ ἐννενοηκέναι δοκεῖς ὁδὸν κρείττω.
ἄγω ἄγε: 2s pres. act. imperative "command!"
ἄγε interjection: come on; let's go; ἄγε δή: "so" {seemingly}
ἐπείπερ conj.: "seeing that"
στόλος: expedition; army; fleet; troop
γνώμη: sign; mark; mind; intelligence; judgment; understanding; will; opinion
ἔχεις: 2s pai
περὶ prep.: (+ gen.) about; concerning; because of
Examining the results returned from an (English) stemmer query on an (English) stemmed version of the field, I find this is returned for a search on "Klearkhos":
"loyal things happened pledge was taken, said <span style=\"background-color: yellow\">Klearkhos</span>\"\nAs soon as the pledge was taken, Clearchus spoke:"
(NB I'm using a highlighter, hence the span)
At first I thought that the stemmer, on encountering non-Latin text, might simply have hung up the phone and decided the rest of the LDoc text isn't worth bothering with. (NB I'm not clear why the beginning, "After the", hasn't been included...).
Actually it turns out that it isn't doing that. A search on "intelligence judgment expedition" returns results including this:
"that expedition; army; fleet; troop: sign; mark; mind; intelligence; judgment; understanding;"
(highlighting tags omitted...)
So in fact the stemmer function appears to be dividing the submitted text into a number of different, very small, LDocs. This probably isn't the ideal way to handle these LDoc texts.
I think the best thing is probably to strip out the Greek script and just stem the remaining English. But I want the _source field to contain the whole text, regardless.
I can strip out the Greek text in my (Rust) module by detecting non-Latin characters. But how can I tell the ES server to use, for stemming purposes, a different text from the one submitted for the "whole text"?
PS naturally I'd then think about stripping out all the English and stemming all the Greek text in a given LDoc using a Greek stemmer...
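For reference, the stripping step on the Rust side (and its mirror image from the PS) could look roughly like the sketch below. It assumes that "Greek" means the Unicode "Greek and Coptic" and "Greek Extended" blocks, and it works on whitespace-separated tokens rather than single characters, so punctuation attached to a Greek word travels with it and line breaks are collapsed to spaces. The names `is_greek` and `split_scripts` are hypothetical.

```rust
/// True for characters in the Unicode "Greek and Coptic" (U+0370..=U+03FF)
/// and "Greek Extended" (U+1F00..=U+1FFF) blocks.
/// Assumption: these two blocks cover all the Greek in the corpus.
fn is_greek(c: char) -> bool {
    matches!(c, '\u{0370}'..='\u{03FF}' | '\u{1F00}'..='\u{1FFF}')
}

/// Split `text` into (text without Greek tokens, Greek tokens only).
/// A token counts as Greek if it contains at least one Greek character,
/// so attached punctuation such as "--" or ":" stays with the Greek word.
fn split_scripts(text: &str) -> (String, String) {
    let (greek, other): (Vec<&str>, Vec<&str>) = text
        .split_whitespace()
        .partition(|token| token.chars().any(is_greek));
    // Tokens are re-joined with single spaces; original line breaks are lost.
    (other.join(" "), greek.join(" "))
}
```

The first string is what I'd want the English stemmer to see, the second is what a Greek stemmer would see, and the untouched original is what I want to end up in _source.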