I've indexed a whole load of .docx files: each file is split up into (overlapping) sequences of 10 paragraphs while parsing. So each LDoc (Lucene document) in the index consists of 10 paragraphs of text.
These documents contain various languages. At the moment I am trying to tackle correctly indexing English and Greek, which is far easier than the next step, i.e. distinguishing Latin-script non-English languages. Greek characters can obviously be identified from their Unicode blocks.
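To make that concrete, here's a minimal sketch of the kind of script check I mean (the exact ranges to accept as "Latin" are a judgment call; here I simply treat any non-Greek letter as Latin, which is fine for my English/Greek corpus but would misfire on, say, Cyrillic):

```python
def detect_script(text: str) -> str:
    """Classify text as 'greek', 'latin', or 'mixed' by Unicode block.

    Greek and Coptic: U+0370-U+03FF; Greek Extended (precomposed
    polytonic forms): U+1F00-U+1FFF.
    """
    def is_greek(ch: str) -> bool:
        return "\u0370" <= ch <= "\u03ff" or "\u1f00" <= ch <= "\u1fff"

    letters = [ch for ch in text if ch.isalpha()]
    has_greek = any(is_greek(ch) for ch in letters)
    has_latin = any(not is_greek(ch) for ch in letters)
    if has_greek and has_latin:
        return "mixed"
    return "greek" if has_greek else "latin"
```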
So I have a field "text_content", which is the completely unchanged text.
Then I have a field "latin_normalised_content" with a stemmed sub-field "latin_normalised_content.english_stemmed". All the non-Latin characters in this field are replaced with a placeholder character ("?").
(By "normalising" here I mean stripping of accents/diacritics, which for Greek and most Latin languages is usually going to be appropriate for search purposes).
And I also have a field "greek_normalised_content" with a stemmed sub-field "greek_normalised_content.greek_stemmed". All the non-Greek characters in this field are replaced with the same placeholder character ("?").
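The masking itself is trivial; a sketch (whether to keep digits and punctuation rather than masking them too is a choice — here I keep whitespace only, which is an assumption on my part):

```python
def mask_non_greek(text: str, placeholder: str = "?") -> str:
    """Replace every character outside the Greek Unicode blocks with a
    placeholder, preserving the original string length. Whitespace is
    kept so tokenisation still sees word boundaries."""
    def keep(ch: str) -> bool:
        return (
            "\u0370" <= ch <= "\u03ff"
            or "\u1f00" <= ch <= "\u1fff"
            or ch.isspace()
        )
    return "".join(ch if keep(ch) else placeholder for ch in text)
```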
After normalising the query and identifying whether it is a Latin-script or a Greek-script query, I am getting beautiful multi-coloured highlighting using the Fast Vector Highlighter. Currently I don't permit a query that mixes Latin and Greek script (after struggling with various permutations of the query dict for quite some time).
So at the moment things look like this:
field = (
    "greek_normalised_content.greek_stemmed"
    if self.is_greek
    else "latin_normalised_content.english_stemmed"
)

data = {
    "query": {
        "simple_query_string": {
            "query": self.query_text,
            "fields": [field],
        }
    },
    "highlight": {
        "number_of_fragments": 0,
        "fields": {
            field: {
                "matched_fields": [field],
                "type": "fvh",
                "pre_tags": [
                    '<span style="background-color: yellow">',
                    '<span style="background-color: skyblue">',
                    '<span style="background-color: lightgreen">',
                    '<span style="background-color: plum">',
                    '<span style="background-color: lightcoral">',
                    '<span style="background-color: silver">',
                ],
                "post_tags": ["</span>"] * 6,
            }
        }
    },
}
(NB the results are then post-processed to put the HTML span tags back into the original, non-normalised text, so I can display the non-normalised text to the user, with beautiful highlighting. That's the reason for the placeholder characters: they keep the "Latin-only" and "Greek-only" strings the same length as the unmodified content string.)
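Because the masked string is character-for-character the same length as the original, this transplant is purely positional: walk the highlighted fragment, and for every run of plain text between tags, copy the same number of characters from the original instead. A sketch of the idea (hypothetical helper, simplified):

```python
import re

SPAN_RE = re.compile(r"</?span[^>]*>")

def transplant_tags(highlighted: str, original: str) -> str:
    """Re-home <span> tags from a highlighted, masked fragment into the
    original (non-normalised) text of identical tag-free length."""
    out = []
    pos = 0   # offset into the tag-free text / the original
    last = 0  # offset into the highlighted fragment
    for m in SPAN_RE.finditer(highlighted):
        plain_len = m.start() - last
        out.append(original[pos:pos + plain_len])
        out.append(m.group(0))
        pos += plain_len
        last = m.end()
    out.append(original[pos:])
    return "".join(out)
```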
What I want to achieve is this: for a mixed Latin-script/Greek-script query, the scores from the two queries (on latin_normalised_content.english_stemmed and on greek_normalised_content.greek_stemmed) are somehow combined ... and, crucially, the highlighting from the two is also combined.
I am aware that "highlight" --> "fields" --> "matched_fields" can be a list with more than one field. But none of my attempts so far has succeeded.
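To be concrete about the shape I mean: something like the following, where the query is split by script, the scores are combined via bool/should, and highlighting is requested on both fields (a sketch only — the bool/should framing is my assumption, and the highlighting half is exactly the part I can't get to work; each field would come back with its own highlighted string, which I'd then need to merge client-side using the length-preserving trick above):

```python
def build_mixed_query(greek_part: str, latin_part: str) -> dict:
    """Hypothetical request body for a mixed-script query: one
    simple_query_string per script, highlighting on both fields.
    Highlight options (pre/post tags etc.) trimmed for brevity."""
    field_greek = "greek_normalised_content.greek_stemmed"
    field_latin = "latin_normalised_content.english_stemmed"
    return {
        "query": {
            "bool": {
                "should": [
                    {"simple_query_string": {
                        "query": greek_part, "fields": [field_greek]}},
                    {"simple_query_string": {
                        "query": latin_part, "fields": [field_latin]}},
                ]
            }
        },
        "highlight": {
            "number_of_fragments": 0,
            "fields": {
                field_greek: {"type": "fvh"},
                field_latin: {"type": "fvh"},
            },
        },
    }
```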
Initially I had just one field, "normalised_content", with an English stemmer sub-field and a Greek stemmer sub-field attached to it (and no placeholders). The problem with that is that a Greek word entered in the query also appears, unchanged, in the English-stemmed sub-field ... so Greek words which should all be the same colour would tend to get two different colours ... horrible.
I've also seen and tried the "combined_fields" query and the "multi_match" query. No success so far.