I've indexed a whole load of .docx files: each file is split up into (overlapping) sequences of 10 paragraphs while parsing. So each LDoc (Lucene document) in the index consists of 10 paragraphs of text.
These documents contain various languages. At the moment I am trying to tackle correctly indexing English and Greek, which is far easier than the next step, i.e. distinguishing Latin-script non-English languages. Greek characters can obviously be identified from their Unicode blocks.
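To make that concrete, here's a minimal sketch of the kind of script check I mean (the exact ranges to accept as "Latin" are a judgment call; here I simply treat any non-Greek letter as Latin, which is fine for my English/Greek corpus but would misfire on, say, Cyrillic):

```python
def detect_script(text: str) -> str:
    """Classify text as 'greek', 'latin', or 'mixed' by Unicode block.

    Greek and Coptic: U+0370-U+03FF; Greek Extended (precomposed
    polytonic forms): U+1F00-U+1FFF.
    """
    def is_greek(ch: str) -> bool:
        return "\u0370" <= ch <= "\u03ff" or "\u1f00" <= ch <= "\u1fff"

    letters = [ch for ch in text if ch.isalpha()]
    has_greek = any(is_greek(ch) for ch in letters)
    has_latin = any(not is_greek(ch) for ch in letters)
    if has_greek and has_latin:
        return "mixed"
    return "greek" if has_greek else "latin"
```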
So I have a field "text_content", which is the completely unchanged text.
Then I have a field "latin_normalised_content" with a stemmed sub-field "latin_normalised_content.english_stemmed". All the non-Latin characters in this field are replaced with a placeholder character ("?").
(By "normalising" here I mean stripping of accents/diacritics, which for Greek and most Latin languages is usually going to be appropriate for search purposes).
And I also have a field "greek_normalised_content" with a stemmed sub-field "greek_normalised_content.greek_stemmed". All the non-Greek characters in this field are replaced with the same placeholder character ("?").
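The masking itself is trivial; a sketch (whether to keep digits and punctuation rather than masking them too is a choice — here I keep whitespace only, which is an assumption on my part):

```python
def mask_non_greek(text: str, placeholder: str = "?") -> str:
    """Replace every character outside the Greek Unicode blocks with a
    placeholder, preserving the original string length. Whitespace is
    kept so tokenisation still sees word boundaries."""
    def keep(ch: str) -> bool:
        return (
            "\u0370" <= ch <= "\u03ff"
            or "\u1f00" <= ch <= "\u1fff"
            or ch.isspace()
        )
    return "".join(ch if keep(ch) else placeholder for ch in text)
```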
After normalising the query and identifying whether it is a Latin-script or a Greek-script query, I am getting beautiful multi-coloured highlighting using the Fast Vector Highlighter. Currently I don't permit a query that mixes Latin and Greek script (after struggling with various permutations of the query dict for quite some time).
So at the moment things look like this:
field = (
    "greek_normalised_content.greek_stemmed"
    if self.is_greek
    else "latin_normalised_content.english_stemmed"
)

data = {
    "query": {
        "simple_query_string": {
            "query": self.query_text,
            "fields": [field],
        }
    },
    "highlight": {
        "number_of_fragments": 0,
        "fields": {
            field: {
                "matched_fields": [field],
                "type": "fvh",
                "pre_tags": [
                    '<span style="background-color: yellow">',
                    '<span style="background-color: skyblue">',
                    '<span style="background-color: lightgreen">',
                    '<span style="background-color: plum">',
                    '<span style="background-color: lightcoral">',
                    '<span style="background-color: silver">',
                ],
                "post_tags": ["</span>"] * 6,
            }
        }
    },
}
(NB the results are then post-processed to put the HTML span tags back into the original, non-normalised text, so I can display the non-normalised text to the user, with beautiful highlighting. That's the reason for the placeholder characters: they keep the "Latin-only" and "Greek-only" strings the same length as the unmodified content string.)
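Because the masked string is character-for-character the same length as the original, this transplant is purely positional: walk the highlighted fragment, and for every run of plain text between tags, copy the same number of characters from the original instead. A sketch of the idea (hypothetical helper, simplified):

```python
import re

SPAN_RE = re.compile(r"</?span[^>]*>")

def transplant_tags(highlighted: str, original: str) -> str:
    """Re-home <span> tags from a highlighted, masked fragment into the
    original (non-normalised) text of identical tag-free length."""
    out = []
    pos = 0   # offset into the tag-free text / the original
    last = 0  # offset into the highlighted fragment
    for m in SPAN_RE.finditer(highlighted):
        plain_len = m.start() - last
        out.append(original[pos:pos + plain_len])
        out.append(m.group(0))
        pos += plain_len
        last = m.end()
    out.append(original[pos:])
    return "".join(out)
```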
What I want to achieve is this: for a mixed Latin-script/Greek-script query, the scores from the two queries (on latin_normalised_content.english_stemmed and on greek_normalised_content.greek_stemmed) are somehow combined ... and, crucially, the highlighting from the two is also combined.
I am aware that "highlight" --> "fields" --> "matched_fields" can be a list with more than one field. But none of my attempts so far has succeeded.
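To be concrete about the shape I mean: something like the following, where the query is split by script, the scores are combined via bool/should, and highlighting is requested on both fields (a sketch only — the bool/should framing is my assumption, and the highlighting half is exactly the part I can't get to work; each field would come back with its own highlighted string, which I'd then need to merge client-side using the length-preserving trick above):

```python
def build_mixed_query(greek_part: str, latin_part: str) -> dict:
    """Hypothetical request body for a mixed-script query: one
    simple_query_string per script, highlighting on both fields.
    Highlight options (pre/post tags etc.) trimmed for brevity."""
    field_greek = "greek_normalised_content.greek_stemmed"
    field_latin = "latin_normalised_content.english_stemmed"
    return {
        "query": {
            "bool": {
                "should": [
                    {"simple_query_string": {
                        "query": greek_part, "fields": [field_greek]}},
                    {"simple_query_string": {
                        "query": latin_part, "fields": [field_latin]}},
                ]
            }
        },
        "highlight": {
            "number_of_fragments": 0,
            "fields": {
                field_greek: {"type": "fvh"},
                field_latin: {"type": "fvh"},
            },
        },
    }
```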
Initially I had just one field, "normalised_content", with an English stemmer sub-field and a Greek stemmer sub-field attached to it (and no placeholders). The problem with that is that a Greek word entered in the query also appears, unchanged, in the English-stemmed sub-field ... so Greek words which should all be the same colour would tend to get two different colours ... horrible.
I've also seen and tried the "combined_fields" query and the "multi_match" query. No success so far.